Functions Reference 

otherwise(value)[source]

End a CASE WHEN expression with default value.

Parameters:: value (Any)
Return type:: CaseWhen

over(window_spec)[source]

Apply window function over window specification.

Parameters:: window_spec (WindowSpec)
Return type:: WindowFunction

count()[source]

Count non-null values in this column.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the count operation.

avg()[source]

Average values in this column.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the avg function (PySpark-compatible).

sum()[source]

Sum values in this column.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the sum function (PySpark-compatible).

max()[source]

Maximum value in this column.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the max function (PySpark-compatible).

min()[source]

Minimum value in this column.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the min function (PySpark-compatible).

stddev()[source]

Standard deviation of values in this column.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the stddev function (PySpark-compatible).

variance()[source]

Variance of values in this column.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the variance function (PySpark-compatible).

bitwise_not()[source]

Bitwise NOT operation on this column.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the bitwise_not function.

class sparkless.functions.functions.ColumnOperation(column, operation, value=None, name=None)[source]

Bases: Column

Represents a column operation (comparison, arithmetic, etc.).

This class encapsulates column operations and their operands for evaluation during DataFrame operations. Inherits from Column to ensure isinstance() checks pass for PySpark compatibility.

Parameters:

column (Any)
operation (str)
value (Any)
name (Optional[str])

__init__(column, operation, value=None, name=None)[source]

Initialize ColumnOperation.

Parameters:

column (Any) – The column being operated on (can be None for some operations).
operation (str) – The operation being performed.
value (Any) – The value or operand for the operation.
name (Optional[str]) – Optional custom name for the operation.

property name: str: Get column name.

__str__()[source]

Generate SQL representation of this operation.

Return type:: str

alias(*alias_names)[source]

Create an alias for this operation. As in PySpark, several names may be passed for operations that produce more than one column (e.g. posexplode).

Parameters:: alias_names (str)
Return type:: ColumnOperation

getField(index_or_name)[source]

Access array element by index or struct field by name (PySpark getField).

Parameters:: index_or_name (Union[int, str])
Return type:: ColumnOperation

over(window_spec)[source]

Apply window function over window specification.

Parameters:: window_spec (WindowSpec)
Return type:: WindowFunction

class sparkless.functions.functions.Literal(value, data_type=None, resolver=None)[source]

Bases: IColumn

Literal value for DataFrame operations.

Represents a literal value that can be used in column expressions and transformations, maintaining compatibility with PySpark’s lit function.

Parameters:

value (Any)
data_type (Optional[DataType])
resolver (Optional[Callable[[], Any]])

__init__(value, data_type=None, resolver=None)[source]

Initialize Literal.

Parameters:

value (Any) – The literal value.
data_type (Optional[DataType]) – Optional data type. Inferred from value if not specified.
resolver (Optional[Callable[[], Any]]) – Optional callable that returns the resolved value at evaluation time. The resolver should handle session resolution internally.

property name: str: Get literal name.

__eq__(other)[source]

Equality comparison.

Note: Returns ColumnOperation instead of bool for PySpark compatibility.

Parameters:: other (Any)
Return type:: ColumnOperation

__ne__(other)[source]

Inequality comparison.

Note: Returns ColumnOperation instead of bool for PySpark compatibility.

Parameters:: other (Any)
Return type:: ColumnOperation
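
The note on `__eq__` and `__ne__` describes deferred comparison: the operator builds an expression object instead of returning a bool. The following is a minimal sketch of that pattern in plain Python. `Comparison` and `Lit` are hypothetical stand-ins invented for illustration, not the sparkless classes.

```python
from typing import Any


class Comparison:
    """Hypothetical stand-in for ColumnOperation: a comparison kept as data."""

    def __init__(self, left: Any, op: str, right: Any) -> None:
        self.left = left
        self.op = op
        self.right = right


class Lit:
    """Hypothetical stand-in for Literal (not the sparkless class)."""

    def __init__(self, value: Any) -> None:
        self.value = value

    def __eq__(self, other: Any) -> Comparison:  # type: ignore[override]
        # Build an expression object instead of comparing immediately.
        return Comparison(self, "==", other)

    def __ne__(self, other: Any) -> Comparison:  # type: ignore[override]
        return Comparison(self, "!=", other)

    # Overriding __eq__ sets __hash__ to None, so restore identity hashing.
    __hash__ = object.__hash__


expr = Lit(1) == 2  # a Comparison object, not False
```

In this sketch the result is an ordinary object and therefore truthy, so writing `if Lit(1) == 2:` would not test the values; the expression only means something once it is evaluated against data.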

__lt__(other)[source]

Less than comparison.

Parameters:: other (Any)
Return type:: IColumn

__le__(other)[source]

Less than or equal comparison.

Parameters:: other (Any)
Return type:: IColumn

__gt__(other)[source]

Greater than comparison.

Parameters:: other (Any)
Return type:: IColumn

__ge__(other)[source]

Greater than or equal comparison.

Parameters:: other (Any)
Return type:: IColumn

__add__(other)[source]

Addition operation.

Parameters:: other (Any)
Return type:: IColumn

__sub__(other)[source]

Subtraction operation.

Parameters:: other (Any)
Return type:: IColumn

__mul__(other)[source]

Multiplication operation.

Parameters:: other (Any)
Return type:: IColumn

__truediv__(other)[source]

Division operation.

Parameters:: other (Any)
Return type:: IColumn

__mod__(other)[source]

Modulo operation.

Parameters:: other (Any)
Return type:: IColumn

__and__(other)[source]

Logical AND operation.

Parameters:: other (Any)
Return type:: IColumn

__or__(other)[source]

Logical OR operation.

Parameters:: other (Any)
Return type:: IColumn

__invert__()[source]

Logical NOT operation.

Return type:: IColumn

__neg__()[source]

Unary minus operation (-literal).

Return type:: ColumnOperation

isnull()[source]

Check if literal value is null.

Return type:: ColumnOperation

isnotnull()[source]

Check if literal value is not null.

Return type:: ColumnOperation

isNull()[source]

Check if literal value is null (PySpark compatibility).

Return type:: ColumnOperation

isNotNull()[source]

Check if literal value is not null (PySpark compatibility).

Return type:: ColumnOperation

eqNullSafe(other)[source]

Null-safe equality comparison (PySpark eqNullSafe).

This behaves like PySpark’s eqNullSafe:

- If both sides are null, the comparison is True.
- If exactly one side is null, the comparison is False.
- Otherwise, it behaves like standard equality, including any backend-specific type coercion rules.

Parameters:: other (Any)
Return type:: ColumnOperation
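
The three rules above can be sketched for plain Python values, with None standing in for null. The helper name `eq_null_safe` is an assumption made for this example, and backend-specific type coercion is not modelled.

```python
from typing import Any


def eq_null_safe(left: Any, right: Any) -> bool:
    """Null-safe equality on scalar values (illustrative helper only)."""
    if left is None and right is None:
        return True   # both sides null
    if left is None or right is None:
        return False  # exactly one side null
    return left == right  # ordinary equality otherwise
```

Unlike ordinary SQL equality, this comparison never yields null, which is why it is useful in join conditions on nullable keys.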

isin(*values)[source]

Check if literal value is in list of values.

Parameters:: values (Any)
Return type:: ColumnOperation

between(lower, upper)[source]

Check if literal value is between lower and upper bounds.

Parameters:

lower (Any)
upper (Any)

Return type:

like(pattern)[source]

SQL LIKE pattern matching.

Parameters:: pattern (str)
Return type:: ColumnOperation

rlike(pattern)[source]

Regular expression pattern matching.

Parameters:: pattern (str)
Return type:: ColumnOperation

alias(name)[source]

Create an alias for the literal.

Parameters:: name (str)
Return type:: Literal

asc()[source]

Ascending sort order.

Return type:: ColumnOperation

desc()[source]

Descending sort order.

Return type:: ColumnOperation

cast(data_type)[source]

Cast literal to different data type.

Parameters:: data_type (Union[DataType, str])
Return type:: ColumnOperation

astype(data_type)[source]

Cast literal to different data type (alias for cast).

This method is an alias for cast() and matches PySpark’s API.

Parameters:: data_type (Union[DataType, str]) – The target data type (DataType object or string name).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the cast operation.

Example

>>> F.lit(1).astype("string")

when(condition, value)[source]

Start a CASE WHEN expression.

Parameters:

condition (ColumnOperation)
value (Any)

Return type:

otherwise(value)[source]

End a CASE WHEN expression with default value.

Parameters:: value (Any)
Return type:: Any

over(window_spec)[source]

Apply window function over window specification.

Parameters:: window_spec (Any)
Return type:: Any

class sparkless.functions.functions.AggregateFunction(column, function_name, data_type=None, ignorenulls=None)[source]

Bases: object

Base class for aggregate functions.

This class provides the base functionality for all aggregate functions including count, sum, avg, max, min, etc.

Parameters:

column (Union[Column, ColumnOperation, str, None])
function_name (str)
data_type (Optional[DataType])
ignorenulls (Optional[bool])

__init__(column, function_name, data_type=None, ignorenulls=None)[source]

Initialize AggregateFunction.

Parameters:

column (Union[Column, ColumnOperation, str, None]) – The column to aggregate (None for count(*)).
function_name (str) – Name of the aggregate function.
data_type (Optional[DataType]) – Optional return data type.
ignorenulls (Optional[bool]) – Optional flag to ignore nulls (for first/last functions).

property column_name: str: Get the column name for compatibility.

evaluate(data)[source]

Evaluate the aggregate function on the given data.

Parameters:: data (List[Dict[str, Any]]) – List of data rows to aggregate.
Return type:: Any
Returns:: The aggregated result.
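
As a rough illustration of aggregating a `List[Dict[str, Any]]`, the sketch below handles a few function names in plain Python. The function `evaluate_aggregate`, its dispatch on `function_name`, and the choice to return None for an empty input are assumptions for this example. It does not reproduce the library's implementation.

```python
from typing import Any, Dict, List, Optional


def evaluate_aggregate(
    data: List[Dict[str, Any]], column: Optional[str], function_name: str
) -> Any:
    """Aggregate one column over a list of row dicts (illustrative only)."""
    if function_name == "count" and column is None:
        return len(data)  # count(*): every row counts
    # Nulls (None) are skipped, matching "count non-null values".
    values = [row[column] for row in data if row.get(column) is not None]
    if function_name == "count":
        return len(values)
    if not values:
        return None  # assumption: other aggregates of nothing are null
    if function_name == "sum":
        return sum(values)
    if function_name == "avg":
        return sum(values) / len(values)
    if function_name == "max":
        return max(values)
    if function_name == "min":
        return min(values)
    raise ValueError(f"unsupported aggregate: {function_name}")
```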

over(window_spec)[source]

Apply window function over window specification.

Parameters:: window_spec (Any)
Return type:: WindowFunction

alias(name)[source]

Create an alias for this aggregate function.

Parameters:: name (str) – The alias name.
Return type:: AggregateFunction
Returns:: Self for method chaining.

cast(data_type)[source]

Cast the aggregate function result to a different data type.

Parameters:: data_type (Union[DataType, str]) – The target data type (DataType instance or string type name).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the cast operation.

Example

>>> F.mean(F.col("value")).cast("string")

__add__(other)[source]

Addition operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__sub__(other)[source]

Subtraction operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__mul__(other)[source]

Multiplication operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__truediv__(other)[source]

Division operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__mod__(other)[source]

Modulo operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__radd__(other)[source]

Reverse addition operation (for 2 + agg_func).

Parameters:: other (Any)
Return type:: ColumnOperation

__rsub__(other)[source]

Reverse subtraction operation (for 2 - agg_func).

Parameters:: other (Any)
Return type:: ColumnOperation

__rmul__(other)[source]

Reverse multiplication operation (for 2 * agg_func).

Parameters:: other (Any)
Return type:: ColumnOperation

__rtruediv__(other)[source]

Reverse division operation (for 2 / agg_func).

Parameters:: other (Any)
Return type:: ColumnOperation

__rmod__(other)[source]

Reverse modulo operation (for 2 % agg_func).

Parameters:: other (Any)
Return type:: ColumnOperation
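
The reverse operators exist because Python first tries `int.__add__` for an expression such as `2 + agg_func` and only falls back to the right operand's reflected method when that returns NotImplemented. The sketch below shows the mechanism with a hypothetical `Agg` class that records operations as tuples; the class and the tuple format are illustrative assumptions.

```python
from typing import Any, Tuple


class Agg:
    """Hypothetical aggregate expression that records arithmetic as tuples."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __sub__(self, other: Any) -> Tuple[str, Any, Any]:
        # agg - other: this object is the left operand.
        return ("-", self.name, other)

    def __rsub__(self, other: Any) -> Tuple[str, Any, Any]:
        # other - agg: Python calls this after int.__sub__ gives up,
        # so the operand order must be swapped back here.
        return ("-", other, self.name)


left_form = Agg("total") - 2
right_form = 2 - Agg("total")
```

Operand order matters for the non-commutative operators (subtraction, division, modulo), which is why the reflected methods cannot simply delegate to the forward ones.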

class sparkless.functions.functions.CaseWhen(column=None, condition=None, value=None)[source]

Bases: object

Represents a CASE WHEN expression.

This class handles complex conditional logic with multiple conditions and default values, similar to SQL CASE WHEN statements.

Parameters:

column (Any)
condition (Any)
value (Any)

__init__(column=None, condition=None, value=None)[source]

Initialize CaseWhen.

Parameters:

column (Any) – The column or expression being evaluated.
condition (Any) – The condition for this case.
value (Any) – The value to return if condition is true.

property else_value: Any: Get the else value (alias for default_value for compatibility).

when(condition, value)[source]

Add another WHEN condition.

Parameters:

condition (Any) – The condition to check.
value (Any) – The value to return if condition is true.

Return type:: CaseWhen
Returns:: Self for method chaining.

otherwise(value)[source]

Set the default value for the CASE WHEN expression.

Parameters:: value (Any) – The default value to return if no conditions match.
Return type:: CaseWhen
Returns:: Self for method chaining.

alias(name)[source]

Create an alias for the CASE WHEN expression.

Parameters:: name (str) – The alias name.
Return type:: CaseWhen
Returns:: Self for method chaining.

cast(data_type)[source]

Cast the CASE WHEN expression to a different data type.

Parameters:: data_type (Any) – The target data type (DataType instance or string type name).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the cast operation.

Example

>>> F.when(F.col("value") == "A", F.lit(100)).otherwise(F.lit(200)).cast("long")

__add__(other)[source]

Addition operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__sub__(other)[source]

Subtraction operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__mul__(other)[source]

Multiplication operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__truediv__(other)[source]

Division operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__mod__(other)[source]

Modulo operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__radd__(other)[source]

Reverse addition operation (for 2 + case_when).

Parameters:: other (Any)
Return type:: ColumnOperation

__rsub__(other)[source]

Reverse subtraction operation (for 2 - case_when).

Parameters:: other (Any)
Return type:: ColumnOperation

__rmul__(other)[source]

Reverse multiplication operation (for 2 * case_when).

Parameters:: other (Any)
Return type:: ColumnOperation

__rtruediv__(other)[source]

Reverse division operation (for 2 / case_when).

Parameters:: other (Any)
Return type:: ColumnOperation

__rmod__(other)[source]

Reverse modulo operation (for 2 % case_when).

Parameters:: other (Any)
Return type:: ColumnOperation

__or__(other)[source]

Bitwise OR operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__and__(other)[source]

Bitwise AND operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__invert__()[source]

Bitwise NOT operation (unary ~, PySpark-compatible).

Return type:: ColumnOperation

evaluate(row)[source]

Evaluate the CASE WHEN expression for a given row.

Parameters:: row (Dict[str, Any]) – The data row to evaluate against.
Return type:: Any
Returns:: The evaluated result.
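
Row-wise CASE WHEN evaluation can be pictured as "first matching branch wins, otherwise the default". The sketch below assumes branches are stored as (predicate, value) pairs; the name `evaluate_case_when` and that storage format are hypothetical and are not taken from the library.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Branch = Tuple[Callable[[Row], bool], Any]


def evaluate_case_when(row: Row, branches: List[Branch], default: Any = None) -> Any:
    """Return the value of the first branch whose predicate holds for the row."""
    for predicate, value in branches:
        if predicate(row):
            return value
    return default  # plays the role of otherwise(); None when unset


branches = [
    (lambda r: r["value"] == "A", 100),
    (lambda r: r["value"] == "B", 150),
]
```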

get_result_type()[source]

Infer the result type from condition values.

Return type:: DataType

class sparkless.functions.functions.WindowFunction(function, window_spec)[source]

Bases: object

Represents a window function.

This class handles window functions like row_number(), rank(), etc. that operate over a window specification.

Parameters:

function (Any)
window_spec (WindowSpec)

__init__(function, window_spec)[source]

Initialize WindowFunction.

Parameters:

function (Any) – The window function (e.g., row_number(), rank()).
window_spec (WindowSpec) – The window specification.

alias(name)[source]

Create an alias for this window function.

Parameters:: name (str) – The alias name.
Return type:: WindowFunction
Returns:: Self for method chaining.

cast(data_type)[source]

Cast the window function result to a different data type.

Parameters:: data_type (Any) – The target data type (DataType instance or string type name).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the cast operation.

Example

>>> F.row_number().over(window_spec).cast("long")

__mul__(other)[source]

Multiply window function result by a value.

Parameters:: other (Any) – The value to multiply by.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the multiplication.

Example

>>> F.percent_rank().over(window) * 100

__rmul__(other)[source]

Reverse multiply (e.g., 100 * window_func).

Parameters:: other (Any) – The value to multiply.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the multiplication.

Example

>>> 100 * F.percent_rank().over(window)

__add__(other)[source]

Add a value to window function result.

Parameters:: other (Any) – The value to add.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the addition.

Example

>>> F.row_number().over(window) + 1

__radd__(other)[source]

Reverse add (e.g., 1 + window_func).

Parameters:: other (Any) – The value to add.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the addition.

Example

>>> 1 + F.row_number().over(window)

__sub__(other)[source]

Subtract a value from window function result.

Parameters:: other (Any) – The value to subtract.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the subtraction.

Example

>>> F.row_number().over(window) - 1

__rsub__(other)[source]

Reverse subtract (e.g., 10 - window_func).

Parameters:: other (Any) – The value to subtract from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the subtraction.

Example

>>> 10 - F.row_number().over(window)

__truediv__(other)[source]

Divide window function result by a value.

Parameters:: other (Any) – The value to divide by.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the division.

Example

>>> F.row_number().over(window) / 10

__rtruediv__(other)[source]

Reverse divide (e.g., 100 / window_func).

Parameters:: other (Any) – The value to divide.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the division.

Example

>>> 100 / F.row_number().over(window)

__neg__()[source]

Negate window function result.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the negation.

Example

>>> -F.row_number().over(window)

__eq__(other)[source]

Equality comparison.

Parameters:: other (Any) – The value to compare with.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the equality comparison.

Example

>>> F.row_number().over(window) == 1

__ne__(other)[source]

Inequality comparison.

Parameters:: other (Any) – The value to compare with.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the inequality comparison.

Example

>>> F.row_number().over(window) != 0

__lt__(other)[source]

Less than comparison.

Parameters:: other (Any) – The value to compare with.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the less than comparison.

Example

>>> F.row_number().over(window) < 5

__le__(other)[source]

Less than or equal comparison.

Parameters:: other (Any) – The value to compare with.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the less than or equal comparison.

Example

>>> F.row_number().over(window) <= 10

__gt__(other)[source]

Greater than comparison.

Parameters:: other (Any) – The value to compare with.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the greater than comparison.

Example

>>> F.row_number().over(window) > 0

__ge__(other)[source]

Greater than or equal comparison.

Parameters:: other (Any) – The value to compare with.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the greater than or equal comparison.

Example

>>> F.row_number().over(window) >= 1

isnull()[source]

Check if window function result is null.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the isnull check.

Example

>>> F.lag("value", 1).over(window).isnull()

isnotnull()[source]

Check if window function result is not null.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the isnotnull check.

Example

>>> F.lag("value", 1).over(window).isnotnull()

eqNullSafe(other)[source]

Null-safe equality comparison (PySpark eqNullSafe).

Parameters:: other (Any) – The value to compare with.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the null-safe equality comparison.

Example

>>> F.row_number().over(window).eqNullSafe(1)

evaluate(data)[source]

Evaluate the window function over the data.

Parameters:: data (List[Dict[str, Any]]) – List of data rows.
Return type:: List[Any]
Returns:: List of window function results.
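
To make the "list of rows in, list of results out" shape concrete, here is a sketch of a `row_number()`-style computation over one partition column and one ordering column. The function name and its two string parameters are assumptions for this example; a real window specification carries more (multiple keys, sort direction, frames).

```python
from collections import defaultdict
from typing import Any, Dict, List


def row_number(data: List[Dict[str, Any]], partition_by: str, order_by: str) -> List[int]:
    """Return one row number per input row, aligned with the input order."""
    groups: Dict[Any, List[int]] = defaultdict(list)
    for index, row in enumerate(data):
        groups[row[partition_by]].append(index)

    result = [0] * len(data)
    for indices in groups.values():
        # Order rows inside the partition, then number them from 1.
        indices.sort(key=lambda i: data[i][order_by])
        for number, i in enumerate(indices, start=1):
            result[i] = number
    return result


rows = [{"g": "a", "v": 3}, {"g": "b", "v": 1}, {"g": "a", "v": 1}]
numbers = row_number(rows, "g", "v")
```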

class sparkless.functions.functions.Functions(*args, **kwargs)[source]

Bases: object

Main functions namespace (F) for Sparkless.

This class provides access to all functions in a PySpark-compatible way.

Parameters:

self (Any)
args (Any)
kwargs (Any)

Warn when Functions() is instantiated directly.

static col(name)[source]

Create a column reference.

Note

In PySpark, col() can be called without an active SparkSession. The column expression is evaluated later when used with a DataFrame.

Parameters:: name (str)
Return type:: Column

static lit(value)[source]

Create a literal value.

Note

In PySpark, lit() can be called without an active SparkSession. The literal expression is evaluated later when used with a DataFrame.

Parameters:: value (Any)
Return type:: Literal

static cast(column, data_type)[source]

Cast column to different data type.

Parameters:

column (Union[Column, str]) – The column to cast.
data_type (Any) – The target data type.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the cast function.
Raises:: RuntimeError – If no active SparkSession is available.

static current_catalog(session=None)[source]

Return the current catalog name as a literal.

Parameters:: session (Optional[SparkSession])
Return type:: Literal

static current_database(session=None)[source]

Return the current database/schema as a literal.

Parameters:: session (Optional[SparkSession])
Return type:: Literal

static current_schema(session=None)[source]

Alias for current_database (Spark SQL compatibility).

Parameters:: session (Optional[SparkSession])
Return type:: Literal

static current_user(session=None)[source]

Return the current Spark user as a literal.

Parameters:: session (Optional[SparkSession])
Return type:: Literal

static upper(column)[source]

Convert string to uppercase.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static lower(column)[source]

Convert string to lowercase.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static length(column)[source]

Get string length.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static char_length(column)[source]

Get character length; alias for length (PySpark 3.5+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static character_length(column)[source]

Get character length; alias for length (PySpark 3.5+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static trim(column)[source]

Trim whitespace.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static ltrim(column)[source]

Trim left whitespace.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static rtrim(column)[source]

Trim right whitespace.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static regexp_replace(column, pattern, replacement)[source]

Replace regex pattern.

Parameters:

column (Union[Column, str])
pattern (str)
replacement (str)

Return type:

static split(column, delimiter, limit=None)[source]

Split string by delimiter.

Parameters:

column (Union[Column, str]) – The column to split.
delimiter (str) – The delimiter to split on.
limit (Optional[int]) – Optional limit on the number of times the pattern is applied. If None or -1, no limit (default PySpark behavior).

Return type:

static substring(column, start, length=None)[source]

Extract substring.

Parameters:

column (Union[Column, str])
start (int)
length (Optional[int])

Return type:

static concat(*columns)[source]

Concatenate strings.

Parameters:: columns (Union[Column, str])
Return type:: ColumnOperation

static format_string(format_str, *columns)[source]

Format string using printf-style placeholders.

Parameters:

format_str (str)
columns (Union[Column, str])

Return type:

static translate(column, matching_string, replace_string)[source]

Translate characters in a string using a character mapping.

Parameters:

column (Union[Column, str])
matching_string (str)
replace_string (str)

Return type:
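
The character mapping can be sketched on a plain string. This example assumes PySpark's `translate` semantics: the i-th character of the matching string maps to the i-th character of the replacement string, and characters with no counterpart are removed. The function below is a standalone illustration, not the library call.

```python
def translate(text: str, matching: str, replace: str) -> str:
    """Map characters of `matching` to `replace` by position (illustrative)."""
    table = {}
    for index, char in enumerate(matching):
        if ord(char) in table:
            continue  # keep the first mapping for a repeated character
        # Characters without a counterpart in `replace` map to None (deleted).
        table[ord(char)] = replace[index] if index < len(replace) else None
    return text.translate(table)
```

For example, `translate("translate", "rnlt", "123")` maps r, n and l to 1, 2 and 3 and drops every t.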

static ascii(column)[source]

Return ASCII value of the first character.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static base64(column)[source]

Encode the string to base64.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static btrim(column, trim_string=None)[source]

Trim characters from both ends of string.

Parameters:

column (Union[Column, str])
trim_string (Optional[str])

Return type:

static contains(column, substring)[source]

Check if string contains substring.

Parameters:

column (Union[Column, str])
substring (str)

Return type:

static left(column, length)[source]

Extract left N characters from string.

Parameters:

column (Union[Column, str])
length (int)

Return type:

static right(column, length)[source]

Extract right N characters from string.

Parameters:

column (Union[Column, str])
length (int)

Return type:

static bit_length(column)[source]

Get bit length of string.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static startswith(column, substring)[source]

Check if string starts with substring.

Parameters:

column (Union[Column, str])
substring (str)

Return type:

static endswith(column, substring)[source]

Check if string ends with substring.

Parameters:

column (Union[Column, str])
substring (str)

Return type:

static like(column, pattern)[source]

SQL LIKE pattern matching.

Parameters:

column (Union[Column, str])
pattern (str)

Return type:

static rlike(column, pattern)[source]

Regular expression pattern matching.

Parameters:

column (Union[Column, str])
pattern (str)

Return type:

static isin(column, *values)[source]

Check if column value is in list of values.

Parameters:

column (Union[Column, str]) – The column to check.
*values (Any) – Variable number of values to check against.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the isin function.

static replace(column, old, new)[source]

Replace occurrences of substring in string.

Parameters:

column (Union[Column, str])
old (str)
new (str)

Return type:

static substr(column, start, length=None)[source]

Alias for substring - Extract substring from string.

Parameters:

column (Union[Column, str])
start (int)
length (Optional[int])

Return type:

static split_part(column, delimiter, part)[source]

Extract part of string split by delimiter.

Parameters:

column (Union[Column, str])
delimiter (str)
part (int)

Return type:
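
A plain-string sketch of part extraction follows. It assumes the behaviour documented for Spark's `split_part`: parts are numbered from 1, a negative number counts from the end, an out-of-range number gives an empty string, and 0 is an error. Whether sparkless matches every one of these cases is not stated in this reference.

```python
def split_part(text: str, delimiter: str, part: int) -> str:
    """Return the requested 1-based part of a delimited string (illustrative)."""
    if part == 0:
        raise ValueError("part is 1-based and must not be 0")
    # An empty delimiter leaves the string unsplit.
    parts = text.split(delimiter) if delimiter else [text]
    index = part - 1 if part > 0 else len(parts) + part
    return parts[index] if 0 <= index < len(parts) else ""
```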

static position(substring, column)[source]

Find position of substring in string (1-indexed).

Parameters:

substring (Union[Column, str])
column (Union[Column, str])

Return type:

static octet_length(column)[source]

Get byte length (octet length) of string.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static char(column)[source]

Convert integer to character.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static ucase(column)[source]

Alias for upper - Convert string to uppercase.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static lcase(column)[source]

Alias for lower - Convert string to lowercase.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static elt(n, *columns)[source]

Return element at index from list of columns.

Parameters:

n (Union[Column, int])
columns (Union[Column, str])

Return type:

static unbase64(column)[source]

Decode a base64-encoded string.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static md5(column)[source]

MD5 hash (PySpark 3.0+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static sha1(column)[source]

SHA-1 hash (PySpark 3.0+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static sha2(column, numBits)[source]

SHA-2 hash family (PySpark 3.0+).

Parameters:

column (Union[Column, str])
numBits (int)

Return type:

static sha(column)[source]

SHA-1 hash alias (PySpark 3.5+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation
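
For orientation, the hash functions above correspond to the following standard-library digests on a single string. This sketch assumes UTF-8 encoding and hex output, and it assumes PySpark's rule that `numBits` must be 224, 256, 384, 512, or 0 (treated as 256). The helper names are invented, and raising ValueError for other values is a choice of this sketch.

```python
import hashlib


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def sha1_hex(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def sha2_hex(text: str, num_bits: int) -> str:
    """SHA-2 digest; 0 selects SHA-256 as in PySpark's sha2."""
    algorithms = {
        0: hashlib.sha256,
        224: hashlib.sha224,
        256: hashlib.sha256,
        384: hashlib.sha384,
        512: hashlib.sha512,
    }
    if num_bits not in algorithms:
        raise ValueError("num_bits must be 0, 224, 256, 384 or 512")
    return algorithms[num_bits](text.encode("utf-8")).hexdigest()
```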

static mask(column, upperChar=None, lowerChar=None, digitChar=None, otherChar=None)[source]

Mask sensitive data in a string (PySpark 3.5+).

Parameters:

column (Union[Column, str])
upperChar (Optional[str])
lowerChar (Optional[str])
digitChar (Optional[str])
otherChar (Optional[str])

Return type:

static json_array_length(column, path=None)[source]

Get the length of a JSON array (PySpark 3.5+).

Parameters:

column (Union[Column, str])
path (Optional[str])

Return type:

static json_object_keys(column, path=None)[source]

Get the keys of a JSON object (PySpark 3.5+).

Parameters:

column (Union[Column, str])
path (Optional[str])

Return type:

static xpath_number(column, path)[source]

Extract number from XML using XPath (PySpark 3.5+).

Parameters:

column (Union[Column, str])
path (str)

Return type:

static user()[source]

Get current user name (PySpark 3.5+).

Return type:: ColumnOperation

static crc32(column)[source]

CRC32 checksum (PySpark 3.0+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation
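
The checksum is the standard CRC-32, so a single value can be cross-checked with the standard library. The sketch assumes the input is encoded as UTF-8 and the result is read as an unsigned integer; the helper name `crc32_of` is hypothetical.

```python
import zlib


def crc32_of(text: str) -> int:
    """CRC-32 of the UTF-8 bytes of a string, as an unsigned integer."""
    return zlib.crc32(text.encode("utf-8"))
```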

static aes_encrypt(data, key, mode=None, padding=None)[source]

Encrypt data using AES encryption (PySpark 3.5+).

Parameters:

data (Union[Column, str])
key (Union[Column, str])
mode (Optional[str])
padding (Optional[str])

Return type:

static aes_decrypt(data, key, mode=None, padding=None)[source]

Decrypt data using AES decryption (PySpark 3.5+).

Parameters:

data (Union[Column, str])
key (Union[Column, str])
mode (Optional[str])
padding (Optional[str])

Return type:

static try_aes_decrypt(data, key, mode=None, padding=None)[source]

Null-safe AES decryption - returns NULL on error (PySpark 3.5+).

Parameters:

data (Union[Column, str])
key (Union[Column, str])
mode (Optional[str])
padding (Optional[str])

Return type:

static to_str(column)[source]

Convert column to string (all PySpark versions).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static regexp_extract_all(column, pattern, idx=0)[source]

Extract all matches of a regex pattern.

Parameters:

column (Union[Column, str])
pattern (str)
idx (int)

Return type:

static array_join(column, delimiter, null_replacement=None)[source]

Join array elements with a delimiter.

Parameters:

column (Union[Column, str])
delimiter (str)
null_replacement (Optional[str])

Return type:

static repeat(column, n)[source]

Repeat a string N times.

Parameters:

column (Union[Column, str])
n (int)

Return type:

static concat_ws(sep, *cols)[source]

Concatenate multiple columns with separator.

Parameters:

sep (str)
cols (Union[Column, str])

Return type:

static regexp_extract(column, pattern, idx=0)[source]

Extract specific group matched by regex.

Parameters:

column (Union[Column, str])
pattern (str)
idx (int)

Return type:
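
As a rough illustration of the group-index semantics, here is a plain-Python sketch built on the standard `re` module. It assumes the PySpark convention that a failed match yields an empty string; it is not the sparkless implementation, and Python and Java regex dialects differ in places.

```python
import re

def regexp_extract(text: str, pattern: str, idx: int = 0) -> str:
    """Return capture group `idx` of the first match, or "" when nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return ""
    # An optional group that did not participate in the match is None.
    return match.group(idx) or ""

whole = regexp_extract("order-123-456", r"(\d+)-(\d+)")      # idx=0: whole match
first = regexp_extract("order-123-456", r"(\d+)-(\d+)", 1)   # first group
```

With `idx=0` the whole match is returned; higher values select the numbered capture groups.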

static substring_index(column, delim, count)[source]

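Return the substring before the count-th occurrence of the delimiter when count is positive (counting from the left), or the substring after it when count is negative (counting from the right).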

Parameters:

column (Union[Column, str])
delim (str)
count (int)

Return type:
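
The sign of `count` is easy to get wrong, so here is a small pure-Python sketch of the behaviour Spark SQL documents for `substring_index`. It assumes sparkless follows the same rules and ignores edge cases such as overlapping delimiters.

```python
def substring_index(text: str, delim: str, count: int) -> str:
    """Pure-Python approximation of Spark's substring_index semantics."""
    if count == 0 or not delim:
        return ""
    parts = text.split(delim)
    if count > 0:
        # Keep everything to the left of the count-th delimiter.
        return delim.join(parts[:count])
    # Negative count: keep everything to the right, counting from the end.
    return delim.join(parts[count:])

left = substring_index("www.apache.org", ".", 2)    # "www.apache"
right = substring_index("www.apache.org", ".", -2)  # "apache.org"
```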

static format_number(column, d)[source]

Format number with d decimal places and thousands separator.

Parameters:

column (Union[Column, str])
d (int)

Return type:
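
For intuition, the output shape resembles Python's own grouped fixed-point formatting. The helper below is only an analogy using the format mini-language; rounding of exact ties may differ from the engine's behaviour.

```python
def format_number(value: float, d: int) -> str:
    """Format with a comma thousands separator and d decimal places."""
    return f"{value:,.{d}f}"

formatted = format_number(1234567.891, 2)  # "1,234,567.89"
```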

static instr(column, substr)[source]

Locate position of first occurrence of substr.

Parameters:

column (Union[Column, str])
substr (str)

Return type:

static locate(substr, column, pos=1)[source]

Locate position of substr starting from pos.

Parameters:

substr (str)
column (Union[Column, str])
pos (int)

Return type:
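
Note that `instr` and `locate` take their arguments in opposite order. The sketch below models both with `str.find`, assuming the PySpark convention of 1-based positions with 0 meaning "not found"; it is an illustration rather than the library's code.

```python
def locate(substr: str, text: str, pos: int = 1) -> int:
    """1-based position of substr in text at or after pos; 0 if absent."""
    if pos < 1:
        return 0
    # str.find returns -1 when absent, so adding 1 yields 0.
    return text.find(substr, pos - 1) + 1

def instr(text: str, substr: str) -> int:
    """Same search as locate, with the arguments swapped and pos fixed at 1."""
    return locate(substr, text)
```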

static lpad(column, len, pad)[source]

Left-pad string to length len with pad string.

Parameters:

column (Union[Column, str])
len (int)
pad (str)

Return type:

static rpad(column, len, pad)[source]

Right-pad string to length len with pad string.

Parameters:

column (Union[Column, str])
len (int)
pad (str)

Return type:
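
Padding functions in Spark also truncate when the input is longer than the target length. The following pure-Python sketch shows that behaviour under the assumption that sparkless mirrors it; the parameter is named `length` here to avoid shadowing Python's built-in `len`.

```python
def lpad(text: str, length: int, pad: str) -> str:
    """Left-pad text to exactly `length` characters, truncating if too long."""
    if length <= 0:
        return ""
    if length <= len(text) or not pad:
        return text[:length]
    fill = (pad * length)[: length - len(text)]
    return fill + text

def rpad(text: str, length: int, pad: str) -> str:
    """Right-pad text to exactly `length` characters, truncating if too long."""
    if length <= 0:
        return ""
    if length <= len(text) or not pad:
        return text[:length]
    fill = (pad * length)[: length - len(text)]
    return text + fill
```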

static levenshtein(left, right)[source]

Compute Levenshtein distance between two strings.

Parameters:

left (Union[Column, str])
right (Union[Column, str])

Return type:
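
Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A standard two-row dynamic-programming version, shown only to make the metric concrete:

```python
def levenshtein(left: str, right: str) -> int:
    """Edit distance between two strings using two rolling DP rows."""
    previous = list(range(len(right) + 1))
    for i, left_char in enumerate(left, start=1):
        current = [i]
        for j, right_char in enumerate(right, start=1):
            cost = 0 if left_char == right_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

distance = levenshtein("kitten", "sitting")  # 3
```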

static bin(column)[source]

Convert to binary string.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static hex(column)[source]

Convert to hexadecimal string.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static unhex(column)[source]

Convert hex string to binary.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static hash(*cols)[source]

Compute hash value.

Parameters:: cols (Union[Column, str])
Return type:: ColumnOperation

static xxhash64(*cols)[source]

Compute xxHash64 value (all PySpark versions).

Parameters:: cols (Union[Column, str])
Return type:: ColumnOperation

static encode(column, charset)[source]

Encode string to binary.

Parameters:

column (Union[Column, str])
charset (str)

Return type:

static decode(column, charset)[source]

Decode binary to string.

Parameters:

column (Union[Column, str])
charset (str)

Return type:

static conv(column, from_base, to_base)[source]

Convert number between bases.

Parameters:

column (Union[Column, str])
from_base (int)
to_base (int)

Return type:
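
Base conversion works on the string representation of the number. A minimal sketch for non-negative inputs and bases 2 to 36 follows; the uppercase digits match what PySpark's `conv` produces, and handling of negative values is deliberately left out.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def conv(number: str, from_base: int, to_base: int) -> str:
    """Convert a non-negative number string between bases 2..36."""
    value = int(number, from_base)
    if value == 0:
        return "0"
    out = []
    while value:
        value, remainder = divmod(value, to_base)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))
```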

static initcap(column)[source]

Capitalize first letter of each word.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static soundex(column)[source]

Soundex encoding for phonetic matching.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static abs(column)[source]

Get absolute value.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static round(column, scale=0)[source]

Round to decimal places.

Parameters:

column (Union[Column, str])
scale (int)

Return type:

static ceil(column)[source]

Round up.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static ceiling(column)[source]

Alias for ceil - Round up to nearest integer.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static floor(column)[source]

Round down.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static sqrt(column)[source]

Square root.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static exp(column)[source]

Exponential.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static log(base, column=None)[source]

Logarithm.

PySpark signature: log(base, column) or log(column) for natural log.

Parameters:

base (Union[Column, str, float, int, None])
column (Union[Column, str, None])

Return type:
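
The two call shapes can be confusing because the first positional argument changes meaning. This scalar sketch, written with the standard `math` module, mirrors the dispatch described above; the parameter name `value` is illustrative.

```python
import math

def log(base, value=None):
    """log(x) is the natural log; log(base, x) is the log of x in that base."""
    if value is None:
        # Single-argument form: the only argument is the value itself.
        return math.log(base)
    return math.log(value) / math.log(base)
```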

static log10(column)[source]

Base-10 logarithm (PySpark 3.0+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static log2(column)[source]

Base-2 logarithm (PySpark 3.0+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static log1p(column)[source]

Natural log of (1 + x) (PySpark 3.0+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static expm1(column)[source]

exp(x) - 1 (PySpark 3.0+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation
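
`log1p` and `expm1` exist because the naive forms lose precision for inputs close to zero. The comparison below uses Python's `math` module to illustrate the numerical motivation; it does not call sparkless.

```python
import math

x = 1e-10

naive_log = math.log(1 + x)    # 1 + x is rounded before the log is taken
stable_log = math.log1p(x)     # accurate for tiny x

naive_exp = math.exp(x) - 1    # subtraction cancels most significant digits
stable_exp = math.expm1(x)     # accurate for tiny x
```

For small `x` both exact results are very close to `x` itself, and only the dedicated functions reproduce that to full double precision.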

static pow(column, exponent)[source]

Power.

Parameters:

column (Union[Column, str])
exponent (Union[Column, float, int])

Return type:

static power(column, exponent)[source]

Alias for pow - Raise to power.

Parameters:

column (Union[Column, str])
exponent (Union[Column, float, int])

Return type:

static positive(column)[source]

Return positive value (identity function).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static negative(column)[source]

Return negative value.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static sin(column)[source]

Sine.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static cos(column)[source]

Cosine.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static tan(column)[source]

Tangent.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static acosh(column)[source]

Inverse hyperbolic cosine (PySpark 3.0+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static asinh(column)[source]

Inverse hyperbolic sine (PySpark 3.0+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static atanh(column)[source]

Inverse hyperbolic tangent (PySpark 3.0+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static acos(column)[source]

Inverse cosine (arc cosine).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static asin(column)[source]

Inverse sine (arc sine).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static atan(column)[source]

Inverse tangent (arc tangent).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static atan2(y, x)[source]

2-argument arctangent (PySpark 3.0+).

Parameters:

y (Union[Column, str, float, int])
x (Union[Column, str, float, int])

Return type:

static cosh(column)[source]

Hyperbolic cosine.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static sinh(column)[source]

Hyperbolic sine.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static tanh(column)[source]

Hyperbolic tangent.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static degrees(column)[source]

Convert radians to degrees.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static radians(column)[source]

Convert degrees to radians.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static cbrt(column)[source]

Cube root.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static factorial(column)[source]

Factorial of non-negative integer.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static rand(seed=None)[source]

Generate random column with uniform distribution over [0.0, 1.0).

Parameters:: seed (Optional[int])
Return type:: ColumnOperation

static randn(seed=None)[source]

Generate random column with standard normal distribution.

Parameters:: seed (Optional[int])
Return type:: ColumnOperation

static rint(column)[source]

Round to nearest integer using banker’s rounding.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static bround(column, scale=0)[source]

Round using HALF_EVEN rounding mode.

Parameters:

column (Union[Column, str])
scale (int)

Return type:
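
HALF_EVEN ("banker's") rounding sends exact ties to the nearest even digit, whereas PySpark's `round` uses HALF_UP. The sketch below shows the difference with Python's `decimal` module; the helper name `round_half` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def round_half(value, scale=0, mode=ROUND_HALF_EVEN):
    """Round value to `scale` decimal places with an explicit tie-breaking mode."""
    quantum = Decimal(1).scaleb(-scale)  # scale=1 gives Decimal("0.1")
    return float(Decimal(str(value)).quantize(quantum, rounding=mode))

even_tie = round_half(2.5)                     # 2.0 under HALF_EVEN
up_tie = round_half(2.5, mode=ROUND_HALF_UP)   # 3.0 under HALF_UP
```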

static sign(column)[source]

Sign of number (matches PySpark signum).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static hypot(col1, col2)[source]

Compute hypotenuse.

Parameters:

col1 (Union[Column, str])
col2 (Union[Column, str])

Return type:

static nanvl(col1, col2)[source]

Return col1 if not NaN, else col2.

Parameters:

col1 (Union[Column, str])
col2 (Union[Column, str])

Return type:

static signum(column)[source]

Compute signum (sign).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static width_bucket(value, min_value, max_value, num_buckets)[source]

Compute histogram bucket number for value (PySpark 3.5+).

Parameters:

value (Union[Column, str])
min_value (Union[Column, str, float])
max_value (Union[Column, str, float])
num_buckets (Union[Column, str, int])

Return type:
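
Bucket numbering follows the SQL convention: values below the range fall in bucket 0, values at or above the upper bound fall in bucket `num_buckets + 1`, and the range itself is split into equal-width buckets numbered from 1. A scalar sketch, assuming `min_value < max_value` and that sparkless follows this convention:

```python
def width_bucket(value, min_value, max_value, num_buckets):
    """Equal-width histogram bucket for value, with under/overflow buckets."""
    if value < min_value:
        return 0
    if value >= max_value:
        return num_buckets + 1
    width = (max_value - min_value) / num_buckets
    return int((value - min_value) // width) + 1

bucket = width_bucket(5.3, 0.2, 10.6, 5)  # 3
```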

static cot(column)[source]

Compute cotangent (PySpark 3.3+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static csc(column)[source]

Compute cosecant (PySpark 3.3+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static sec(column)[source]

Compute secant (PySpark 3.3+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static e()[source]

Euler’s number e (PySpark 3.5+).

Return type:: ColumnOperation

static pi()[source]

Pi constant (PySpark 3.5+).

Return type:: ColumnOperation

static ln(column)[source]

Natural logarithm (PySpark 3.5+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static greatest(*columns)[source]

Greatest value among columns.

Parameters:: columns (Union[Column, str])
Return type:: ColumnOperation

static least(*columns)[source]

Least value among columns.

Parameters:: columns (Union[Column, str])
Return type:: ColumnOperation

static count(column=None)[source]

Count values.

Parameters:: column (Union[Column, str, None])
Return type:: ColumnOperation

static sum(column)[source]

Sum values.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static avg(column)[source]

Average values.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static max(column)[source]

Maximum value.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static min(column)[source]

Minimum value.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static first(column, ignorenulls=False)[source]

First value.

Parameters:

column (Union[Column, str])
ignorenulls (bool)

Return type:

static last(column)[source]

Last value.

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static collect_list(column)[source]

Collect values into list.

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static collect_set(column)[source]

Collect unique values into set.

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static stddev(column)[source]

Standard deviation.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static std(column)[source]

Alias for stddev - Standard deviation.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static product(column)[source]

Multiply all values in column.

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static sum_distinct(column)[source]

Sum of distinct values.

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static variance(column)[source]

Variance.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static skewness(column)[source]

Skewness.

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static kurtosis(column)[source]

Kurtosis.

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static countDistinct(column)[source]

Count distinct values.

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static count_distinct(column)[source]

Alias for countDistinct - Count distinct values.

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static percentile_approx(column, percentage, accuracy=10000)[source]

Approximate percentile.

Parameters:

column (Union[Column, str])
percentage (float)
accuracy (int)

Return type:

static corr(column1, column2)[source]

Correlation between two columns.

Parameters:

column1 (Union[Column, str])
column2 (Union[Column, str])

Return type:

static covar_samp(column1, column2)[source]

Sample covariance between two columns.

Parameters:

column1 (Union[Column, str])
column2 (Union[Column, str])

Return type:

static mean(column)[source]

Mean of values (alias for avg).

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static approx_count_distinct(column, rsd=None)[source]

Approximate count of distinct elements.

Parameters:

column (Union[Column, str]) – Column to count distinct values.
rsd (Optional[float]) – Optional relative standard deviation (default: None, which uses PySpark’s default of 0.05). Controls the approximation accuracy. Lower values provide better accuracy but use more memory.

Return type:

static stddev_pop(column)[source]

Population standard deviation.

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static stddev_samp(column)[source]

Sample standard deviation.

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static var_pop(column)[source]

Population variance.

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static var_samp(column)[source]

Sample variance.

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction
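
The `_pop` and `_samp` variants differ only in the divisor: population statistics divide by n, sample statistics by n - 1. The standard `statistics` module shows the same split. In PySpark, plain `stddev` and `variance` are aliases for the sample versions; whether sparkless matches is an assumption here.

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

var_pop = statistics.pvariance(values)   # sum of squared deviations / n
var_samp = statistics.variance(values)   # sum of squared deviations / (n - 1)
stddev_pop = statistics.pstdev(values)   # sqrt(var_pop)
stddev_samp = statistics.stdev(values)   # sqrt(var_samp)
```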

static covar_pop(column1, column2)[source]

Population covariance.

Parameters:

column1 (Union[Column, str])
column2 (Union[Column, str])

Return type:

static median(column)[source]

Median value (PySpark 3.4+).

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static mode(column)[source]

Most frequent value (PySpark 3.4+).

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static percentile(column, percentage)[source]

Exact percentile (PySpark 3.5+).

Parameters:

column (Union[Column, str])
percentage (float)

Return type:
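
An exact percentile is commonly computed by sorting and interpolating linearly between the two nearest ranks, which is also how Spark's `percentile` behaves. The sketch below follows that approach for a non-empty list; treat it as an assumption about sparkless rather than a description of its code.

```python
def percentile(values, percentage):
    """Exact percentile with linear interpolation; percentage is in [0, 1]."""
    ordered = sorted(values)
    position = percentage * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

median_value = percentile([1, 2, 3, 4], 0.5)  # 2.5
```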

static approx_percentile(column, percentage, accuracy=10000)[source]

Approximate percentile (PySpark 3.5+).

Parameters:

column (Union[Column, str])
percentage (Union[float, Column, str])
accuracy (Union[int, Column, str])

Return type:

static bool_and(column)[source]

Aggregate AND (PySpark 3.1+).

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static bool_or(column)[source]

Aggregate OR (PySpark 3.1+).

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static every(column)[source]

Alias for bool_and (PySpark 3.1+).

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static some(column)[source]

Alias for bool_or (PySpark 3.1+).

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static max_by(column, ord)[source]

Value of column from the row where the ord column is largest (PySpark 3.1+).

Parameters:

column (Union[Column, str])
ord (Union[Column, str])

Return type:

static min_by(column, ord)[source]

Value of column from the row where the ord column is smallest (PySpark 3.1+).

Parameters:

column (Union[Column, str])
ord (Union[Column, str])

Return type:
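
`max_by` and `min_by` return a value from one column, chosen by ordering on another. In plain Python the same idea is `max` or `min` with a `key`; the rows below are made-up sample data.

```python
# (product, price) pairs standing in for two DataFrame columns.
rows = [("laptop", 1200), ("phone", 800), ("tablet", 450)]

def max_by(pairs):
    """First element of the pair whose second element is largest."""
    return max(pairs, key=lambda pair: pair[1])[0]

def min_by(pairs):
    """First element of the pair whose second element is smallest."""
    return min(pairs, key=lambda pair: pair[1])[0]
```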

static count_if(column)[source]

Count where condition is true (PySpark 3.1+).

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static any_value(column)[source]

Return any non-null value (PySpark 3.1+).

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static current_timestamp()[source]

Current timestamp.

Raises:: RuntimeError – If no active SparkSession is available
Return type:: ColumnOperation

static current_date()[source]

Current date.

Raises:: RuntimeError – If no active SparkSession is available
Return type:: ColumnOperation

static version()[source]

Return Spark version string (PySpark 3.0+).

Return type:: Literal
Returns:: Literal with sparkless version

static to_date(column, format=None)[source]

Convert to date.

Parameters:

column (Union[Column, str])
format (Optional[str])

Return type:

static to_timestamp(column, format=None)[source]

Convert to timestamp.

Parameters:

column (Union[Column, str])
format (Optional[str])

Return type:

static date_from_unix_date(days)[source]

Convert unix date (days since epoch) to date (PySpark 3.5+).

Parameters:: days (Union[Column, str, int])
Return type:: ColumnOperation

static to_timestamp_ltz(timestamp_str, format=None)[source]

Convert string to timestamp with local timezone (PySpark 3.5+).

Parameters:

timestamp_str (Union[Column, str])
format (Optional[str])

Return type:

static to_timestamp_ntz(timestamp_str, format=None)[source]

Convert string to timestamp with no timezone (PySpark 3.5+).

Parameters:

timestamp_str (Union[Column, str])
format (Optional[str])

Return type:

static hour(column)[source]

Extract hour.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static day(column)[source]

Extract day.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static dayofmonth(column)[source]

Extract day of month (alias for day).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static month(column)[source]

Extract month.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static year(column)[source]

Extract year.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static coalesce(*columns)[source]

Return first non-null value.

Parameters:: columns (Union[Column, str, Any])
Return type:: ColumnOperation

static isnull(column)[source]

Check if column is null.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static isnotnull(column)[source]

Check if column is not null.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static isnan(column)[source]

Check if column is NaN.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static when(condition, value=None)[source]

Start CASE WHEN expression.

Raises:

RuntimeError – If no active SparkSession is available

Parameters:

condition (Any)
value (Any)

Return type:

static case_when(*conditions, else_value=None)[source]

Create CASE WHEN expression with multiple conditions.

Parameters:

conditions (Tuple[Any, Any])
else_value (Any)

Return type:
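
A CASE WHEN expression evaluates its branches in order and returns the value of the first condition that holds. The row-level sketch below models that with `(predicate, value)` tuples; using Python callables as predicates is an assumption made for illustration, since the library takes column conditions.

```python
def case_when(row, *conditions, else_value=None):
    """Return the value paired with the first predicate that is true for row."""
    for predicate, value in conditions:
        if predicate(row):
            return value
    return else_value

def grade(score):
    row = {"score": score}
    return case_when(
        row,
        (lambda r: r["score"] >= 90, "A"),
        (lambda r: r["score"] >= 70, "B"),
        else_value="C",
    )
```

Because evaluation stops at the first match, a score of 95 yields "A" even though it also satisfies the second condition.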

static dayofweek(column)[source]

Extract day of week.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static dayofyear(column)[source]

Extract day of year.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static weekofyear(column)[source]

Extract week of year.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static quarter(column)[source]

Extract quarter.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static now()[source]

Alias for current_timestamp - Get current timestamp.

Return type:: ColumnOperation

static curdate()[source]

Alias for current_date - Get current date.

Return type:: ColumnOperation

static days(column)[source]

Convert number to days interval.

Parameters:: column (Union[Column, str, int])
Return type:: ColumnOperation

static hours(column)[source]

Convert number to hours interval.

Parameters:: column (Union[Column, str, int])
Return type:: ColumnOperation

static months(column)[source]

Convert number to months interval.

Parameters:: column (Union[Column, str, int])
Return type:: ColumnOperation

static expr(expression)[source]

Parse SQL expression into a column.

Parameters:

expression (str) – SQL expression string (e.g., “id IS NOT NULL”, “age > 18”). Must use SQL syntax, not Python expressions.

Return type:

Union[ColumnOperation, Column, CaseWhen, Literal]

Returns:

Column expression for the parsed SQL: a ColumnOperation, Column, CaseWhen, or Literal, depending on the expression.

Raises:

RuntimeError – If no active SparkSession is available
ParseException – If SQL syntax is invalid

static minute(column)[source]

Extract minute.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static second(column)[source]

Extract second.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static add_months(column, num_months)[source]

Add months to date.

Parameters:

column (Union[Column, str])
num_months (int)

Return type:
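
Month arithmetic has to cope with months of different lengths. The sketch below clamps the day to the last valid day of the target month, which is how Spark 3 behaves; that sparkless does the same is an assumption.

```python
import calendar
from datetime import date

def add_months(start: date, num_months: int) -> date:
    """Shift a date by whole months, clamping the day to the month's end."""
    month_index = start.year * 12 + (start.month - 1) + num_months
    year, month_zero_based = divmod(month_index, 12)
    month = month_zero_based + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))

shifted = add_months(date(2024, 1, 31), 1)  # date(2024, 2, 29)
```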

static months_between(column1, column2)[source]

Calculate months between two dates.

Parameters:

column1 (Union[Column, str])
column2 (Union[Column, str])

Return type:

static date_add(column, days)[source]

Add days to date.

Parameters:

column (Union[Column, str])
days (int)

Return type:

static date_sub(column, days)[source]

Subtract days from date.

Parameters:

column (Union[Column, str])
days (int)

Return type:

static date_format(column, format)[source]

Format date/timestamp as string.

Parameters:

column (Union[Column, str])
format (str)

Return type:

static make_date(year, month, day)[source]

Construct date from year, month, day (PySpark 3.0+).

Parameters:

year (Union[Column, int])
month (Union[Column, int])
day (Union[Column, int])

Return type:

static date_trunc(format, timestamp)[source]

Truncate timestamp to specified unit.

Parameters:

format (str)
timestamp (Union[Column, str])

Return type:

static datediff(end, start)[source]

Number of days between two dates.

Parameters:

end (Union[Column, str])
start (Union[Column, str])

Return type:

static date_diff(end, start)[source]

Alias for datediff - Returns number of days between two dates.

Parameters:

end (Union[Column, str])
start (Union[Column, str])

Return type:

static unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')[source]

Convert timestamp to Unix timestamp.

Parameters:

timestamp (Union[Column, str, None])
format (str)

Return type:

static last_day(date)[source]

Last day of the month for given date.

Parameters:: date (Union[Column, str])
Return type:: ColumnOperation

static next_day(date, dayOfWeek)[source]

First date after the given date that falls on the specified day of the week.

Parameters:

date (Union[Column, str])
dayOfWeek (str)

Return type:
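
The result is always strictly later than the input, so asking for a Monday on a Monday moves a full week ahead. A small `datetime` sketch of that rule, assuming day names are given as English names or three-letter abbreviations:

```python
from datetime import date, timedelta

DAY_NAMES = ["MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"]

def next_day(start: date, day_of_week: str) -> date:
    """First date strictly after start that falls on day_of_week."""
    target = DAY_NAMES.index(day_of_week[:3].upper())
    # Offset in 1..7 so that the same weekday maps to one week later.
    delta = (target - start.weekday() - 1) % 7 + 1
    return start + timedelta(days=delta)

monday = date(2024, 1, 1)  # 1 January 2024 was a Monday
```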

static trunc(date, format)[source]

Truncate date to specified unit.

Parameters:

date (Union[Column, str])
format (str)

Return type:

static timestamp_seconds(col)[source]

Convert seconds since epoch to timestamp (PySpark 3.1+).

Parameters:: col (Union[Column, str, int])
Return type:: ColumnOperation

static weekday(col)[source]

Day of week as integer (0=Monday, 6=Sunday) (PySpark 3.5+).

Parameters:: col (Union[Column, str])
Return type:: ColumnOperation
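
`weekday` and `dayofweek` use different numbering. In PySpark, `dayofweek` runs from 1 (Sunday) to 7 (Saturday), while `weekday` runs from 0 (Monday) to 6 (Sunday). The sketch below maps both onto Python's `date` methods, assuming sparkless keeps the PySpark numbering.

```python
from datetime import date

def weekday(d: date) -> int:
    """0 = Monday ... 6 = Sunday, same as date.weekday()."""
    return d.weekday()

def dayofweek(d: date) -> int:
    """1 = Sunday ... 7 = Saturday (PySpark dayofweek numbering)."""
    return d.isoweekday() % 7 + 1

sunday = date(2024, 1, 7)
monday = date(2024, 1, 1)
```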

static extract(field, source)[source]

Extract field from date/timestamp (PySpark 3.5+).

Parameters:

field (str)
source (Union[Column, str])

Return type:

static raise_error(msg)[source]

Raise an error with the specified message (PySpark 3.1+).

Parameters:: msg (Union[Column, str]) – Error message
Return type:: ColumnOperation
Returns:: ColumnOperation representing the raise_error function

static from_unixtime(column, format='yyyy-MM-dd HH:mm:ss')[source]

Convert unix timestamp to string.

Parameters:

column (Union[Column, str])
format (str)

Return type:

static timestampadd(unit, quantity, timestamp)[source]

Add time units to a timestamp.

Parameters:

unit (str)
quantity (Union[int, Column])
timestamp (Union[str, Column])

Return type:

static timestampdiff(unit, start, end)[source]

Calculate difference between two timestamps.

Parameters:

unit (str)
start (Union[str, Column])
end (Union[str, Column])

Return type:

static nvl(column, default_value)[source]

Return default if null. PySpark uses coalesce internally.

Parameters:

column (Union[Column, str])
default_value (Any)

Return type:

static nvl2(column, value_if_not_null, value_if_null)[source]

Return value based on null check. PySpark uses when/otherwise internally.

Parameters:

column (Union[Column, str])
value_if_not_null (Any)
value_if_null (Any)

Return type:

static equal_null(col1, col2)[source]

Equality check that treats NULL as equal.

Parameters:

col1 (Union[Column, str])
col2 (Union[Column, str, Any])

Return type:
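
The three null helpers above can be summarised with Python's `None` standing in for SQL NULL. This is a scalar sketch of the semantics only.

```python
def nvl(value, default_value):
    """Return default_value when value is null, otherwise value."""
    return default_value if value is None else value

def nvl2(value, value_if_not_null, value_if_null):
    """Choose between two results based on whether value is null."""
    return value_if_null if value is None else value_if_not_null

def equal_null(left, right):
    """Equality where two nulls compare equal and null never equals a value."""
    if left is None or right is None:
        return left is None and right is None
    return left == right
```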

static row_number()[source]

Row number window function.

Raises:: RuntimeError – If no active SparkSession is available
Return type:: ColumnOperation

static rank()[source]

Rank window function.

Raises:: RuntimeError – If no active SparkSession is available
Return type:: ColumnOperation

static dense_rank()[source]

Dense rank window function.

Raises:: RuntimeError – If no active SparkSession is available
Return type:: ColumnOperation
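
The three ranking functions differ only in how they treat ties. The sketch below computes all three over a list that is already in window order, using standard SQL semantics.

```python
def rankings(ordered_values):
    """Return (row_number, rank, dense_rank) lists for pre-sorted values."""
    row_number, rank, dense_rank = [], [], []
    for i, value in enumerate(ordered_values, start=1):
        row_number.append(i)
        if i > 1 and value == ordered_values[i - 2]:
            # Tie with the previous row: repeat its rank values.
            rank.append(rank[-1])
            dense_rank.append(dense_rank[-1])
        else:
            rank.append(i)  # rank leaves gaps after ties
            dense_rank.append(dense_rank[-1] + 1 if dense_rank else 1)
    return row_number, rank, dense_rank

numbers, ranks, dense = rankings([100, 90, 90, 80])
```

For the scores 100, 90, 90, 80 this gives row numbers 1, 2, 3, 4, ranks 1, 2, 2, 4, and dense ranks 1, 2, 2, 3.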

static lag(column, offset=1, default=None)[source]

Lag window function.

Parameters:

column (Union[Column, str]) – The column to lag.
offset (int) – Number of rows to look back. Default is 1.
default (Any) – Default value if offset goes beyond partition. Default is None.

Raises:

RuntimeError – If no active SparkSession is available

Return type:

static lead(column, offset=1, default=None)[source]

Lead window function.

Parameters:

column (Union[Column, str]) – The column to lead.
offset (int) – Number of rows to look forward. Default is 1.
default (Any) – Default value if offset goes beyond partition. Default is None.

Raises:

RuntimeError – If no active SparkSession is available

Return type:
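
Within a single ordered partition, `lag` and `lead` are list shifts that fill with `default` where the offset runs off either end. A list-based sketch of that behaviour:

```python
def lag(values, offset=1, default=None):
    """Value `offset` rows earlier in the partition, or default if none."""
    size = len(values)
    return [
        values[i - offset] if 0 <= i - offset < size else default
        for i in range(size)
    ]

def lead(values, offset=1, default=None):
    """Value `offset` rows later in the partition, or default if none."""
    return lag(values, -offset, default)
```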

static nth_value(column, n)[source]

Nth value window function.

Raises:

RuntimeError – If no active SparkSession is available

Parameters:

column (Union[Column, str])
n (int)

Return type:

static ntile(n)[source]

NTILE window function.

Raises:: RuntimeError – If no active SparkSession is available
Parameters:: n (int)
Return type:: ColumnOperation
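
NTILE splits an ordered partition into `n` buckets whose sizes differ by at most one, with the larger buckets first. A sketch that returns the bucket number for each row position:

```python
def ntile(num_rows: int, n: int):
    """Bucket number (1..n) for each of num_rows ordered rows."""
    base, extra = divmod(num_rows, n)
    buckets = []
    for bucket in range(1, n + 1):
        # The first `extra` buckets each receive one additional row.
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets

tiles = ntile(10, 3)  # [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
```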

static cume_dist()[source]

Cumulative distribution window function.

Raises:: RuntimeError – If no active SparkSession is available
Return type:: ColumnOperation

static percent_rank()[source]

Percent rank window function.

Raises:: RuntimeError – If no active SparkSession is available
Return type:: ColumnOperation

static first_value(column)[source]

First value window function.

Raises:: RuntimeError – If no active SparkSession is available
Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static last_value(column)[source]

Last value window function.

Raises:: RuntimeError – If no active SparkSession is available
Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static desc(column)[source]

Create descending order column.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static array(*cols)[source]

Create array from columns (PySpark 3.0+).

Parameters:: cols (Union[Column, str])
Return type:: ColumnOperation

static array_repeat(col, count)[source]

Repeat value to create array (PySpark 3.0+).

Parameters:

col (Union[Column, str])
count (int)

Return type:

static sort_array(col, asc=True)[source]

Sort array elements (PySpark 3.0+).

Parameters:

col (Union[Column, str])
asc (bool)

Return type:

static array_agg(col)[source]

Aggregate values into array (PySpark 3.5+).

Parameters:: col (Union[Column, str])
Return type:: AggregateFunction

static cardinality(col)[source]

Return size of array or map (PySpark 3.5+).

Parameters:: col (Union[Column, str])
Return type:: ColumnOperation

static array_distinct(column)[source]

Remove duplicate elements from array.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static array_intersect(column1, column2)[source]

Intersection of two arrays.

Parameters:

column1 (Union[Column, str])
column2 (Union[Column, str])

Return type:

static array_union(column1, column2)[source]

Union of two arrays.

Parameters:

column1 (Union[Column, str])
column2 (Union[Column, str])

Return type:

static array_except(column1, column2)[source]

Elements in first array but not second.

Parameters:

column1 (Union[Column, str])
column2 (Union[Column, str])

Return type:

static array_position(column, value)[source]

Position of element in array.

Parameters:

column (Union[Column, str])
value (Any)

Return type:

static array_remove(column, value)[source]

Remove all occurrences of element from array.

Parameters:

column (Union[Column, str])
value (Any)

Return type:

static transform(column, function)[source]

Apply function to each array element.

Parameters:

column (Union[Column, str])
function (Callable[[Any], Any])

Return type:

static filter(column, function)[source]

Filter array elements with predicate.

Parameters:

column (Union[Column, str])
function (Callable[[Any], bool])

Return type:

static exists(column, function)[source]

Check if any element satisfies predicate.

Parameters:

column (Union[Column, str])
function (Callable[[Any], bool])

Return type:

static forall(column, function)[source]

Check if all elements satisfy predicate.

Parameters:

column (Union[Column, str])
function (Callable[[Any], bool])

Return type:

static aggregate(column, initial_value, merge, finish=None)[source]

Aggregate array elements to single value.

Parameters:

column (Union[Column, str])
initial_value (Any)
merge (Callable[[Any, Any], Any])
finish (Optional[Callable[[Any], Any]])

Return type:

static zip_with(left, right, function)[source]

Merge two arrays element-wise.

Parameters:

left (Union[Column, str])
right (Union[Column, str])
function (Callable[[Any, Any], Any])

Return type:
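
The higher-order array functions above correspond closely to familiar Python idioms. The snippet below shows the analogy on an ordinary list; it does not call sparkless, and the `zip_with` analogue assumes both arrays have the same length.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

transformed = [x * 2 for x in numbers]                    # transform
filtered = [x for x in numbers if x % 2 == 0]             # filter
has_large = any(x > 3 for x in numbers)                   # exists
all_positive = all(x > 0 for x in numbers)                # forall
total = reduce(lambda acc, x: acc + x, numbers, 0)        # aggregate(initial_value, merge)
average = (lambda s: s / len(numbers))(total)             # optional finish step
pairwise = [a + b for a, b in zip(numbers, [10, 20, 30, 40])]  # zip_with
```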

static array_compact(column)[source]

Remove null values from array.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static slice(column, start, length)[source]

Extract array slice.

Parameters:

column (Union[Column, str])
start (int)
length (int)

Return type:

static element_at(column, index)[source]

Get element at index.

Parameters:

column (Union[Column, str])
index (int)

Return type:
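
Array indexing in Spark SQL is 1-based, and negative indices count from the end. The sketch below assumes sparkless follows that convention and returns `None` for out-of-range positions, as Spark does outside ANSI mode.

```python
def element_at(array, index):
    """1-based lookup; negative indices count from the end; None if out of range."""
    if index == 0:
        raise ValueError("index 0 is invalid; array positions start at 1")
    position = index - 1 if index > 0 else len(array) + index
    if 0 <= position < len(array):
        return array[position]
    return None
```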

static array_append(column, element)[source]

Append element to array.

Parameters:

column (Union[Column, str])
element (Any)

Return type:

static array_prepend(column, element)[source]

Prepend element to array.

Parameters:

column (Union[Column, str])
element (Any)

Return type:

static array_insert(column, pos, value)[source]

Insert element at position.

Parameters:

column (Union[Column, str])
pos (int)
value (Any)

Return type:

static array_size(column)[source]

Get array length.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static array_sort(column)[source]

Sort array elements.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static arrays_overlap(column1, column2)[source]

Check if arrays have common elements.

Parameters:

column1 (Union[Column, str])
column2 (Union[Column, str])

Return type:

static array_contains(column, value)[source]

Check if array contains value.

Parameters:

column (Union[Column, str])
value (Any)

Return type:

static array_max(column)[source]

Return maximum value from array.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static array_min(column)[source]

Return minimum value from array.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static explode(column)[source]

Returns a new row for each element in array or map.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static size(column)[source]

Return size of array or map.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static flatten(column)[source]

Flatten array of arrays into single array.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static reverse(column)[source]

Reverse string or array elements. Defaults to string reverse.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static explode_outer(column)[source]

Explode array including null/empty arrays.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static posexplode(column)[source]

Explode array with position.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static posexplode_outer(column)[source]

Explode array with position including null/empty.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation
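The explode variants above differ in two respects: whether a position is emitted, and whether a null or empty array still produces a row. The plain-Python sketch below illustrates those semantics as PySpark documents them. It is not sparkless's implementation, and the helper names are hypothetical.

```python
from typing import Any, Iterator, Optional, Sequence, Tuple


def explode_rows(values: Optional[Sequence[Any]]) -> Iterator[Any]:
    """explode: one output per element; null or empty input yields nothing."""
    yield from values or []


def explode_outer_rows(values: Optional[Sequence[Any]]) -> Iterator[Any]:
    """explode_outer: null or empty input still yields a single None."""
    if not values:
        yield None
    else:
        yield from values


def posexplode_rows(values: Optional[Sequence[Any]]) -> Iterator[Tuple[int, Any]]:
    """posexplode: (position, element) pairs, positions starting at 0."""
    yield from enumerate(values or [])


def posexplode_outer_rows(
    values: Optional[Sequence[Any]],
) -> Iterator[Tuple[Optional[int], Any]]:
    """posexplode_outer: null or empty input yields one (None, None) pair."""
    if not values:
        yield (None, None)
    else:
        yield from enumerate(values)
```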

static arrays_zip(*columns)[source]

Merge arrays into array of structs.

Parameters:: columns (Union[Column, str])
Return type:: ColumnOperation
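As an illustration of "array of structs", the sketch below zips Python lists into a list of dicts. Padding shorter arrays with None mirrors PySpark's documented behaviour. Whether sparkless pads the same way is an assumption, and `arrays_zip_sketch` is a hypothetical helper.

```python
from itertools import zip_longest
from typing import Any, Dict, List, Sequence


def arrays_zip_sketch(names: Sequence[str], *arrays: Sequence[Any]) -> List[Dict[str, Any]]:
    """Combine the i-th elements of each array into one struct-like dict."""
    # zip_longest pads the shorter arrays with None (assumed null padding).
    return [dict(zip(names, combo)) for combo in zip_longest(*arrays, fillvalue=None)]
```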

static sequence(start, stop, step=1)[source]

Generate array sequence from start to stop.

Parameters:

start (Union[Column, str, int])
stop (Union[Column, str, int])
step (Union[Column, str, int])

Return type:

static shuffle(column)[source]

Randomly shuffle array elements.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static map_keys(column)[source]

Get all keys from map.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static map_values(column)[source]

Get all values from map.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static map_entries(column)[source]

Get key-value pairs as array of structs.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static map_concat(*columns)[source]

Concatenate multiple maps.

Parameters:: columns (Union[Column, str])
Return type:: ColumnOperation

static map_from_arrays(keys, values)[source]

Create map from key and value arrays.

Parameters:

keys (Union[Column, str])
values (Union[Column, str])

Return type:

static create_map(*cols)[source]

Create map from key-value pairs.

Parameters:: cols (Union[Column, str, Any])
Return type:: ColumnOperation

static map_contains_key(column, key)[source]

Check if map contains key.

Parameters:

column (Union[Column, str])
key (Any)

Return type:

static map_from_entries(column)[source]

Convert array of structs to map.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static map_filter(column, function)[source]

Filter map entries with predicate.

Parameters:

column (Union[Column, str])
function (Callable[[Any, Any], bool])

Return type:

static transform_keys(column, function)[source]

Transform map keys with function.

Parameters:

column (Union[Column, str])
function (Callable[[Any, Any], Any])

Return type:

static transform_values(column, function)[source]

Transform map values with function.

Parameters:

column (Union[Column, str])
function (Callable[[Any, Any], Any])

Return type:

static map_zip_with(col1, col2, function)[source]

Merge two maps using function (PySpark 3.1+).

Parameters:

col1 (Union[Column, str])
col2 (Union[Column, str])
function (Callable[[Any, Any, Any], Any])

Return type:

static struct(*cols)[source]

Create a struct column from given columns.

Parameters:: cols (Union[Column, str])
Return type:: ColumnOperation

static named_struct(*cols)[source]

Create a struct column with named fields.

Parameters:: *cols (Any) – Alternating field names (strings) and column values.
Return type:: ColumnOperation
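The "alternating field names and column values" convention can be illustrated with a small plain-Python helper. This is only a sketch of the argument layout. `pair_fields` is a hypothetical name, and the validation shown is an assumption rather than documented sparkless behaviour.

```python
from typing import Any, Dict


def pair_fields(*cols: Any) -> Dict[str, Any]:
    """Pair up (name, value, name, value, ...) arguments into a dict."""
    if len(cols) % 2 != 0:
        raise ValueError("expected alternating field names and values")
    names = cols[0::2]   # even positions hold the field names
    values = cols[1::2]  # odd positions hold the corresponding values
    for name in names:
        if not isinstance(name, str):
            raise TypeError(f"field name must be a string, got {name!r}")
    return dict(zip(names, values))
```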

static bit_count(column)[source]

Count set bits.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation
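For intuition, counting set bits can be sketched in plain Python. Treating negative integers as 64-bit two's complement values is an assumption modelled on Spark's handling of long columns. `bit_count_sketch` is a hypothetical helper, not the sparkless implementation.

```python
def bit_count_sketch(value: int, width: int = 64) -> int:
    """Count the 1-bits of value, viewed as a width-bit two's complement number."""
    mask = (1 << width) - 1          # keep only the lowest `width` bits
    return bin(value & mask).count("1")
```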

static bit_get(column, pos)[source]

Get bit at position.

Parameters:

column (Union[Column, str])
pos (int)

Return type:

static getbit(column, pos)[source]

Get bit at position (alias for bit_get) (PySpark 3.5+).

Parameters:

column (Union[Column, str])
pos (int)

Return type:

static bitmap_bit_position(column)[source]

Get the bit position in a bitmap (PySpark 3.5+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static bitmap_bucket_number(column)[source]

Get the bucket number in a bitmap (PySpark 3.5+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static bitmap_construct_agg(column)[source]

Aggregate function - construct bitmap from values (PySpark 3.5+).

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static bitmap_count(column)[source]

Count the number of set bits in a bitmap (PySpark 3.5+).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static bitmap_or_agg(column)[source]

Aggregate function - bitwise OR of bitmaps (PySpark 3.5+).

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static bitwise_not(column)[source]

Bitwise NOT.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static bit_and(column)[source]

Bitwise AND aggregate (PySpark 3.5+).

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static bit_or(column)[source]

Bitwise OR aggregate (PySpark 3.5+).

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static bit_xor(column)[source]

Bitwise XOR aggregate (PySpark 3.5+).

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction
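The three bitwise aggregates above fold a column of integers with AND, OR, or XOR. The sketch below shows that folding in plain Python. Skipping nulls and returning None for an all-null group are assumptions based on how SQL aggregates usually behave, and `bit_agg` is a hypothetical helper.

```python
import operator
from functools import reduce
from typing import Callable, Iterable, Optional


def bit_agg(values: Iterable[Optional[int]], op: Callable[[int, int], int]) -> Optional[int]:
    """Fold the non-null values with a binary bitwise operator."""
    present = [v for v in values if v is not None]  # nulls are ignored
    if not present:
        return None                                  # empty or all-null group
    return reduce(op, present)
```

For example, `bit_agg([12, 10], operator.and_)` folds 0b1100 and 0b1010 into 0b1000.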

static convert_timezone(sourceTz, targetTz, sourceTs)[source]

Convert timestamp between timezones.

Parameters:

sourceTz (str)
targetTz (str)
sourceTs (Union[Column, str])

Return type:

static current_timezone()[source]

Get current timezone.

Raises:: RuntimeError – If no active SparkSession is available
Return type:: ColumnOperation

static from_utc_timestamp(ts, tz)[source]

Convert UTC timestamp to timezone.

Parameters:

ts (Union[Column, str])
tz (str)

Return type:

static to_utc_timestamp(ts, tz)[source]

Convert timestamp to UTC.

Parameters:

ts (Union[Column, str])
tz (str)

Return type:

static parse_url(url, part)[source]

Extract part from URL.

Parameters:

url (Union[Column, str])
part (str)

Return type:

static url_encode(url)[source]

URL-encode string.

Parameters:: url (Union[Column, str])
Return type:: ColumnOperation

static url_decode(url)[source]

URL-decode string.

Parameters:: url (Union[Column, str])
Return type:: ColumnOperation
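As a point of reference, the round trip these two functions describe can be reproduced with Python's standard library. Spark encodes in the application/x-www-form-urlencoded style, where a space becomes `+`. That sparkless produces byte-for-byte identical output is an assumption.

```python
from urllib.parse import quote_plus, unquote_plus

original = "a b&c=d/e"

# Form-style encoding: space -> "+", reserved characters -> %XX escapes.
encoded = quote_plus(original)

# Decoding reverses both the "+" and the %XX escapes.
decoded = unquote_plus(encoded)
```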

static overlay(src, replace, pos, len=-1)[source]

Replace part of string (PySpark 3.0+).

Parameters:

src (Union[Column, str])
replace (Union[Column, str])
pos (Union[Column, int])
len (Union[Column, int])

Return type:

static date_part(field, source)[source]

Extract date/time part.

Parameters:

field (str)
source (Union[Column, str])

Return type:

static dayname(date)[source]

Get day of week name.

Parameters:: date (Union[Column, str])
Return type:: ColumnOperation

static assert_true(condition)[source]

Assert condition is true.

Parameters:: condition (Union[Column, ColumnOperation])
Return type:: ColumnOperation

static ifnull(col1, col2)[source]

Return col2 if col1 is null (PySpark 3.5+).


Parameters:

col1 (Union[Column, str])
col2 (Union[Column, str])

Return type:

static nullif(col1, col2)[source]

Return null if col1 equals col2 (PySpark 3.5+).

Parameters:

col1 (Union[Column, str])
col2 (Union[Column, str])

Return type:

static try_add(left, right)[source]

Null-safe addition - returns NULL on error (PySpark 3.5+).

Parameters:

left (Union[Column, str, int, float])
right (Union[Column, str, int, float])

Return type:

static try_subtract(left, right)[source]

Null-safe subtraction - returns NULL on error (PySpark 3.5+).

Parameters:

left (Union[Column, str, int, float])
right (Union[Column, str, int, float])

Return type:

static try_multiply(left, right)[source]

Null-safe multiplication - returns NULL on error (PySpark 3.5+).

Parameters:

left (Union[Column, str, int, float])
right (Union[Column, str, int, float])

Return type:

static try_divide(left, right)[source]

Null-safe division - returns NULL on error (PySpark 3.5+).

Parameters:

left (Union[Column, str, int, float])
right (Union[Column, str, int, float])

Return type:

static try_sum(column)[source]

Null-safe sum aggregate - returns NULL on error (PySpark 3.5+).

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static try_avg(column)[source]

Null-safe average aggregate - returns NULL on error (PySpark 3.5+).

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction
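The try_* functions above share one contract: an operation that would otherwise fail returns NULL instead. The plain-Python sketch below illustrates that contract for division and addition. The 64-bit overflow bound and the None-in, None-out rule are assumptions modelled on Spark's long arithmetic, and the function names are hypothetical.

```python
from typing import Optional, Union

Number = Union[int, float]
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def try_divide_sketch(left: Optional[Number], right: Optional[Number]) -> Optional[float]:
    """Return left / right, or None when an operand is null or the divisor is zero."""
    if left is None or right is None:
        return None
    try:
        return left / right
    except ZeroDivisionError:
        return None


def try_add_sketch(left: Optional[int], right: Optional[int]) -> Optional[int]:
    """Return left + right, or None when the sum would overflow a 64-bit integer."""
    if left is None or right is None:
        return None
    total = left + right
    return total if INT64_MIN <= total <= INT64_MAX else None
```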

static try_element_at(column, index)[source]

Null-safe element_at - returns NULL on error (PySpark 3.5+).

Parameters:

column (Union[Column, str])
index (Union[Column, str, int])

Return type:

static try_to_binary(column, format=None)[source]

Null-safe to_binary - returns NULL on error (PySpark 3.5+).

Parameters:

column (Union[Column, str])
format (Optional[str])

Return type:

static try_to_number(column, format=None)[source]

Null-safe to_number - returns NULL on error (PySpark 3.5+).

Parameters:

column (Union[Column, str])
format (Optional[str])

Return type:

static try_to_timestamp(column, format=None)[source]

Null-safe to_timestamp - returns NULL on error (PySpark 3.5+).

Parameters:

column (Union[Column, str])
format (Optional[str])

Return type:

static from_xml(col, schema)[source]

Parse XML string to struct.

Parameters:

col (Union[Column, str])
schema (str)

Return type:

static to_xml(col)[source]

Convert struct to XML string.

Parameters:: col (Union[Column, ColumnOperation])
Return type:: ColumnOperation

static schema_of_xml(col)[source]

Infer schema from XML.

Parameters:: col (Union[Column, str])
Return type:: ColumnOperation

static xpath(xml, path)[source]

Extract array from XML using XPath.

Parameters:

xml (Union[Column, str])
path (str)

Return type:

static xpath_boolean(xml, path)[source]

Extract boolean from XML using XPath.

Parameters:

xml (Union[Column, str])
path (str)

Return type:

static xpath_double(xml, path)[source]

Extract double from XML using XPath.

Parameters:

xml (Union[Column, str])
path (str)

Return type:

static xpath_float(xml, path)[source]

Extract float from XML using XPath.

Parameters:

xml (Union[Column, str])
path (str)

Return type:

static xpath_int(xml, path)[source]

Extract integer from XML using XPath.

Parameters:

xml (Union[Column, str])
path (str)

Return type:

static xpath_long(xml, path)[source]

Extract long from XML using XPath.

Parameters:

xml (Union[Column, str])
path (str)

Return type:

static xpath_short(xml, path)[source]

Extract short from XML using XPath.

Parameters:

xml (Union[Column, str])
path (str)

Return type:

static xpath_string(xml, path)[source]

Extract string from XML using XPath.

Parameters:

xml (Union[Column, str])
path (str)

Return type:

static from_json(column, schema, options=None)[source]

Parse JSON string into struct/array.

Parameters:

column (Union[Column, str])
schema (Any)
options (Optional[Dict[str, Any]])

Return type:

static to_json(column)[source]

Convert struct/array to JSON string.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static get_json_object(column, path)[source]

Extract JSON object at path.

Parameters:

column (Union[Column, str])
path (str)

Return type:

static json_tuple(column, *fields)[source]

Extract multiple fields from JSON.

Parameters:

column (Union[Column, str])
fields (str)

Return type:

static schema_of_json(json_string)[source]

Infer schema from JSON string.

Parameters:: json_string (str)
Return type:: ColumnOperation

static from_csv(column, schema, options=None)[source]

Parse CSV string into struct.

Parameters:

column (Union[Column, str])
schema (Any)
options (Optional[Dict[str, Any]])

Return type:

static to_csv(column)[source]

Convert struct to CSV string.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static schema_of_csv(csv_string)[source]

Infer schema from CSV string.

Parameters:: csv_string (str)
Return type:: ColumnOperation

static asc(column)[source]

Sort ascending.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static asc_nulls_first(column)[source]

Sort ascending, nulls first.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static asc_nulls_last(column)[source]

Sort ascending, nulls last.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static desc_nulls_first(column)[source]

Sort descending, nulls first.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static desc_nulls_last(column)[source]

Sort descending, nulls last.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation
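The null-placement variants above can be pictured as sorting the non-null values and then attaching the nulls at one end. The following plain-Python sketch shows that idea. `sort_with_nulls` is a hypothetical helper, not part of sparkless.

```python
from typing import Any, List, Sequence


def sort_with_nulls(
    values: Sequence[Any], descending: bool = False, nulls_first: bool = True
) -> List[Any]:
    """Sort values, placing None entries first or last regardless of direction."""
    nulls = [v for v in values if v is None]
    present = sorted((v for v in values if v is not None), reverse=descending)
    return nulls + present if nulls_first else present + nulls
```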

static input_file_name()[source]

Return input file name.

Return type:: ColumnOperation

static monotonically_increasing_id()[source]

Generate monotonically increasing ID.

Return type:: ColumnOperation

static spark_partition_id()[source]

Return partition ID.

Return type:: ColumnOperation

static broadcast(df)[source]

Mark DataFrame for broadcast (hint).

Parameters:: df (Any)
Return type:: Any

static column(col_name)[source]

Create column reference (alias for col).

Parameters:: col_name (str)
Return type:: Column

static grouping(column)[source]

Grouping indicator for CUBE/ROLLUP.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static grouping_id(*cols)[source]

Grouping ID for CUBE/ROLLUP.

Parameters:: cols (Union[Column, str])
Return type:: ColumnOperation

static udf(f=None, returnType=None)[source]

Create a user-defined function (all PySpark versions).

Parameters:

f (Optional[Callable[..., Any]]) – Python function to wrap, or DataType if used as decorator with returnType
returnType (Any) – Return type of the function (defaults to StringType)

Return type:

Callable[..., Any]

Returns:

Wrapped function that can be used in DataFrame operations

Example

>>> from sparkless.sql import SparkSession, functions as F
>>> from sparkless.spark_types import IntegerType
>>> spark = SparkSession("test")
>>> square = F.udf(lambda x: x * x, IntegerType())
>>> df = spark.createDataFrame([{"value": 5}])
>>> df.select(square("value").alias("squared")).show()

>>> # Decorator pattern:
>>> @F.udf(IntegerType())
... def square(x):
...     return x * x
>>> df.select(square("value")).show()

static pandas_udf(f=None, returnType=None, functionType=None)[source]

Create a Pandas UDF (vectorized UDF) (all PySpark versions).

Pandas UDFs are user-defined functions that execute vectorized operations using Pandas Series/DataFrame, providing better performance than row-at-a-time UDFs.

Parameters:

f (Optional[Any]) – Python function to wrap OR return type (if used as decorator)
returnType (Any) – Return type of the function (defaults to StringType)
functionType (Any) – Type of Pandas UDF (optional, for compatibility)

Return type:

Returns:

Wrapped function that can be used in DataFrame operations

Example

>>> from sparkless.sql import SparkSession, functions as F
>>> from sparkless.spark_types import IntegerType
>>> spark = SparkSession("test")
>>> @F.pandas_udf(IntegerType())
... def multiply_by_two(s):
...     return s * 2
>>> df = spark.createDataFrame([{"value": 5}])
>>> df.select(multiply_by_two("value").alias("doubled")).show()

static window(timeColumn, windowDuration, slideDuration=None, startTime=None)[source]

Create time-based window for grouping operations (all PySpark versions).

Parameters:

timeColumn (Union[Column, str]) – Timestamp column to window
windowDuration (str) – Duration string (e.g., “10 seconds”, “1 minute”, “2 hours”)
slideDuration (Optional[str]) – Slide duration for sliding windows (defaults to windowDuration)
startTime (Optional[str]) – Offset for window alignment (e.g., “0 seconds”)

Return type:

Returns:

Column representing window struct with start and end times

Example

>>> df.groupBy(F.window("timestamp", "10 minutes")).count()
>>> df.groupBy(F.window("timestamp", "10 minutes", "5 minutes")).agg(F.sum("value"))
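The bucketing behind these calls can be sketched in plain Python. The sketch assumes windows are half-open intervals [start, end) aligned to the Unix epoch plus an optional offset, as in Spark. How sparkless computes them internally is not stated here, and both helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
Window = Tuple[datetime, datetime]


def tumbling_window(
    ts: datetime, duration: timedelta, start_offset: timedelta = timedelta(0)
) -> Window:
    """Return the single [start, end) window that contains ts."""
    elapsed = (ts - EPOCH - start_offset) % duration  # time since the window opened
    start = ts - elapsed
    return start, start + duration


def sliding_windows(
    ts: datetime,
    duration: timedelta,
    slide: timedelta,
    start_offset: timedelta = timedelta(0),
) -> List[Window]:
    """Return every [start, end) window that contains ts, earliest first."""
    start = ts - ((ts - EPOCH - start_offset) % slide)  # latest window start <= ts
    windows = []
    while start + duration > ts:                         # window still covers ts
        windows.append((start, start + duration))
        start -= slide
    return sorted(windows)
```

With a 10-minute duration and a 5-minute slide, each timestamp falls into two overlapping windows, which is why a sliding-window aggregation can count one row more than once.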

static window_time(windowColumn)[source]

Extract window start time from window column (PySpark 3.4+).

Parameters:: windowColumn (Union[Column, str]) – Window column to extract time from
Return type:: ColumnOperation
Returns:: Column operation representing window start timestamp

Example

>>> df.groupBy(F.window("timestamp", "1 hour")).agg(
...     F.window_time(F.col("window")).alias("window_start")
... )

static ilike(column, pattern)[source]

Case-insensitive LIKE pattern matching.

Parameters:

column (Union[Column, str])
pattern (str)

Return type:

static find_in_set(column, str_list)[source]

Find position of value in comma-separated string list.

Parameters:

column (Union[Column, str])
str_list (Union[Column, str])

Return type:

static regexp_count(column, pattern)[source]

Count occurrences of regex pattern in string.

Parameters:

column (Union[Column, str])
pattern (str)

Return type:

static regexp_like(column, pattern)[source]

Regex pattern matching (similar to rlike).

Parameters:

column (Union[Column, str])
pattern (str)

Return type:

static regexp_substr(column, pattern, pos=1, occurrence=1)[source]

Extract substring matching regex pattern.

Parameters:

column (Union[Column, str])
pattern (str)
pos (int)
occurrence (int)

Return type:

static regexp_instr(column, pattern, pos=1, occurrence=1)[source]

Find position of regex pattern match.

Parameters:

column (Union[Column, str])
pattern (str)
pos (int)
occurrence (int)

Return type:

static regexp(column, pattern)[source]

Alias for rlike - regex pattern matching.

Parameters:

column (Union[Column, str])
pattern (str)

Return type:

static sentences(column, language=None, country=None)[source]

Split text into sentences.

Parameters:

column (Union[Column, str])
language (Optional[str])
country (Optional[str])

Return type:

static printf(format_str, *columns)[source]

Formatted string (like sprintf).

Parameters:

format_str (str)
columns (Union[Column, str])

Return type:

static to_char(column, format=None)[source]

Convert number/date to character string.

Parameters:

column (Union[Column, str])
format (Optional[str])

Return type:

static to_varchar(column, length=None)[source]

Convert to varchar type.

Parameters:

column (Union[Column, str])
length (Optional[int])

Return type:

static typeof(column)[source]

Get type of value as string.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static stack(n, *cols)[source]

Stack multiple columns into rows.

Parameters:

n (int)
cols (Union[Column, str, Any])

Return type:

static pmod(dividend, divisor)[source]

Positive modulo - returns a non-negative remainder for a positive divisor.

Parameters:

dividend (Union[Column, str, int])
divisor (Union[Column, str, int])

Return type:

static negate(column)[source]

Negate value (alias for negative).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation
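To show how pmod (documented above) differs from a truncating remainder, the sketch below contrasts the two in plain Python. The formula follows Spark's definition as commonly described, and it is an assumption that sparkless matches it. Both function names are hypothetical.

```python
import math


def truncated_mod(dividend: int, divisor: int) -> int:
    """Remainder that takes the sign of the dividend (as in Java's % operator)."""
    return int(math.fmod(dividend, divisor))


def pmod_sketch(dividend: int, divisor: int) -> int:
    """Shift a negative truncated remainder into the range of the divisor."""
    remainder = truncated_mod(dividend, divisor)
    if remainder < 0:
        return truncated_mod(remainder + divisor, divisor)
    return remainder
```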

static shiftleft(column, num_bits)[source]

Bitwise left shift.

Parameters:

column (Union[Column, str])
num_bits (Union[Column, str, int])

Return type:

static shiftright(column, num_bits)[source]

Bitwise right shift (signed).

Parameters:

column (Union[Column, str])
num_bits (Union[Column, str, int])

Return type:

static shiftrightunsigned(column, num_bits)[source]

Bitwise unsigned right shift.

Parameters:

column (Union[Column, str])
num_bits (Union[Column, str, int])

Return type:

static shiftLeft(column, num_bits)[source]

Deprecated alias for shiftleft (PySpark 3.0-3.1).

Parameters:

column (Union[Column, str])
num_bits (Union[Column, str, int])

Return type:

static shiftRight(column, num_bits)[source]

Deprecated alias for shiftright (PySpark 3.0-3.1).

Parameters:

column (Union[Column, str])
num_bits (Union[Column, str, int])

Return type:

static shiftRightUnsigned(column, num_bits)[source]

Deprecated alias for shiftrightunsigned (PySpark 3.0-3.1).

Parameters:

column (Union[Column, str])
num_bits (Union[Column, str, int])

Return type:

static years(column)[source]

Convert number to years interval.

Parameters:: column (Union[Column, str, int])
Return type:: ColumnOperation

static localtimestamp()[source]

Get local timestamp (without timezone).

Return type:: ColumnOperation

static dateadd(date_part, value, date)[source]

SQL Server style date addition.

Parameters:

date_part (str)
value (Union[Column, str, int])
date (Union[Column, str])

Return type:

static datepart(date_part, date)[source]

SQL Server style date part extraction.

Parameters:

date_part (str)
date (Union[Column, str])

Return type:

static make_timestamp(year, month, day, hour=0, minute=0, second=0)[source]

Create timestamp from components.

Parameters:

year (Union[Column, str, int])
month (Union[Column, str, int])
day (Union[Column, str, int])
hour (Union[Column, str, int])
minute (Union[Column, str, int])
second (Union[Column, str, int])

Return type:

static make_timestamp_ltz(year, month, day, hour=0, minute=0, second=0, timezone=None)[source]

Create timestamp with local timezone.

Parameters:

year (Union[Column, str, int])
month (Union[Column, str, int])
day (Union[Column, str, int])
hour (Union[Column, str, int])
minute (Union[Column, str, int])
second (Union[Column, str, int])
timezone (Optional[str])

Return type:

static make_timestamp_ntz(year, month, day, hour=0, minute=0, second=0)[source]

Create timestamp with no timezone.

Parameters:

year (Union[Column, str, int])
month (Union[Column, str, int])
day (Union[Column, str, int])
hour (Union[Column, str, int])
minute (Union[Column, str, int])
second (Union[Column, str, int])

Return type:

static make_interval(years=0, months=0, weeks=0, days=0, hours=0, mins=0, secs=0)[source]

Create interval from components.

Parameters:

years (Union[Column, str, int])
months (Union[Column, str, int])
weeks (Union[Column, str, int])
days (Union[Column, str, int])
hours (Union[Column, str, int])
mins (Union[Column, str, int])
secs (Union[Column, str, int])

Return type:

static make_dt_interval(days=0, hours=0, mins=0, secs=0)[source]

Create day-time interval.

Parameters:

days (Union[Column, str, int])
hours (Union[Column, str, int])
mins (Union[Column, str, int])
secs (Union[Column, str, int])

Return type:

static make_ym_interval(years=0, months=0)[source]

Create year-month interval.

Parameters:

years (Union[Column, str, int])
months (Union[Column, str, int])

Return type:

static to_number(column, format=None)[source]

Convert string to number.

Parameters:

column (Union[Column, str])
format (Optional[str])

Return type:

static to_binary(column, format=None)[source]

Convert to binary format.

Parameters:

column (Union[Column, str])
format (Optional[str])

Return type:

static to_unix_timestamp(column, format=None)[source]

Convert to unix timestamp.

Parameters:

column (Union[Column, str])
format (Optional[str])

Return type:

static unix_date(column)[source]

Convert unix timestamp to date.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static unix_seconds(column)[source]

Convert timestamp to unix seconds.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static unix_millis(column)[source]

Convert timestamp to unix milliseconds.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static unix_micros(column)[source]

Convert timestamp to unix microseconds.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static timestamp_millis(column)[source]

Create timestamp from unix milliseconds.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static timestamp_micros(column)[source]

Create timestamp from unix microseconds.

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation
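The unix_* and timestamp_* conversions above are inverse pairs around the Unix epoch. A plain-Python sketch using timezone-aware datetimes follows. Truncating toward coarser units by floor division is an assumption, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def unix_micros_sketch(ts: datetime) -> int:
    """Whole microseconds elapsed since the Unix epoch."""
    return (ts - EPOCH) // timedelta(microseconds=1)


def unix_millis_sketch(ts: datetime) -> int:
    """Whole milliseconds since the epoch (sub-millisecond part dropped)."""
    return unix_micros_sketch(ts) // 1_000


def unix_seconds_sketch(ts: datetime) -> int:
    """Whole seconds since the epoch (sub-second part dropped)."""
    return unix_micros_sketch(ts) // 1_000_000


def timestamp_micros_sketch(micros: int) -> datetime:
    """Inverse of unix_micros_sketch."""
    return EPOCH + timedelta(microseconds=micros)


def timestamp_millis_sketch(millis: int) -> datetime:
    """Inverse of unix_millis_sketch for millisecond-aligned timestamps."""
    return EPOCH + timedelta(milliseconds=millis)
```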

static regr_avgx(y, x)[source]

Linear regression average of x.

Parameters:

y (Union[Column, str])
x (Union[Column, str])

Return type:

static regr_avgy(y, x)[source]

Linear regression average of y.

Parameters:

y (Union[Column, str])
x (Union[Column, str])

Return type:

static regr_count(y, x)[source]

Linear regression count.

Parameters:

y (Union[Column, str])
x (Union[Column, str])

Return type:

static regr_intercept(y, x)[source]

Linear regression intercept.

Parameters:

y (Union[Column, str])
x (Union[Column, str])

Return type:

static regr_r2(y, x)[source]

Linear regression R-squared.

Parameters:

y (Union[Column, str])
x (Union[Column, str])

Return type:

static regr_slope(y, x)[source]

Linear regression slope.

Parameters:

y (Union[Column, str])
x (Union[Column, str])

Return type:

static regr_sxx(y, x)[source]

Linear regression sum of squares of x.

Parameters:

y (Union[Column, str])
x (Union[Column, str])

Return type:

static regr_sxy(y, x)[source]

Linear regression sum of products.

Parameters:

y (Union[Column, str])
x (Union[Column, str])

Return type:

static regr_syy(y, x)[source]

Linear regression sum of squares of y.

Parameters:

y (Union[Column, str])
x (Union[Column, str])

Return type:

static get(col, key)[source]

Get element from array by index or map by key.

Parameters:

col (Union[Column, str])
key (Union[Column, str, int, Any])

Return type:

static inline(col)[source]

Explode array of structs into rows.

Parameters:: col (Union[Column, str])
Return type:: ColumnOperation

static inline_outer(col)[source]

Explode array of structs into rows (outer join style).

Parameters:: col (Union[Column, str])
Return type:: ColumnOperation

static str_to_map(column, pair_delim=',', key_value_delim=':')[source]

Convert string to map using delimiters.

Parameters:

column (Union[Column, str])
pair_delim (Optional[str])
key_value_delim (Optional[str])

Return type:

static approxCountDistinct(*cols)[source]

Deprecated alias for approx_count_distinct (all PySpark versions).

Parameters:: cols (Union[Column, str])
Return type:: AggregateFunction

static sumDistinct(column)[source]

Deprecated alias for sum_distinct (all PySpark versions).

Parameters:: column (Union[Column, str])
Return type:: AggregateFunction

static bitwiseNOT(column)[source]

Deprecated alias for bitwise_not (all PySpark versions).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static toDegrees(column)[source]

Deprecated alias for degrees (all PySpark versions).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

__init__(*args, **kwargs)

Warn when Functions() is instantiated directly.

Parameters:

self (Any)
args (Any)
kwargs (Any)

static toRadians(column)[source]

Deprecated alias for radians (all PySpark versions).

Parameters:: column (Union[Column, str])
Return type:: ColumnOperation

static call_function(function_name, *columns)[source]

Dynamically invoke a function from the sparkless functions namespace.

Parameters:

function_name (str) – Name of the function to invoke (e.g. "upper").
*columns (Any) – Positional arguments passed to the resolved function.

Return type:

Returns:

Whatever the resolved function returns (typically a ColumnOperation).

Raises:

PySparkValueError – If the requested function is not registered.
PySparkTypeError – If the supplied arguments are incompatible with the resolved function signature.
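Name-based dispatch of this kind can be sketched with a lookup table. The registry below is purely illustrative: it uses two built-in callables as stand-ins and raises the built-in ValueError and TypeError, whereas sparkless is documented to raise PySparkValueError and PySparkTypeError. `REGISTRY` and `call_function_sketch` are hypothetical names.

```python
from typing import Any, Callable, Dict

# Stand-in registry; a real one would map names to column functions.
REGISTRY: Dict[str, Callable[..., Any]] = {
    "upper": str.upper,
    "length": len,
}


def call_function_sketch(function_name: str, *args: Any) -> Any:
    """Look up function_name and call it with the positional arguments."""
    try:
        func = REGISTRY[function_name]
    except KeyError:
        raise ValueError(f"unknown function: {function_name!r}") from None
    # A TypeError propagates if args do not fit the resolved signature.
    return func(*args)
```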

class sparkless.functions.functions.StringFunctions[source]

Bases: object

Collection of string manipulation functions.

static upper(column)[source]

Convert string to uppercase.

Parameters:: column (Union[Column, str]) – The column to convert.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the upper function.

static lower(column)[source]

Convert string to lowercase.

Parameters:: column (Union[Column, str]) – The column to convert.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the lower function.

static length(column)[source]

Get the length of a string.

Parameters:: column (Union[Column, str]) – The column to get length of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the length function.

static char_length(column)[source]

Alias for length() - Get the character length of a string (PySpark 3.5+).

Parameters:: column (Union[Column, str]) – The column to get length of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the char_length function.

static character_length(column)[source]

Alias for length() - Get the character length of a string (PySpark 3.5+).

Parameters:: column (Union[Column, str]) – The column to get length of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the character_length function.

static trim(column)[source]

Trim whitespace from string.

Parameters:: column (Union[Column, str]) – The column to trim.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the trim function.

static ltrim(column)[source]

Trim whitespace from left side of string.

Parameters:: column (Union[Column, str]) – The column to trim.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the ltrim function.

static rtrim(column)[source]

Trim whitespace from right side of string.

Parameters:: column (Union[Column, str]) – The column to trim.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the rtrim function.

static btrim(column, trim_string=None)[source]

Trim characters from both ends of string.

Parameters:

column (Union[Column, str]) – The column to trim.
trim_string (Optional[str]) – Optional string of characters to trim (default: whitespace).

Return type:

Returns:

ColumnOperation representing the btrim function.

static contains(column, substring)[source]

Check if string contains substring.

Parameters:

column (Union[Column, str]) – The column to check.
substring (str) – The substring to search for.

Return type:

Returns:

ColumnOperation representing the contains function.

static left(column, length)[source]

Extract left N characters from string.

Parameters:

column (Union[Column, str]) – The column to extract from.
length (int) – Number of characters to extract from the left.

Return type:

Returns:

ColumnOperation representing the left function.

static right(column, length)[source]

Extract right N characters from string.

Parameters:

column (Union[Column, str]) – The column to extract from.
length (int) – Number of characters to extract from the right.

Return type:

Returns:

ColumnOperation representing the right function.

static bit_length(column)[source]

Get bit length of string.

Parameters:: column (Union[Column, str]) – The column to get bit length of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the bit_length function.

static startswith(column, substring)[source]

Check if string starts with substring.

Parameters:

column (Union[Column, str]) – The column to check.
substring (str) – The substring to check for.

Return type:

Returns:

ColumnOperation representing the startswith function.

static endswith(column, substring)[source]

Check if string ends with substring.

Parameters:

column (Union[Column, str]) – The column to check.
substring (str) – The substring to check for.

Return type:

Returns:

ColumnOperation representing the endswith function.

static like(column, pattern)[source]

SQL LIKE pattern matching.

Parameters:

column (Union[Column, str]) – The column to match.
pattern (str) – The LIKE pattern (supports % and _ wildcards).

Return type:

Returns:

ColumnOperation representing the like function.
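The two wildcards can be understood by translating a LIKE pattern into a regular expression: `%` matches any run of characters and `_` matches exactly one. The sketch below does that in plain Python. It ignores escape characters, and it is not necessarily how sparkless evaluates like. The helper names are hypothetical.

```python
import re


def like_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a SQL LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # any sequence, including empty
        elif ch == "_":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def like_sketch(value: str, pattern: str) -> bool:
    """True when the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None
```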

static rlike(column, pattern)[source]

Regular expression pattern matching.

Parameters:

column (Union[Column, str]) – The column to match.
pattern (str) – The regular expression pattern.

Return type:

Returns:

ColumnOperation representing the rlike function.

static replace(column, old, new)[source]

Replace occurrences of substring in string.

Parameters:

column (Union[Column, str]) – The column to replace in.
old (str) – The substring to replace.
new (str) – The replacement substring.

Return type:

Returns:

ColumnOperation representing the replace function.

static substr(column, start, length=None)[source]

Alias for substring - Extract substring from string.

Parameters:

column (Union[Column, str]) – The column to extract from.
start (int) – Starting position (1-indexed).
length (Optional[int]) – Optional length of substring.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the substr function.

static split_part(column, delimiter, part)[source]

Extract part of string split by delimiter.

Parameters:

column (Union[Column, str]) – The column to split.
delimiter (str) – The delimiter to split on.
part (int) – The part number to extract (1-indexed).

Return type:: ColumnOperation

Returns:

ColumnOperation representing the split_part function.
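
The 1-indexed part number can be pictured with a small plain-Python helper. This is only a sketch of the semantics; the function below is hypothetical, handles positive part numbers only, and assumes that an out-of-range part yields an empty string (as Spark SQL documents), which may differ from what sparkless does.

```python
def split_part(text: str, delimiter: str, part: int) -> str:
    """Return the part-th piece (1-indexed) of text split on delimiter."""
    if part < 1:
        raise ValueError("this sketch only handles part >= 1")
    pieces = text.split(delimiter)
    # Assumption: asking for a piece past the end gives an empty string.
    return pieces[part - 1] if part <= len(pieces) else ""
```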

static position(substring, column)[source]

Find position of substring in string (1-indexed).

Parameters:

substring (Union[Column, str]) – The substring to search for.
column (Union[Column, str]) – The column to search in.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the position function.

static octet_length(column)[source]

Get byte length (octet length) of string.

Parameters:: column (Union[Column, str]) – The column to get byte length of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the octet_length function.
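
Byte length and bit length differ from character count as soon as a string contains non-ASCII characters. The plain-Python sketch below assumes UTF-8 encoding (an assumption, not something stated above) and uses hypothetical helper names.

```python
def octet_length(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of text (assumed encoding)."""
    return len(text.encode("utf-8"))


def bit_length(text: str) -> int:
    """Number of bits: eight per encoded byte."""
    return 8 * octet_length(text)
```

Under this assumption "abc" is 3 bytes (24 bits), while "é" is one character but 2 bytes (16 bits).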

static char(column)[source]

Convert integer to character.

Parameters:: column (Union[Column, str]) – The column containing integer values.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the char function.

static ucase(column)[source]

Alias for upper - Convert string to uppercase.

Parameters:: column (Union[Column, str]) – The column to convert.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the ucase function.

static lcase(column)[source]

Alias for lower - Convert string to lowercase.

Parameters:: column (Union[Column, str]) – The column to convert.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the lcase function.

static elt(n, *columns)[source]

Return element at index from list of columns.

Parameters:

n (Union[Column, int]) – The index (1-indexed).
*columns (Union[Column, str]) – The columns to choose from.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the elt function.
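
As a rough illustration of the 1-indexed lookup, here is a plain-Python sketch operating on ordinary values rather than columns. The helper is hypothetical, and returning None for an out-of-range index is an assumption.

```python
from typing import Optional


def elt(n: int, *values: str) -> Optional[str]:
    """Return the n-th value (1-indexed) from the given values."""
    if 1 <= n <= len(values):
        return values[n - 1]
    return None  # assumption: an out-of-range index yields null
```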

static regexp_replace(column, pattern, replacement)[source]

Replace regex pattern in string.

Parameters:

column (Union[Column, str]) – The column to replace in.
pattern (str) – The regex pattern to match.
replacement (str) – The replacement string.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the regexp_replace function.

static split(column, delimiter, limit=None)[source]

Split string by delimiter.

Parameters:

column (Union[Column, str]) – The column to split.
delimiter (str) – The delimiter to split on.
limit (Optional[int]) – Optional limit on the number of times the pattern is applied. If None or -1, no limit (default PySpark behavior).

Return type:: ColumnOperation

Returns:

ColumnOperation representing the split function.

static substring(column, start, length=None)[source]

Extract substring from string.

Parameters:

column (Union[Column, str]) – The column to extract from.
start (int) – Starting position (1-indexed).
length (Optional[int]) – Optional length of substring.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the substring function.
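
Because start is 1-indexed, it maps to a Python slice offset of start - 1. The sketch below is a hypothetical plain-Python helper that covers only start >= 1; zero and negative start values are left out on purpose.

```python
from typing import Optional


def substring(text: str, start: int, length: Optional[int] = None) -> str:
    """Extract a substring using a 1-indexed start position."""
    if start < 1:
        raise ValueError("this sketch only handles start >= 1")
    begin = start - 1                       # convert to 0-indexed offset
    end = None if length is None else begin + length
    return text[begin:end]
```

For the string "Spark SQL", start=1 with length=5 selects "Spark", and start=7 with no length selects "SQL".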

static concat(*columns)[source]

Concatenate multiple strings.

Parameters:: *columns (Union[Column, str]) – Columns or strings to concatenate.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the concat function.

static format_string(format_str, *columns)[source]

Format string using printf-style format string.

Parameters:

format_str (str) – The format string (e.g., “Hello %s, you are %d years old”).
*columns (Union[Column, str]) – Columns to use as format arguments.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the format_string function.
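
For the common %s and %d placeholders, Python's own % operator behaves closely enough to illustrate the idea. This is a sketch on plain values with a hypothetical helper name; other conversion specifiers may not behave identically.

```python
def format_string(format_str: str, *values: object) -> str:
    """Apply printf-style formatting to the given values."""
    return format_str % values
```

Each placeholder consumes one argument in order, so the number of arguments has to match the number of placeholders.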

static translate(column, matching_string, replace_string)[source]

Translate characters in string using character mapping.

Parameters:

column (Union[Column, str]) – The column to translate.
matching_string (str) – Characters to match.
replace_string (str) – Characters to replace with (must be same length as matching_string).

Return type:: ColumnOperation

Returns:

ColumnOperation representing the translate function.
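
The character mapping pairs the i-th character of matching_string with the i-th character of replace_string. A plain-Python sketch using str.maketrans (the helper itself is hypothetical) looks like this:

```python
def translate(text: str, matching_string: str, replace_string: str) -> str:
    """Replace each character in matching_string with its counterpart."""
    if len(matching_string) != len(replace_string):
        raise ValueError("matching_string and replace_string must be the same length")
    table = str.maketrans(matching_string, replace_string)
    return text.translate(table)
```

With matching_string "rnl" and replace_string "123", the input "translate" becomes "t1a2s3ate".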

static ascii(column)[source]

Get ASCII value of first character in string.

Parameters:: column (Union[Column, str]) – The column to get ASCII value of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the ascii function.

static base64(column)[source]

Encode string to base64.

Parameters:: column (Union[Column, str]) – The column to encode.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the base64 function.

static unbase64(column)[source]

Decode base64 string.

Parameters:: column (Union[Column, str]) – The column to decode.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the unbase64 function.
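
The encode/decode pair can be sketched with Python's standard base64 module. The helper names are hypothetical, and encoding the input string as UTF-8 before base64 is an assumption.

```python
import base64 as b64


def base64_encode(text: str) -> str:
    """Encode a string's UTF-8 bytes (assumed encoding) as base64 text."""
    return b64.b64encode(text.encode("utf-8")).decode("ascii")


def base64_decode(encoded: str) -> bytes:
    """Decode base64 text back into raw bytes."""
    return b64.b64decode(encoded)


print(base64_encode("Spark"))      # U3Bhcms=
print(base64_decode("U3Bhcms="))   # b'Spark'
```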

static regexp_extract_all(column, pattern, idx=0)[source]

Extract all matches of a regex pattern.

Parameters:

column (Union[Column, str]) – The column to extract from.
pattern (str) – The regex pattern to match.
idx (int) – Group index to extract (default: 0 for entire match).

Return type:: ColumnOperation

Returns:

ColumnOperation representing the regexp_extract_all function.

Example

>>> df.select(F.regexp_extract_all(F.col("text"), r"\d+", 0))

static array_join(column, delimiter, null_replacement=None)[source]

Join array elements with a delimiter.

Parameters:

column (Union[Column, str]) – The array column to join.
delimiter (str) – The delimiter to use for joining.
null_replacement (Optional[str]) – Optional string to replace nulls with.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the array_join function.

Example

>>> df.select(F.array_join(F.col("tags"), ", "))
>>> df.select(F.array_join(F.col("tags"), "|", "N/A"))
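
The effect of null_replacement is easiest to see on a plain Python list. The sketch below is a hypothetical helper; it assumes that nulls are skipped when no replacement is given, which matches PySpark's documented behavior but is not stated above.

```python
from typing import List, Optional


def array_join(items: List[Optional[str]], delimiter: str,
               null_replacement: Optional[str] = None) -> str:
    """Join items with delimiter, handling None entries."""
    if null_replacement is None:
        # Assumption: nulls are dropped when no replacement is supplied.
        kept = [item for item in items if item is not None]
    else:
        kept = [null_replacement if item is None else item for item in items]
    return delimiter.join(kept)
```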

static reverse(column)[source]

Reverse a string column.

Parameters:: column (Union[Column, str]) – The string column to reverse.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the reverse function.

Example

>>> df.select(F.reverse(F.col("name")))

static repeat(column, n)[source]

Repeat a string n times.

Parameters:

column (Union[Column, str]) – The column to repeat.
n (int) – Number of times to repeat.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the repeat function.

Example

>>> df.select(F.repeat(F.col("text"), 3))

static initcap(column)[source]

Capitalize first letter of each word.

Parameters:: column (Union[Column, str]) – The column to capitalize.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the initcap function.

Example

>>> df.select(F.initcap(F.col("name")))

static soundex(column)[source]

Soundex encoding for phonetic matching.

Parameters:: column (Union[Column, str]) – The column to encode.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the soundex function.

Example

>>> df.select(F.soundex(F.col("name")))

static parse_url(url, part)[source]

Extract a part from a URL.

Parameters:

url (Union[Column, str]) – URL column or string.
part (str) – Part to extract (HOST, PATH, QUERY, REF, PROTOCOL, FILE, AUTHORITY, USERINFO).

Return type:: ColumnOperation

Returns:

ColumnOperation representing the parse_url function.

Example

>>> df.select(F.parse_url(F.col("url"), "HOST"))
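
To make the part names concrete, the following plain-Python sketch maps each one onto Python's urllib.parse.urlsplit. The mapping is an illustrative assumption (in particular FILE as path plus query, and None for missing parts), not a description of sparkless internals.

```python
from typing import Optional
from urllib.parse import urlsplit


def parse_url(url: str, part: str) -> Optional[str]:
    """Return the requested part of url, or None when it is absent."""
    u = urlsplit(url)
    # Everything before the last "@" in the authority is the user info.
    userinfo = u.netloc.rpartition("@")[0] if "@" in u.netloc else None
    file_part = u.path + ("?" + u.query if u.query else "")
    parts = {
        "HOST": u.hostname,
        "PATH": u.path,
        "QUERY": u.query or None,
        "REF": u.fragment or None,
        "PROTOCOL": u.scheme,
        "FILE": file_part,
        "AUTHORITY": u.netloc,
        "USERINFO": userinfo,
    }
    return parts.get(part)
```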

static url_encode(url)[source]

URL-encode a string.

Parameters:: url (Union[Column, str]) – String column to encode.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the url_encode function.

Example

>>> df.select(F.url_encode(F.col("text")))

static url_decode(url)[source]

URL-decode a string.

Parameters:: url (Union[Column, str]) – String column to decode.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the url_decode function.

Example

>>> df.select(F.url_decode(F.col("encoded")))
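
URL encoding and decoding are inverse operations. The sketch below uses Python's urllib.parse with form-style encoding (space becomes +), which is an assumption about the encoding flavor; the helper names are hypothetical.

```python
from urllib.parse import quote_plus, unquote_plus


def url_encode(text: str) -> str:
    """Percent-encode text, writing spaces as '+' (form-style, assumed)."""
    return quote_plus(text)


def url_decode(encoded: str) -> str:
    """Reverse url_encode."""
    return unquote_plus(encoded)


print(url_encode("a b&c=d"))  # a+b%26c%3Dd
```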

static concat_ws(sep, *cols)[source]

Concatenate multiple columns with a separator.

Parameters:

sep (str) – Separator string
*cols (Union[Column, str]) – Columns to concatenate

Return type:: ColumnOperation

Returns:

ColumnOperation representing concat_ws

Example

>>> df.select(F.concat_ws("-", F.col("first"), F.col("last")))
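
A plain-Python sketch of separator-based concatenation follows. The helper is hypothetical, and skipping None values mirrors PySpark's documented behavior; whether sparkless does the same is an assumption.

```python
from typing import Optional


def concat_ws(sep: str, *values: Optional[str]) -> str:
    """Join the non-null values with sep."""
    # Assumption: null (None) inputs are skipped rather than rendered.
    return sep.join(value for value in values if value is not None)
```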

static regexp_extract(column, pattern, idx=0)[source]

Extract a specific group matched by a regex pattern.

Parameters:

column (Union[Column, str]) – Input column
pattern (str) – Regular expression pattern. Supports lookahead (?=…) and lookbehind (?<=…) assertions via Python fallback when Polars native support is unavailable.
idx (int) – Group index to extract (default 0)

Return type:: ColumnOperation

Returns:

ColumnOperation representing regexp_extract

Example

>>> df.select(F.regexp_extract(F.col("email"), r"(.+)@(.+)", 1))
>>> df.select(F.regexp_extract(F.col("text"), r"(?<=prefix_)\w+", 0))

Note

Fixed in version 3.23.0 (Issue #228): Added fallback support for regex patterns with lookahead and lookbehind assertions using Python’s re module when Polars native support is unavailable.
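
Python's re module accepts lookahead and lookbehind assertions directly, which is what makes it usable as a fallback. The sketch below shows the group-extraction logic on a single string; it is a hypothetical helper, not the sparkless fallback itself, and returning an empty string on no match is an assumption taken from PySpark's behavior.

```python
import re


def regexp_extract(text: str, pattern: str, idx: int = 0) -> str:
    """Return group idx of the first match of pattern in text."""
    match = re.search(pattern, text)
    if match is None:
        return ""  # assumption: no match yields an empty string
    return match.group(idx)
```

Group 0 is the entire match, so a lookbehind such as (?<=prefix_) narrows what is returned without needing a capture group.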

static substring_index(column, delim, count)[source]

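Return the substring before count occurrences of the delimiter (counting from the left) when count is positive, or the substring after the last |count| occurrences (counting from the right) when count is negative.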

Parameters:

column (Union[Column, str]) – Input string column
delim (str) – Delimiter string
count (int) – Number of delimiter occurrences to count (positive counts from the left, negative counts from the right)

Return type:: ColumnOperation

Returns:

ColumnOperation representing substring_index

Example

>>> df.select(F.substring_index(F.col("path"), "/", 2))
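
The positive/negative count rule can be sketched in plain Python with split and join. The helper is hypothetical; returning an empty string for count == 0 or an empty delimiter is an assumption.

```python
def substring_index(text: str, delim: str, count: int) -> str:
    """Keep the pieces left of the count-th delimiter, or right of it if negative."""
    if count == 0 or delim == "":
        return ""  # assumption for the degenerate cases
    pieces = text.split(delim)
    if count > 0:
        return delim.join(pieces[:count])   # everything before the count-th delimiter
    return delim.join(pieces[count:])       # everything after the count-th from the right
```

If count exceeds the number of delimiters present, the slices cover every piece and the whole string is returned.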

static format_number(column, d)[source]

Format number with d decimal places and thousands separator.

Parameters:

column (Union[Column, str]) – Numeric column
d (int) – Number of decimal places

Return type:: ColumnOperation

Returns:

ColumnOperation representing format_number

Example

>>> df.select(F.format_number(F.col("amount"), 2))
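
Python's format mini-language produces the same style of output (comma thousands separator, fixed decimals), so it serves as a quick sketch of the result shape. The helper is hypothetical, and rounding at exact ties may differ between engines.

```python
def format_number(value: float, d: int) -> str:
    """Format value with a thousands separator and d decimal places."""
    return f"{value:,.{d}f}"


print(format_number(1234567.891, 2))  # 1,234,567.89
print(format_number(5, 4))            # 5.0000
```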

static instr(column, substr)[source]

Locate the position of the first occurrence of substr (1-indexed).

Parameters:

column (Union[Column, str]) – Input string column
substr (str) – Substring to locate

Return type:: ColumnOperation

Returns:

ColumnOperation representing instr

Example

>>> df.select(F.instr(F.col("text"), "spark"))

static locate(substr, column, pos=1)[source]

Locate the position of substr starting from pos (1-indexed).

Parameters:

substr (str) – Substring to locate
column (Union[Column, str]) – Input string column
pos (int) – Starting position (default 1)

Return type:: ColumnOperation

Returns:

ColumnOperation representing locate

Example

>>> df.select(F.locate("spark", F.col("text"), 1))
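
The 1-indexed position convention maps neatly onto Python's 0-indexed str.find. The sketch below uses hypothetical helpers and assumes that 0 means "not found", as in PySpark; note the swapped argument order between locate and instr.

```python
def locate(substr: str, text: str, pos: int = 1) -> int:
    """1-indexed position of substr in text, searching from pos; 0 if absent."""
    if pos < 1:
        return 0  # assumption: invalid start positions find nothing
    # str.find returns -1 when absent, so adding 1 yields 0.
    return text.find(substr, pos - 1) + 1


def instr(text: str, substr: str) -> int:
    """Same search from the start, with the arguments in the other order."""
    return locate(substr, text, 1)
```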

static lpad(column, len, pad)[source]

Left-pad string column to length len with pad string.

Parameters:

column (Union[Column, str]) – Input string column
len (int) – Target length
pad (str) – Padding string

Return type:: ColumnOperation

Returns:

ColumnOperation representing lpad

Example

>>> df.select(F.lpad(F.col("id"), 5, "0"))

static rpad(column, len, pad)[source]

Right-pad string column to length len with pad string.

Parameters:

column (Union[Column, str]) – Input string column
len (int) – Target length
pad (str) – Padding string

Return type:: ColumnOperation

Returns:

ColumnOperation representing rpad

Example

>>> df.select(F.rpad(F.col("id"), 5, "0"))
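
Padding to a target length has two edge cases worth seeing: a multi-character pad string is repeated and cut to fit, and an input longer than the target is truncated. Both follow Spark's documented behavior and are assumptions here; the helpers are hypothetical plain-Python sketches.

```python
def _fill(pad: str, needed: int) -> str:
    """Repeat pad and cut it to exactly `needed` characters."""
    return (pad * needed)[:needed] if pad else ""


def lpad(text: str, length: int, pad: str) -> str:
    """Left-pad text to length; truncate if text is already longer."""
    if length <= 0:
        return ""
    if len(text) >= length:
        return text[:length]
    return _fill(pad, length - len(text)) + text


def rpad(text: str, length: int, pad: str) -> str:
    """Right-pad text to length; truncate if text is already longer."""
    if length <= 0:
        return ""
    if len(text) >= length:
        return text[:length]
    return text + _fill(pad, length - len(text))
```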

static levenshtein(left, right)[source]

Compute Levenshtein distance between two strings.

Parameters:

left (Union[Column, str]) – First string column
right (Union[Column, str]) – Second string column

Return type:: ColumnOperation

Returns:

ColumnOperation representing levenshtein

Example

>>> df.select(F.levenshtein(F.col("word1"), F.col("word2")))
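
Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A standard dynamic-programming sketch in plain Python (not the sparkless implementation) is:

```python
def levenshtein(left: str, right: str) -> int:
    """Edit distance between left and right using two rolling rows."""
    previous = list(range(len(right) + 1))  # distance from "" to each prefix of right
    for i, left_char in enumerate(left, start=1):
        current = [i]                        # distance from left[:i] to ""
        for j, right_char in enumerate(right, start=1):
            cost = 0 if left_char == right_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```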

static overlay(src, replace, pos, len=-1)[source]

Replace part of a string with another string starting at a position (PySpark 3.0+).

Parameters:

src (Union[Column, str]) – Source string column
replace (Union[Column, str]) – Replacement string
pos (Union[Column, int]) – Starting position (1-indexed)
len (Union[Column, int]) – Length to replace (default -1 means to end of string)

Return type:: ColumnOperation

Returns:

ColumnOperation for overlay operation

Example

>>> df.select(F.overlay(F.col("text"), F.lit("NEW"), F.lit(5), F.lit(3)))

static bin(column)[source]

Convert to binary string representation.

Parameters:: column (Union[Column, str]) – Numeric column
Return type:: ColumnOperation
Returns:: ColumnOperation representing bin

static hex(column)[source]

Convert to hexadecimal string.

Parameters:: column (Union[Column, str]) – Column to convert
Return type:: ColumnOperation
Returns:: ColumnOperation representing hex

static unhex(column)[source]

Convert hex string to binary.

Parameters:: column (Union[Column, str]) – Hex string column
Return type:: ColumnOperation
Returns:: ColumnOperation representing unhex

static hash(*cols)[source]

Compute hash value of given columns.

Parameters:: *cols (Union[Column, str]) – Columns to hash
Return type:: ColumnOperation
Returns:: ColumnOperation representing hash

static xxhash64(*cols)[source]

Compute xxHash64 value of given columns (all PySpark versions).

Parameters:: *cols (Union[Column, str]) – Columns to hash
Return type:: ColumnOperation
Returns:: ColumnOperation representing xxhash64

static encode(column, charset)[source]

Encode string to binary using charset.

Parameters:

column (Union[Column, str]) – String column
charset (str) – Character set (e.g., ‘UTF-8’)

Return type:: ColumnOperation

Returns:

ColumnOperation representing encode

static decode(column, charset)[source]

Decode binary to string using charset.

Parameters:

column (Union[Column, str]) – Binary column
charset (str) – Character set (e.g., ‘UTF-8’)

Return type:: ColumnOperation

Returns:

ColumnOperation representing decode

static conv(column, from_base, to_base)[source]

Convert number from one base to another.

Parameters:

column (Union[Column, str]) – Number column
from_base (int) – Source base (2-36)
to_base (int) – Target base (2-36)

Return type:: ColumnOperation

Returns:

ColumnOperation representing conv
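
Base conversion amounts to parsing the digits in from_base and re-emitting them in to_base. The plain-Python sketch below is a hypothetical helper that handles non-negative values only and emits uppercase digits (an assumption).

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def conv(number: str, from_base: int, to_base: int) -> str:
    """Convert a non-negative number string between bases 2 and 36."""
    for base in (from_base, to_base):
        if not 2 <= base <= 36:
            raise ValueError("bases must be between 2 and 36")
    value = int(number, from_base)       # parse using the source base
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    if value == 0:
        return "0"
    out = []
    while value:
        value, remainder = divmod(value, to_base)
        out.append(DIGITS[remainder])    # least significant digit first
    return "".join(reversed(out))
```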

static md5(column)[source]

Calculate MD5 hash of string (PySpark 3.0+).

Parameters:: column (Union[Column, str]) – String column to hash
Return type:: ColumnOperation
Returns:: ColumnOperation representing md5 function (returns 32-char hex string)

Example

>>> df.select(F.md5(F.col("text")))

static sha1(column)[source]

Calculate SHA-1 hash of string (PySpark 3.0+).

Parameters:: column (Union[Column, str]) – String column to hash
Return type:: ColumnOperation
Returns:: ColumnOperation representing sha1 function (returns 40-char hex string)

Example

>>> df.select(F.sha1(F.col("text")))

static sha2(column, numBits)[source]

Calculate SHA-2 family hash (PySpark 3.0+).

Parameters:

column (Union[Column, str]) – String column to hash
numBits (int) – Bit length - 224, 256, 384, or 512

Return type:: ColumnOperation

Returns:

ColumnOperation representing sha2 function (returns hex string)

Example

>>> df.select(F.sha2(F.col("text"), 256))
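
The hex-string lengths quoted for md5, sha1, and sha2 follow from the digest sizes (4 bits per hex character). The sketch below uses Python's hashlib on a single string; the helper names are hypothetical, and hashing the UTF-8 bytes of the input is an assumption.

```python
import hashlib

_SHA2 = {
    224: hashlib.sha224,
    256: hashlib.sha256,
    384: hashlib.sha384,
    512: hashlib.sha512,
}


def md5_hex(text: str) -> str:
    """MD5 digest as a 32-character hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def sha1_hex(text: str) -> str:
    """SHA-1 digest as a 40-character hex string."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def sha2_hex(text: str, num_bits: int) -> str:
    """SHA-2 digest as a hex string of num_bits / 4 characters."""
    if num_bits not in _SHA2:
        raise ValueError("num_bits must be 224, 256, 384, or 512")
    return _SHA2[num_bits](text.encode("utf-8")).hexdigest()
```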

static crc32(column)[source]

Calculate CRC32 checksum (PySpark 3.0+).

Parameters:: column (Union[Column, str]) – String column to checksum
Return type:: ColumnOperation
Returns:: ColumnOperation representing crc32 function (returns signed 32-bit int)

Example

>>> df.select(F.crc32(F.col("text")))

static to_str(column)[source]

Convert column to string representation (all PySpark versions).

Parameters:: column (Union[Column, str]) – Column to convert to string
Return type:: ColumnOperation
Returns:: ColumnOperation representing the string conversion.

Example

>>> df.select(F.to_str(F.col("value")))

static ilike(column, pattern)[source]

Case-insensitive LIKE pattern matching.

Parameters:

column (Union[Column, str]) – The column to match against.
pattern (str) – The pattern to match (SQL LIKE pattern).

Return type:: ColumnOperation

Returns:

ColumnOperation representing the ilike function.

static find_in_set(column, str_list)[source]

Find position of value in comma-separated string list.

Parameters:

column (Union[Column, str]) – The value to find.
str_list (Union[Column, str]) – The comma-separated string list.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the find_in_set function.
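
A plain-Python sketch of the lookup is shown below. The helper is hypothetical; the 1-indexed result, the 0 for "not found", and the 0 when the value itself contains a comma are assumptions borrowed from Spark SQL's find_in_set.

```python
def find_in_set(value: str, str_list: str) -> int:
    """1-indexed position of value in a comma-separated list, or 0."""
    if "," in value:
        return 0  # assumption: a value containing a comma never matches
    items = str_list.split(",")
    return items.index(value) + 1 if value in items else 0
```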

static regexp_count(column, pattern)[source]

Count occurrences of regex pattern in string.

Parameters:

column (Union[Column, str]) – The column to search in.
pattern (str) – The regex pattern to count.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the regexp_count function.

static regexp_like(column, pattern)[source]

Regex pattern matching (similar to rlike).

Parameters:

column (Union[Column, str]) – The column to match against.
pattern (str) – The regex pattern to match.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the regexp_like function.

static regexp_substr(column, pattern, pos=1, occurrence=1)[source]

Extract substring matching regex pattern.

Parameters:

column (Union[Column, str]) – The column to extract from.
pattern (str) – The regex pattern to match.
pos (int) – Starting position (1-indexed).
occurrence (int) – Which occurrence to extract.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the regexp_substr function.

static regexp_instr(column, pattern, pos=1, occurrence=1)[source]

Find position of regex pattern match.

Parameters:

column (Union[Column, str]) – The column to search in.
pattern (str) – The regex pattern to find.
pos (int) – Starting position (1-indexed).
occurrence (int) – Which occurrence to find.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the regexp_instr function.

static regexp(column, pattern)[source]

Alias for rlike - regex pattern matching.

Parameters:

column (Union[Column, str]) – The column to match against.
pattern (str) – The regex pattern to match.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the regexp function.

static sentences(column, language=None, country=None)[source]

Split text into sentences.

Parameters:

column (Union[Column, str]) – The column containing text.
language (Optional[str]) – Language code (optional).
country (Optional[str]) – Country code (optional).

Return type:: ColumnOperation

Returns:

ColumnOperation representing the sentences function.

static printf(format_str, *columns)[source]

Format a string printf-style (like sprintf).

Parameters:

format_str (str) – Format string with placeholders.
*columns (Union[Column, str]) – Columns to format.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the printf function.

static to_char(column, format=None)[source]

Convert number/date to character string.

Parameters:

column (Union[Column, str]) – The column to convert.
format (Optional[str]) – Optional format string.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the to_char function.

static to_varchar(column, length=None)[source]

Convert to varchar type.

Parameters:

column (Union[Column, str]) – The column to convert.
length (Optional[int]) – Optional length for varchar.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the to_varchar function.

static typeof(column)[source]

Get type of value as string.

Parameters:: column (Union[Column, str]) – The column to get type of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the typeof function.

static stack(n, *cols)[source]

Stack multiple columns into rows.

Parameters:

n (int) – Number of rows to create per input row.
*cols (Union[Column, str, Any]) – Columns to stack.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the stack function.
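
Stacking reshapes a flat list of values into n rows. The plain-Python sketch below is a hypothetical helper working on values for one input row; padding a short final row with None follows Spark SQL's stack and is an assumption here.

```python
from typing import Any, List, Tuple


def stack(n: int, *values: Any) -> List[Tuple[Any, ...]]:
    """Split values into n rows of equal width, padding with None."""
    if n < 1:
        raise ValueError("n must be at least 1")
    width = -(-len(values) // n)  # ceiling division: columns per row
    padded = list(values) + [None] * (n * width - len(values))
    return [tuple(padded[i * width:(i + 1) * width]) for i in range(n)]
```

For example, four values stacked into two rows give two pairs, while three values stacked into two rows leave the last slot as None.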

static sha(column)[source]

Alias for sha1 - Calculate SHA-1 hash of string (PySpark 3.5+).

Parameters:: column (Union[Column, str]) – String column to hash.
Return type:: ColumnOperation
Returns:: ColumnOperation representing sha function (returns 40-char hex string).

Example

>>> df.select(F.sha(F.col("text")))

static mask(column, upperChar=None, lowerChar=None, digitChar=None, otherChar=None)[source]

Mask sensitive data in a string (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – String column to mask.
upperChar (Optional[str]) – Character to use for uppercase letters (default: ‘X’).
lowerChar (Optional[str]) – Character to use for lowercase letters (default: ‘x’).
digitChar (Optional[str]) – Character to use for digits (default: ‘n’).
otherChar (Optional[str]) – Character to use for other characters (default: ‘-’).

Return type:: ColumnOperation

Returns:

ColumnOperation representing the mask function.

Example

>>> df.select(F.mask(F.col("email"), upperChar='U', lowerChar='l', digitChar='d'))

static json_array_length(column, path=None)[source]

Get the length of a JSON array (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – JSON column to get array length from.
path (Optional[str]) – Optional JSONPath expression to specify array location.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the json_array_length function.

Example

>>> df.select(F.json_array_length(F.col("json_col"), "$.array"))

static json_object_keys(column, path=None)[source]

Get the keys of a JSON object (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – JSON column to get object keys from.
path (Optional[str]) – Optional JSONPath expression to specify object location.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the json_object_keys function.

Example

>>> df.select(F.json_object_keys(F.col("json_col"), "$.object"))

static xpath_number(column, path)[source]

Extract number from XML using XPath (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – XML column to extract from.
path (str) – XPath expression.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the xpath_number function.

Example

>>> df.select(F.xpath_number(F.col("xml_col"), "/root/value"))

static user()[source]

Get current user name (PySpark 3.5+).

Return type:: ColumnOperation
Returns:: ColumnOperation representing the user function.

Example

>>> df.select(F.user())

class sparkless.functions.functions.MathFunctions[source]

Bases: object

Collection of mathematical functions.

static abs(column)[source]

Get absolute value.

Parameters:: column (Union[Column, str]) – The column to get absolute value of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the abs function.

static positive(column)[source]

Return the value unchanged (unary plus; identity function).

Parameters:: column (Union[Column, str]) – The column to return as positive.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the positive function.

static negative(column)[source]

Return negative value.

Parameters:: column (Union[Column, str]) – The column to negate.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the negative function.

static round(column, scale=0)[source]

Round to specified number of decimal places.

Parameters:

column (Union[Column, str]) – The column to round.
scale (int) – Number of decimal places (default: 0).

Return type:: ColumnOperation

Returns:

ColumnOperation representing the round function.

static ceil(column)[source]

Round up to nearest integer.

Parameters:: column (Union[Column, str]) – The column to round up.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the ceil function.

static ceiling(column)[source]

Alias for ceil - Round up to nearest integer.

Parameters:: column (Union[Column, str]) – The column to round up.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the ceiling function.

static floor(column)[source]

Round down to nearest integer.

Parameters:: column (Union[Column, str]) – The column to round down.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the floor function.

static sqrt(column)[source]

Get square root.

Parameters:: column (Union[Column, str]) – The column to get square root of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the sqrt function.

static exp(column)[source]

Get exponential (e^x).

Parameters:: column (Union[Column, str]) – The column to get exponential of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the exp function.

static log(base, column=None)[source]

Get logarithm.

PySpark signature: log(base, column) or log(column) for natural log.

Parameters:

base (Union[Column, str, float, int, None]) – Base for logarithm. Can be a float/int constant or Column. If column is None, base is treated as the column (natural log).
column (Union[Column, str, None]) – The column to get logarithm of. If None, base is the column (natural log).

Return type:: ColumnOperation

Returns:

ColumnOperation representing the log function.
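
The one-argument and two-argument forms can be confusing because the first positional argument changes meaning. The sketch below mirrors that dispatch on plain floats with Python's math module; it is a hypothetical helper, not the sparkless implementation.

```python
import math
from typing import Optional


def log(base: float, column: Optional[float] = None) -> float:
    """log(x) is the natural log of x; log(base, x) is log of x in that base."""
    if column is None:
        # Single argument: the value passed as `base` is really the input.
        return math.log(base)
    return math.log(column, base)
```

So log(8.0) is the natural logarithm of 8, whereas log(2, 8.0) is the base-2 logarithm of 8.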

static log10(column)[source]

Get base-10 logarithm (PySpark 3.0+).

Parameters:: column (Union[Column, str]) – The column to get log10 of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the log10 function.

Example

>>> df.select(F.log10(F.col("value")))

static log2(column)[source]

Get base-2 logarithm (PySpark 3.0+).

Parameters:: column (Union[Column, str]) – The column to get log2 of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the log2 function.

Example

>>> df.select(F.log2(F.col("value")))

static log1p(column)[source]

Get natural logarithm of (1 + x) (PySpark 3.0+).

Computes ln(1 + x) accurately for small values of x.

Parameters:: column (Union[Column, str]) – The column to compute log1p of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the log1p function.

Example

>>> df.select(F.log1p(F.col("value")))

static expm1(column)[source]

Get exp(x) - 1 (PySpark 3.0+).

Computes e^x - 1 accurately for small values of x.

Parameters:: column (Union[Column, str]) – The column to compute expm1 of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the expm1 function.

Example

>>> df.select(F.expm1(F.col("value")))
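
The accuracy claim for log1p and expm1 can be checked with Python's math module, which provides functions of the same names. For tiny x, computing 1 + x first discards most of x's digits, so the naive formulas lose precision. This is a numerical illustration only; the variable names are arbitrary.

```python
import math

x = 1e-10  # for such small x, both ln(1 + x) and e^x - 1 are very close to x

naive_log = math.log(1 + x)      # 1 + x is rounded before the log is taken
accurate_log = math.log1p(x)     # works on x directly

naive_exp = math.exp(x) - 1      # subtraction cancels most significant digits
accurate_exp = math.expm1(x)

print(abs(naive_log - x), abs(accurate_log - x))
print(abs(naive_exp - x), abs(accurate_exp - x))
```

In both cases the dedicated function lands much closer to the true value than the naive expression.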

static pow(column, exponent)[source]

Raise to power.

Parameters:

column (Union[Column, str]) – The column to raise to power.
exponent (Union[Column, float, int]) – The exponent.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the pow function.

static power(column, exponent)[source]

Alias for pow - Raise to power.

Parameters:

column (Union[Column, str]) – The column to raise to power.
exponent (Union[Column, float, int]) – The exponent.

Return type:: ColumnOperation

Returns:

ColumnOperation representing the power function.

static sin(column)[source]

Get sine.

Parameters:: column (Union[Column, str]) – The column to get sine of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the sin function.

static cos(column)[source]

Get cosine.

Parameters:: column (Union[Column, str]) – The column to get cosine of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the cos function.

static tan(column)[source]

Get tangent.

Parameters:: column (Union[Column, str]) – The column to get tangent of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the tan function.

static sign(column)[source]

Get sign of number (-1, 0, or 1).

Parameters:: column (Union[Column, str]) – The column to get sign of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the sign function.

static greatest(*columns)[source]

Get the greatest value among columns.

Parameters:: *columns (Union[Column, str]) – Columns to compare.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the greatest function.

static least(*columns)[source]

Get the least value among columns.

Parameters:: *columns (Union[Column, str]) – Columns to compare.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the least function.

static acosh(col)[source]

Compute inverse hyperbolic cosine (arc hyperbolic cosine).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the acosh function.

Note

Input must be >= 1. Returns NaN for invalid inputs.

static asinh(col)[source]

Compute inverse hyperbolic sine (arc hyperbolic sine).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the asinh function.

static atanh(col)[source]

Compute inverse hyperbolic tangent (arc hyperbolic tangent).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the atanh function.

Note

Input must be in range (-1, 1). Returns NaN for invalid inputs.

static acos(col)[source]

Compute inverse cosine (arc cosine).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the acos function.

static asin(col)[source]

Compute inverse sine (arc sine).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the asin function.

static atan(col)[source]

Compute inverse tangent (arc tangent).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the atan function.

static atan2(y, x)[source]

Compute 2-argument arctangent (PySpark 3.0+).

Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).

Parameters:

y (Union[Column, str, float, int]) – Y coordinate (column or numeric value).
x (Union[Column, str, float, int]) – X coordinate (column or numeric value).

Return type:: ColumnOperation

Returns:

ColumnOperation representing the atan2 function.

Example

>>> df.select(F.atan2(F.col("y"), F.col("x")))
>>> df.select(F.atan2(F.lit(1.0), F.lit(1.0)))  # Returns π/4
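
The reason for taking y and x separately, instead of computing atan(y / x), is that the signs of both arguments determine the quadrant. A short check with Python's math.atan2, which follows the same (y, x) argument order:

```python
import math

first_quadrant = math.atan2(1.0, 1.0)     # point (x=1, y=1)
second_quadrant = math.atan2(1.0, -1.0)   # point (x=-1, y=1)
one_argument = math.atan(1.0 / -1.0)      # quadrant information is lost

print(first_quadrant, second_quadrant, one_argument)
```

The first value is π/4 and the second is 3π/4, while the one-argument form returns -π/4 for the same point as the second case.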

static cosh(col)[source]

Compute hyperbolic cosine.

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the cosh function.

static sinh(col)[source]

Compute hyperbolic sine.

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the sinh function.

static tanh(col)[source]

Compute hyperbolic tangent.

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the tanh function.

static degrees(col)[source]

Convert radians to degrees.

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the degrees function.

static radians(col)[source]

Convert degrees to radians.

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the radians function.

static cbrt(col)[source]

Compute cube root.

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the cbrt function.

static factorial(col)[source]

Compute factorial.

Parameters:: col (Union[Column, str]) – Column or column name (non-negative integers).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the factorial function.

static rand(seed=None)[source]

Generate a random column with i.i.d. samples from U[0.0, 1.0].

Parameters:: seed (Optional[int]) – Random seed (optional).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the rand function.

static randn(seed=None)[source]

Generate a random column with i.i.d. samples from standard normal distribution.

Parameters:: seed (Optional[int]) – Random seed (optional).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the randn function.

static rint(col)[source]

Round to nearest integer using banker’s rounding (half to even).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the rint function.

static bround(col, scale=0)[source]

Round using HALF_EVEN rounding mode (banker’s rounding).

Parameters:

col (Union[Column, str]) – Column or column name.
scale (int) – Number of decimal places (default 0).

Return type:: ColumnOperation

Returns:

ColumnOperation representing the bround function.
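
HALF_EVEN rounding sends exact ties to the nearest even digit, which differs from the more familiar half-up rule. The sketch below demonstrates the difference with Python's decimal module; the helper names are hypothetical, and converting through str(value) is a simplification to avoid binary floating-point noise.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def _round(value: float, scale: int, mode: str) -> Decimal:
    """Round value to `scale` decimal places using the given rounding mode."""
    quantum = Decimal(1).scaleb(-scale)   # e.g. scale=2 gives 0.01
    return Decimal(str(value)).quantize(quantum, rounding=mode)


def bround(value: float, scale: int = 0) -> Decimal:
    """Banker's rounding: ties go to the even neighbour."""
    return _round(value, scale, ROUND_HALF_EVEN)


def round_half_up(value: float, scale: int = 0) -> Decimal:
    """Conventional rounding: ties go away from zero."""
    return _round(value, scale, ROUND_HALF_UP)
```

With these helpers, 2.5 rounds to 2 under HALF_EVEN but to 3 under half-up, and 3.5 rounds to 4 under both.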

static hypot(col1, col2)[source]

Compute sqrt(col1^2 + col2^2) (hypotenuse).

Parameters:

col1 (Union[Column, str]) – First column
col2 (Union[Column, str]) – Second column

Return type:: ColumnOperation

Returns:

ColumnOperation representing the hypot function.

static nanvl(col1, col2)[source]

Returns col1 if not NaN, or col2 if col1 is NaN.

Parameters:

col1 (Union[Column, str]) – First column
col2 (Union[Column, str, int, float]) – Second column or literal value (replacement for NaN)

Return type:

ColumnOperation

Returns:

ColumnOperation representing the nanvl function.
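
The per-row logic of nanvl can be sketched as below. The helper name nanvl_sketch is hypothetical, and the treatment of null (None is not NaN, so it is returned as-is) is an assumption rather than documented behavior.

```python
import math


def nanvl_sketch(value, replacement):
    """Return `replacement` when `value` is NaN, otherwise `value`."""
    if isinstance(value, float) and math.isnan(value):
        return replacement
    return value  # includes None: a null is not a NaN
```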

static signum(col)[source]

Compute the signum function (sign: -1, 0, or 1).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the signum function.

static cot(col)[source]

Compute cotangent (PySpark 3.3+).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the cot function.

static csc(col)[source]

Compute cosecant (PySpark 3.3+).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the csc function.

static sec(col)[source]

Compute secant (PySpark 3.3+).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the sec function.

static e()[source]

Return Euler’s number e (PySpark 3.5+).

Return type:: ColumnOperation
Returns:: ColumnOperation representing Euler’s number constant.

static pi()[source]

Return the value of pi (PySpark 3.5+).

Return type:: ColumnOperation
Returns:: ColumnOperation representing pi constant.

static ln(col)[source]

Compute natural logarithm (alias for log) (PySpark 3.5+).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the ln function.

static toDegrees(column)[source]

Deprecated alias for degrees (all PySpark versions).

Use degrees instead.

Parameters:: column (Union[Column, str]) – Angle in radians.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the degrees conversion.

static toRadians(column)[source]

Deprecated alias for radians (all PySpark versions).

Use radians instead.

Parameters:: column (Union[Column, str]) – Angle in degrees.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the radians conversion.

static pmod(dividend, divisor)[source]

Positive modulo: always returns a non-negative remainder.

Parameters:

dividend (Union[Column, str, int]) – The dividend.
divisor (Union[Column, str, int]) – The divisor.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the pmod function.
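
The difference between a truncated remainder (sign follows the dividend, as in SQL or Java `%`) and a positive modulo can be sketched in plain Python. Both helper names are hypothetical, and the sketch assumes a positive, non-zero divisor.

```python
def truncated_rem(a, n):
    """Remainder whose sign follows the dividend."""
    r = abs(a) % abs(n)
    return -r if a < 0 else r


def pmod_sketch(dividend, divisor):
    """Shift a negative truncated remainder into the range [0, divisor)."""
    r = truncated_rem(dividend, divisor)
    return r + divisor if r < 0 else r
```

For example, the truncated remainder of -7 by 3 is -1, while the positive modulo is 2.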

static negate(column)[source]

Negate value (alias for negative).

Parameters:: column (Union[Column, str]) – The column to negate.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the negate function.

static getbit(column, bit)[source]

Get bit at specified position (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – The column containing the integer.
bit (Union[Column, str, int]) – The bit position (0-indexed from right).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the getbit function.

Example

>>> df.select(F.getbit(F.col("value"), 3))
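
What "0-indexed from right" means for a single integer can be shown with a shift and a mask. This is an illustrative sketch; getbit_sketch is a hypothetical helper, and rejecting negative positions is an assumption.

```python
def getbit_sketch(value, bit):
    """Return the bit of `value` at position `bit`, counting from the least significant bit."""
    if bit < 0:
        raise ValueError("bit position must be non-negative")
    return (value >> bit) & 1
```

For the value 10 (binary 1010), positions 1 and 3 hold 1 and positions 0 and 2 hold 0.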

static width_bucket(value, min_value, max_value, num_buckets)[source]

Compute histogram bucket number for value (PySpark 3.5+).

Parameters:

value (Union[Column, str]) – The value to compute bucket for.
min_value (Union[Column, str, float]) – Minimum value of the range.
max_value (Union[Column, str, float]) – Maximum value of the range.
num_buckets (Union[Column, str, int]) – Number of buckets.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the width_bucket function.

Example

>>> df.select(F.width_bucket(F.col("value"), 0.0, 100.0, 10))
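
The bucket numbering can be sketched for a single value as follows. The sketch follows the usual SQL convention (0 below the range, num_buckets + 1 at or above the maximum) and assumes min_value < max_value; whether this library handles reversed ranges or invalid bucket counts the same way is not stated here, so returning None in those cases is an assumption.

```python
def width_bucket_sketch(value, min_value, max_value, num_buckets):
    """Return the 1-based equal-width bucket that `value` falls into."""
    if num_buckets <= 0 or min_value >= max_value:
        return None  # assumption: invalid arguments yield null
    if value < min_value:
        return 0  # underflow bucket
    if value >= max_value:
        return num_buckets + 1  # overflow bucket
    return int(num_buckets * (value - min_value) / (max_value - min_value)) + 1
```

With the range 0.0 to 100.0 split into 10 buckets, 25.0 lands in bucket 3.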

class sparkless.functions.functions.AggregateFunctions[source]

Bases: object

Collection of aggregate functions.

static count(column=None)[source]

Count non-null values.

Parameters:: column (Union[Column, str, None]) – The column to count (None for count(*)).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the count function (PySpark-compatible).
Raises:: RuntimeError – If no active SparkSession is available

static sum(column)[source]

Sum values.

Parameters:: column (Union[Column, str]) – The column to sum.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the sum function (PySpark-compatible).
Raises:: RuntimeError – If no active SparkSession is available

static avg(column)[source]

Average values.

Parameters:: column (Union[Column, str]) – The column to average.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the avg function (PySpark-compatible).
Raises:: RuntimeError – If no active SparkSession is available

static max(column)[source]

Maximum value.

Parameters:: column (Union[Column, str]) – The column to get max of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the max function (PySpark-compatible).
Raises:: RuntimeError – If no active SparkSession is available

static min(column)[source]

Minimum value.

Parameters:: column (Union[Column, str]) – The column to get min of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the min function (PySpark-compatible).
Raises:: RuntimeError – If no active SparkSession is available

static first(column, ignorenulls=False)[source]

First value.

Parameters:

column (Union[Column, str]) – The column to get first value of.
ignorenulls (bool) – If True, ignore null values and return first non-null value. If False (default), return first value even if it’s null.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the first function.

Raises:

RuntimeError – If no active SparkSession is available
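
The effect of ignorenulls can be illustrated on a plain list standing in for one group of rows. first_sketch is a hypothetical helper, not part of the library.

```python
def first_sketch(values, ignorenulls=False):
    """Return the first value, optionally skipping None entries."""
    for v in values:
        if v is not None or not ignorenulls:
            return v
    return None  # empty input, or only nulls with ignorenulls=True
```

For the group [None, 3, 4] the default returns None, while ignorenulls=True returns 3.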

static last(column)[source]

Last value.

Parameters:: column (Union[Column, str]) – The column to get last value of.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the last function.
Raises:: RuntimeError – If no active SparkSession is available

static collect_list(column)[source]

Collect values into a list.

Parameters:: column (Union[Column, str]) – The column to collect.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the collect_list function.
Raises:: RuntimeError – If no active SparkSession is available

static collect_set(column)[source]

Collect unique values into a set.

Parameters:: column (Union[Column, str]) – The column to collect.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the collect_set function.
Raises:: RuntimeError – If no active SparkSession is available

static stddev(column)[source]

Standard deviation.

Parameters:: column (Union[Column, str]) – The column to get stddev of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the stddev function (PySpark-compatible).
Raises:: RuntimeError – If no active SparkSession is available

static std(column)[source]

Alias for stddev - Standard deviation.

Parameters:: column (Union[Column, str]) – The column to get stddev of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the std function.
Raises:: RuntimeError – If no active SparkSession is available

static product(column)[source]

Multiply all values in column.

Parameters:: column (Union[Column, str]) – The column to multiply values of.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the product function.
Raises:: RuntimeError – If no active SparkSession is available

static sum_distinct(column)[source]

Sum of distinct values.

Parameters:: column (Union[Column, str]) – The column to sum distinct values of.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the sum_distinct function.
Raises:: RuntimeError – If no active SparkSession is available

static variance(column)[source]

Variance.

Parameters:: column (Union[Column, str]) – The column to get variance of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the variance function (PySpark-compatible).
Raises:: RuntimeError – If no active SparkSession is available

static skewness(column)[source]

Skewness.

Parameters:: column (Union[Column, str]) – The column to get skewness of.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the skewness function.
Raises:: RuntimeError – If no active SparkSession is available

static kurtosis(column)[source]

Kurtosis.

Parameters:: column (Union[Column, str]) – The column to get kurtosis of.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the kurtosis function.
Raises:: RuntimeError – If no active SparkSession is available

static countDistinct(column)[source]

Count distinct values.

Parameters:: column (Union[Column, str]) – The column to count distinct values of.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the countDistinct function.
Raises:: RuntimeError – If no active SparkSession is available

static count_distinct(column)[source]

Alias for countDistinct - Count distinct values.

Parameters:: column (Union[Column, str]) – The column to count distinct values of.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the count_distinct function.
Raises:: RuntimeError – If no active SparkSession is available

static percentile_approx(column, percentage, accuracy=10000)[source]

Approximate percentile.

Parameters:

column (Union[Column, str]) – The column to get percentile of.
percentage (float) – The percentage (0.0 to 1.0).
accuracy (int) – The accuracy parameter.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the percentile_approx function.

Raises:

RuntimeError – If no active SparkSession is available

static corr(column1, column2)[source]

Correlation between two columns.

Parameters:

column1 (Union[Column, str]) – The first column.
column2 (Union[Column, str]) – The second column.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the corr function (PySpark-compatible).

Raises:

RuntimeError – If no active SparkSession is available

static covar_samp(column1, column2)[source]

Sample covariance between two columns.

Parameters:

column1 (Union[Column, str]) – The first column.
column2 (Union[Column, str]) – The second column.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the covar_samp function (PySpark-compatible).

Raises:

RuntimeError – If no active SparkSession is available

static bool_and(column)[source]

Aggregate AND - returns true if all values are true (PySpark 3.1+).

Parameters:: column (Union[Column, str]) – Column containing boolean values.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the bool_and function.
Raises:: RuntimeError – If no active SparkSession is available

static bool_or(column)[source]

Aggregate OR - returns true if any value is true (PySpark 3.1+).

Parameters:: column (Union[Column, str]) – Column containing boolean values.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the bool_or function.
Raises:: RuntimeError – If no active SparkSession is available

static every(column)[source]

Alias for bool_and (PySpark 3.1+).

Parameters:: column (Union[Column, str]) – Column containing boolean values.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the every function.
Raises:: RuntimeError – If no active SparkSession is available

static some(column)[source]

Alias for bool_or (PySpark 3.1+).

Parameters:: column (Union[Column, str]) – Column containing boolean values.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the some function.
Raises:: RuntimeError – If no active SparkSession is available

static max_by(column, ord)[source]

Return value associated with the maximum of ord column (PySpark 3.1+).

Parameters:

column (Union[Column, str]) – Column to return value from.
ord (Union[Column, str]) – Column to find maximum of.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the max_by function.

Raises:

RuntimeError – If no active SparkSession is available

static min_by(column, ord)[source]

Return value associated with the minimum of ord column (PySpark 3.1+).

Parameters:

column (Union[Column, str]) – Column to return value from.
ord (Union[Column, str]) – Column to find minimum of.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the min_by function.

Raises:

RuntimeError – If no active SparkSession is available
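
The idea behind max_by and min_by, shown over a list of dict rows: pick the row with the extreme ord value, then return a different field from it. The helper names are hypothetical, and skipping rows whose ord value is null is an assumption.

```python
def max_by_sketch(rows, value_key, ord_key):
    """Return rows[value_key] from the row with the largest rows[ord_key]."""
    candidates = [r for r in rows if r[ord_key] is not None]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r[ord_key])[value_key]


def min_by_sketch(rows, value_key, ord_key):
    """Return rows[value_key] from the row with the smallest rows[ord_key]."""
    candidates = [r for r in rows if r[ord_key] is not None]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r[ord_key])[value_key]
```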

static count_if(column)[source]

Count rows where condition is true (PySpark 3.1+).

Parameters:: column (Union[Column, str]) – Boolean column or condition.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the count_if function.
Raises:: RuntimeError – If no active SparkSession is available

static any_value(column)[source]

Return any non-null value (non-deterministic) (PySpark 3.1+).

Parameters:: column (Union[Column, str]) – Column to return value from.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the any_value function.
Raises:: RuntimeError – If no active SparkSession is available

static mean(column)[source]

Aggregate function: returns the mean of the values (alias for avg).

Parameters:: column (Union[Column, str]) – Numeric column.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the mean function.
Raises:: RuntimeError – If no active SparkSession is available

static approx_count_distinct(column, rsd=None)[source]

Returns the approximate count of distinct elements (replaces the deprecated approxCountDistinct).

Parameters:

column (Union[Column, str]) – Column to count distinct values.
rsd (Optional[float]) – Optional relative standard deviation (default: None, which uses PySpark’s default of 0.05). Controls the approximation accuracy. Lower values provide better accuracy but use more memory. Typical values range from 0.01 (1% error) to 0.1 (10% error).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the approx_count_distinct function (PySpark-compatible).

Raises:

RuntimeError – If no active SparkSession is available

static stddev_pop(column)[source]

Returns population standard deviation.

Parameters:: column (Union[Column, str]) – Numeric column.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the stddev_pop function.
Raises:: RuntimeError – If no active SparkSession is available

static stddev_samp(column)[source]

Returns sample standard deviation.

Parameters:: column (Union[Column, str]) – Numeric column.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the stddev_samp function.
Raises:: RuntimeError – If no active SparkSession is available

static var_pop(column)[source]

Returns population variance.

Parameters:: column (Union[Column, str]) – Numeric column.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the var_pop function.
Raises:: RuntimeError – If no active SparkSession is available

static var_samp(column)[source]

Returns sample variance.

Parameters:: column (Union[Column, str]) – Numeric column.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the var_samp function.
Raises:: RuntimeError – If no active SparkSession is available
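
The population and sample variants differ only in the divisor: n for the population forms, n - 1 for the sample forms. A minimal sketch using Python's statistics module; spread_sketch is a hypothetical helper, and dropping nulls before aggregating is an assumption.

```python
import statistics


def spread_sketch(values):
    """Compute the four spread measures over the non-null values."""
    data = [v for v in values if v is not None]
    return {
        "var_pop": statistics.pvariance(data),    # divides by n
        "var_samp": statistics.variance(data),    # divides by n - 1, needs n >= 2
        "stddev_pop": statistics.pstdev(data),    # sqrt of var_pop
        "stddev_samp": statistics.stdev(data),    # sqrt of var_samp
    }
```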

static covar_pop(column1, column2)[source]

Returns population covariance.

Parameters:

column1 (Union[Column, str]) – First numeric column.
column2 (Union[Column, str]) – Second numeric column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the covar_pop function.
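
Population and sample covariance share the same co-moment and differ only in the divisor. The sketch below is plain Python for illustration; covariance_sketch is a hypothetical helper, and ignoring pairs that contain a null is an assumption.

```python
def covariance_sketch(xs, ys, sample=True):
    """Covariance of paired values; sample=True divides by n - 1, else by n."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n == 0 or (sample and n < 2):
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    co_moment = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return co_moment / (n - 1 if sample else n)
```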

static median(column)[source]

Returns the median value (PySpark 3.4+).

Parameters:: column (Union[Column, str]) – Numeric column.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the median function.
Raises:: RuntimeError – If no active SparkSession is available

static mode(column)[source]

Returns the most frequent value (mode) (PySpark 3.4+).

Parameters:: column (Union[Column, str]) – Column to find mode of.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the mode function.
Raises:: RuntimeError – If no active SparkSession is available

static percentile(column, percentage)[source]

Returns the exact percentile value (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – Numeric column.
percentage (float) – Percentile to compute (between 0.0 and 1.0).

Return type:

AggregateFunction

Returns:

AggregateFunction representing the percentile function.
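
An exact percentile is commonly computed by sorting the values and interpolating linearly between the two nearest ranks. The sketch below shows that technique; it is an assumption that this library interpolates the same way, and percentile_sketch is a hypothetical helper.

```python
def percentile_sketch(values, percentage):
    """Exact percentile with linear interpolation between neighboring ranks."""
    if not 0.0 <= percentage <= 1.0:
        raise ValueError("percentage must be between 0.0 and 1.0")
    data = sorted(v for v in values if v is not None)
    if not data:
        return None
    position = percentage * (len(data) - 1)  # fractional rank
    lower = int(position)
    upper = min(lower + 1, len(data) - 1)
    fraction = position - lower
    return data[lower] + (data[upper] - data[lower]) * fraction
```

With percentage 0.5 this returns the median, for example 2.5 for the values 1, 2, 3, 4.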

static approxCountDistinct(*cols)[source]

Deprecated alias for approx_count_distinct (all PySpark versions).

Use approx_count_distinct instead.

Parameters:: cols (Union[Column, str]) – Columns to count distinct values for. Only the first column is used.
Return type:: AggregateFunction
Returns:: AggregateFunction for approximate distinct count.

static sumDistinct(column)[source]

Deprecated alias for sum_distinct (PySpark 3.2+).

Use sum_distinct instead (or sum(distinct(col)) for earlier versions).

Parameters:: column (Union[Column, str]) – Numeric column to sum.
Return type:: AggregateFunction
Returns:: AggregateFunction for distinct sum.

static regr_avgx(y, x)[source]

Linear regression average of x.

Parameters:

y (Union[Column, str]) – The y column.
x (Union[Column, str]) – The x column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the regr_avgx function.

Raises:

RuntimeError – If no active SparkSession is available

static regr_avgy(y, x)[source]

Linear regression average of y.

Parameters:

y (Union[Column, str]) – The y column.
x (Union[Column, str]) – The x column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the regr_avgy function.

Raises:

RuntimeError – If no active SparkSession is available

static regr_count(y, x)[source]

Linear regression count.

Parameters:

y (Union[Column, str]) – The y column.
x (Union[Column, str]) – The x column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the regr_count function.

Raises:

RuntimeError – If no active SparkSession is available

static regr_intercept(y, x)[source]

Linear regression intercept.

Parameters:

y (Union[Column, str]) – The y column.
x (Union[Column, str]) – The x column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the regr_intercept function.

Raises:

RuntimeError – If no active SparkSession is available

static regr_r2(y, x)[source]

Linear regression R-squared.

Parameters:

y (Union[Column, str]) – The y column.
x (Union[Column, str]) – The x column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the regr_r2 function.

Raises:

RuntimeError – If no active SparkSession is available

static regr_slope(y, x)[source]

Linear regression slope.

Parameters:

y (Union[Column, str]) – The y column.
x (Union[Column, str]) – The x column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the regr_slope function.

Raises:

RuntimeError – If no active SparkSession is available

static regr_sxx(y, x)[source]

Linear regression sum of squares of x.

Parameters:

y (Union[Column, str]) – The y column.
x (Union[Column, str]) – The x column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the regr_sxx function.

Raises:

RuntimeError – If no active SparkSession is available

static regr_sxy(y, x)[source]

Linear regression sum of products.

Parameters:

y (Union[Column, str]) – The y column.
x (Union[Column, str]) – The x column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the regr_sxy function.

Raises:

RuntimeError – If no active SparkSession is available

static regr_syy(y, x)[source]

Linear regression sum of squares of y.

Parameters:

y (Union[Column, str]) – The y column.
x (Union[Column, str]) – The x column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the regr_syy function.

Raises:

RuntimeError – If no active SparkSession is available
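
The regr_* aggregates are related through a few sums over the pairs where both y and x are non-null. The sketch below follows the standard SQL definitions and is not taken from this library's source; regr_sketch is a hypothetical helper.

```python
def regr_sketch(ys, xs):
    """Compute the regr_* quantities for paired y and x values."""
    pairs = [(y, x) for y, x in zip(ys, xs) if y is not None and x is not None]
    n = len(pairs)
    if n == 0:
        return None
    avgy = sum(y for y, _ in pairs) / n
    avgx = sum(x for _, x in pairs) / n
    sxx = sum((x - avgx) ** 2 for _, x in pairs)
    syy = sum((y - avgy) ** 2 for y, _ in pairs)
    sxy = sum((x - avgx) * (y - avgy) for y, x in pairs)
    # Slope and intercept are undefined when x has no variation.
    slope = sxy / sxx if sxx != 0 else None
    intercept = avgy - slope * avgx if slope is not None else None
    if sxx == 0:
        r2 = None
    elif syy == 0:
        r2 = 1.0
    else:
        r2 = sxy ** 2 / (sxx * syy)
    return {
        "regr_count": n, "regr_avgx": avgx, "regr_avgy": avgy,
        "regr_sxx": sxx, "regr_syy": syy, "regr_sxy": sxy,
        "regr_slope": slope, "regr_intercept": intercept, "regr_r2": r2,
    }
```

For points on the line y = 2x + 1 the sketch yields a slope of 2, an intercept of 1, and an R-squared of 1.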

static approx_percentile(column, percentage, accuracy=10000)[source]

Compute approximate percentile (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – The column to compute percentile for.
percentage (Union[float, Column, str]) – The percentage (0.0 to 1.0) or array of percentages.
accuracy (Union[int, Column, str]) – The accuracy parameter (default: 10000).

Return type:

AggregateFunction

Returns:

AggregateFunction representing the approx_percentile function.

Example

>>> df.groupBy("dept").agg(F.approx_percentile(F.col("salary"), 0.5))

class sparkless.functions.functions.DateTimeFunctions[source]

Bases: object

Collection of datetime functions.

static current_timestamp()[source]

Get current timestamp.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the current_timestamp function.
Raises:: RuntimeError – If no active SparkSession is available

static current_date()[source]

Get current date.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the current_date function.
Raises:: RuntimeError – If no active SparkSession is available

static now()[source]

Alias for current_timestamp - Get current timestamp.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the now function.

static curdate()[source]

Alias for current_date - Get current date.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the curdate function.

static days(column)[source]

Convert number to days interval.

Parameters:: column (Union[Column, str, int]) – The number of days (can be column or literal).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the days function.

static hours(column)[source]

Convert number to hours interval.

Parameters:: column (Union[Column, str, int]) – The number of hours (can be column or literal).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the hours function.

static months(column)[source]

Convert number to months interval.

Parameters:: column (Union[Column, str, int]) – The number of months (can be column or literal).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the months function.

static years(column)[source]

Convert number to years interval.

Parameters:: column (Union[Column, str, int]) – The number of years (can be column or literal).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the years function.

static localtimestamp()[source]

Get local timestamp (without timezone).

Return type:: ColumnOperation
Returns:: ColumnOperation representing the localtimestamp function.

static dateadd(date_part, value, date)[source]

SQL Server-style date addition.

Parameters:

date_part (str) – The date part to add (year, month, day, etc.).
value (Union[Column, str, int]) – The value to add.
date (Union[Column, str]) – The date column.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the dateadd function.

static datepart(date_part, date)[source]

SQL Server-style date part extraction.

Parameters:

date_part (str) – The date part to extract (year, month, day, etc.).
date (Union[Column, str]) – The date column.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the datepart function.

static make_timestamp(year, month, day, hour=0, minute=0, second=0)[source]

Create timestamp from components.

Parameters:

year (Union[Column, str, int]) – Year component.
month (Union[Column, str, int]) – Month component.
day (Union[Column, str, int]) – Day component.
hour (Union[Column, str, int]) – Hour component (default 0).
minute (Union[Column, str, int]) – Minute component (default 0).
second (Union[Column, str, int]) – Second component (default 0).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the make_timestamp function.

static make_timestamp_ltz(year, month, day, hour=0, minute=0, second=0, timezone=None)[source]

Create timestamp with local timezone.

Parameters:

year (Union[Column, str, int]) – Year component.
month (Union[Column, str, int]) – Month component.
day (Union[Column, str, int]) – Day component.
hour (Union[Column, str, int]) – Hour component (default 0).
minute (Union[Column, str, int]) – Minute component (default 0).
second (Union[Column, str, int]) – Second component (default 0).
timezone (Optional[str]) – Optional timezone string.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the make_timestamp_ltz function.

static make_timestamp_ntz(year, month, day, hour=0, minute=0, second=0)[source]

Create timestamp with no timezone.

Parameters:

year (Union[Column, str, int]) – Year component.
month (Union[Column, str, int]) – Month component.
day (Union[Column, str, int]) – Day component.
hour (Union[Column, str, int]) – Hour component (default 0).
minute (Union[Column, str, int]) – Minute component (default 0).
second (Union[Column, str, int]) – Second component (default 0).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the make_timestamp_ntz function.

static make_interval(years=0, months=0, weeks=0, days=0, hours=0, mins=0, secs=0)[source]

Create interval from components.

Parameters:

years (Union[Column, str, int]) – Years component (default 0).
months (Union[Column, str, int]) – Months component (default 0).
weeks (Union[Column, str, int]) – Weeks component (default 0).
days (Union[Column, str, int]) – Days component (default 0).
hours (Union[Column, str, int]) – Hours component (default 0).
mins (Union[Column, str, int]) – Minutes component (default 0).
secs (Union[Column, str, int]) – Seconds component (default 0).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the make_interval function.

static make_dt_interval(days=0, hours=0, mins=0, secs=0)[source]

Create day-time interval.

Parameters:

days (Union[Column, str, int]) – Days component (default 0).
hours (Union[Column, str, int]) – Hours component (default 0).
mins (Union[Column, str, int]) – Minutes component (default 0).
secs (Union[Column, str, int]) – Seconds component (default 0).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the make_dt_interval function.

static make_ym_interval(years=0, months=0)[source]

Create year-month interval.

Parameters:

years (Union[Column, str, int]) – Years component (default 0).
months (Union[Column, str, int]) – Months component (default 0).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the make_ym_interval function.

static to_number(column, format=None)[source]

Convert string to number.

Parameters:

column (Union[Column, str]) – The column to convert.
format (Optional[str]) – Optional format string.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the to_number function.

static to_binary(column, format=None)[source]

Convert to binary format.

Parameters:

column (Union[Column, str]) – The column to convert.
format (Optional[str]) – Optional format string.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the to_binary function.

static to_unix_timestamp(column, format=None)[source]

Convert to unix timestamp.

Parameters:

column (Union[Column, str]) – The column to convert.
format (Optional[str]) – Optional format string.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the to_unix_timestamp function.

static unix_date(column)[source]

Convert unix timestamp to date.

Parameters:: column (Union[Column, str]) – The unix timestamp column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the unix_date function.

static unix_seconds(column)[source]

Convert timestamp to unix seconds.

Parameters:: column (Union[Column, str]) – The timestamp column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the unix_seconds function.

static unix_millis(column)[source]

Convert timestamp to unix milliseconds.

Parameters:: column (Union[Column, str]) – The timestamp column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the unix_millis function.

static unix_micros(column)[source]

Convert timestamp to unix microseconds.

Parameters:: column (Union[Column, str]) – The timestamp column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the unix_micros function.

static timestamp_millis(column)[source]

Create timestamp from unix milliseconds.

Parameters:: column (Union[Column, str]) – The unix milliseconds column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the timestamp_millis function.

static timestamp_micros(column)[source]

Create timestamp from unix microseconds.

Parameters:: column (Union[Column, str]) – The unix microseconds column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the timestamp_micros function.
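
The unix_* and timestamp_* conversions are inverses of each other at different resolutions. A minimal sketch with the standard library, assuming timezone-aware UTC timestamps; all helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def unix_micros_sketch(ts):
    """Whole microseconds elapsed since 1970-01-01 00:00:00 UTC."""
    delta = ts - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def unix_millis_sketch(ts):
    return unix_micros_sketch(ts) // 1_000  # drops sub-millisecond precision


def unix_seconds_sketch(ts):
    return unix_micros_sketch(ts) // 1_000_000  # drops sub-second precision


def timestamp_millis_sketch(millis):
    return EPOCH + timedelta(milliseconds=millis)


def timestamp_micros_sketch(micros):
    return EPOCH + timedelta(microseconds=micros)
```

Integer arithmetic is used throughout so that no precision is lost to floating point.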

static to_date(column, format=None)[source]

Convert string, timestamp, or date to date.

Parameters:

column (Union[Column, str]) – The column to convert (StringType, TimestampType, or DateType).
format (Optional[str]) – Optional date format string (only used for StringType input).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the to_date function.

Raises:

TypeError – If input column type is not StringType, TimestampType, or DateType

static to_timestamp(column, format=None)[source]

Convert to timestamp.

Parameters:

column (Union[Column, str]) – The column to convert. Accepts StringType, TimestampType, IntegerType, LongType, DateType, or DoubleType (matching PySpark behavior).
format (Optional[str]) – Optional timestamp format string (used for StringType input).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the to_timestamp function.

Raises:

TypeError – If input column type is not one of the supported types.

static hour(column)[source]

Extract hour from timestamp.

Parameters:: column (Union[Column, str]) – The column to extract hour from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the hour function.

static day(column)[source]

Extract day from date/timestamp.

Parameters:: column (Union[Column, str]) – The column to extract day from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the day function.

static dayofmonth(column)[source]

Extract day of month from date/timestamp (alias for day).

Parameters:: column (Union[Column, str]) – The column to extract day from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the dayofmonth function.

static month(column)[source]

Extract month from date/timestamp.

Parameters:: column (Union[Column, str]) – The column to extract month from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the month function.

static year(column)[source]

Extract year from date/timestamp.

Parameters:: column (Union[Column, str]) – The column to extract year from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the year function.

static dayofweek(column)[source]

Extract day of week from date/timestamp.

Parameters:: column (Union[Column, str]) – The column to extract day of week from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the dayofweek function.

static dayofyear(column)[source]

Extract day of year from date/timestamp.

Parameters:: column (Union[Column, str]) – The column to extract day of year from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the dayofyear function.

static weekofyear(column)[source]

Extract week of year from date/timestamp.

Parameters:: column (Union[Column, str]) – The column to extract week of year from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the weekofyear function.

static quarter(column)[source]

Extract quarter from date/timestamp.

Parameters:: column (Union[Column, str]) – The column to extract quarter from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the quarter function.

static minute(column)[source]

Extract minute from timestamp.

Parameters:: column (Union[Column, str]) – The column to extract minute from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the minute function.

static second(column)[source]

Extract second from timestamp.

Parameters:: column (Union[Column, str]) – The column to extract second from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the second function.

static add_months(column, num_months)[source]

Add months to date/timestamp.

Parameters:

column (Union[Column, str]) – The column to add months to.
num_months (int) – Number of months to add.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the add_months function.

static months_between(column1, column2)[source]

Calculate months between two dates.

Parameters:

column1 (Union[Column, str]) – The first date column.
column2 (Union[Column, str]) – The second date column.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the months_between function.

static date_add(column, days)[source]

Add days to date.

Parameters:

column (Union[Column, str]) – The column to add days to.
days (int) – Number of days to add.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the date_add function.

static date_sub(column, days)[source]

Subtract days from date.

Parameters:

column (Union[Column, str]) – The column to subtract days from.
days (int) – Number of days to subtract.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the date_sub function.

static date_format(column, format)[source]

Format date/timestamp as string.

Parameters:

column (Union[Column, str]) – The column to format.
format (str) – Date format string (e.g., 'yyyy-MM-dd').

Return type:: ColumnOperation
Returns:: ColumnOperation representing the date_format function.

static from_unixtime(column, format='yyyy-MM-dd HH:mm:ss')[source]

Convert unix timestamp to string.

Parameters:

column (Union[Column, str]) – The column with unix timestamp.
format (str) – Date format string (default: 'yyyy-MM-dd HH:mm:ss').

Return type:: ColumnOperation
Returns:: ColumnOperation representing the from_unixtime function.

static timestampadd(unit, quantity, timestamp)[source]

Add time units to a timestamp.

Parameters:

unit (str) – Time unit (YEAR, QUARTER, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND).
quantity (Union[int, Column]) – Number of units to add (can be column or integer).
timestamp (Union[str, Column]) – Timestamp column or literal.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the timestampadd function.

Example

>>> df.select(F.timestampadd("DAY", 7, F.col("created_at")))
>>> df.select(F.timestampadd("HOUR", F.col("offset"), "2024-01-01"))
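
For the fixed-length units (WEEK down to SECOND), the addition amounts to ordinary timedelta arithmetic. The sketch below is a plain-Python illustration of that semantics, not Sparkless's implementation; `add_fixed_units` and `_UNIT_SECONDS` are hypothetical names, and the calendar units (YEAR, QUARTER, MONTH) are deliberately left out because they need month-length handling.

```python
from datetime import datetime, timedelta

# Hypothetical lookup table: seconds per fixed-length unit.
_UNIT_SECONDS = {"WEEK": 604800, "DAY": 86400, "HOUR": 3600, "MINUTE": 60, "SECOND": 1}


def add_fixed_units(unit: str, quantity: int, ts: datetime) -> datetime:
    """Add `quantity` fixed-length units to `ts`; negative quantities subtract."""
    if unit not in _UNIT_SECONDS:
        raise ValueError(f"unit not covered by this sketch: {unit}")
    return ts + timedelta(seconds=_UNIT_SECONDS[unit] * quantity)


one_week_later = add_fixed_units("DAY", 7, datetime(2024, 1, 1))
three_hours_earlier = add_fixed_units("HOUR", -3, datetime(2024, 1, 1))
```

Here `one_week_later` is 2024-01-08 00:00:00 and `three_hours_earlier` is 2023-12-31 21:00:00.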

static timestampdiff(unit, start, end)[source]

Calculate difference between two timestamps.

Parameters:

unit (str) – Time unit (YEAR, QUARTER, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND).
start (Union[str, Column]) – Start timestamp column or literal.
end (Union[str, Column]) – End timestamp column or literal.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the timestampdiff function.

Example

>>> df.select(F.timestampdiff("DAY", F.col("start_date"), F.col("end_date")))
>>> df.select(F.timestampdiff("HOUR", "2024-01-01", F.col("end_time")))

static convert_timezone(sourceTz, targetTz, sourceTs)[source]

Convert timestamp from source to target timezone.

Parameters:

sourceTz (str)
targetTz (str)
sourceTs (Union[Column, str])

Return type:: ColumnOperation

static current_timezone()[source]

Get current timezone.

Raises:: RuntimeError – If no active SparkSession is available
Return type:: ColumnOperation

static from_utc_timestamp(ts, tz)[source]

Convert UTC timestamp to given timezone.

Parameters:

ts (Union[Column, str])
tz (str)

Return type:: ColumnOperation

static to_utc_timestamp(ts, tz)[source]

Convert timestamp from given timezone to UTC.

Parameters:

ts (Union[Column, str])
tz (str)

Return type:: ColumnOperation

static date_part(field, source)[source]

Extract a field from a date/timestamp.

Parameters:

field (str) – Field to extract (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, etc.).
source (Union[Column, str]) – Date/timestamp column.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the date_part function.

Example

>>> df.select(F.date_part("YEAR", F.col("date")))

static dayname(date)[source]

Get the name of the day of the week.

Parameters:: date (Union[Column, str]) – Date column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the dayname function.

Example

>>> df.select(F.dayname(F.col("date")))

static make_date(year, month, day)[source]

Construct a date from year, month, day integers (PySpark 3.0+).

Parameters:

year (Union[Column, int, str, Literal]) – Year column or integer
month (Union[Column, int, str, Literal]) – Month column or integer (1-12)
day (Union[Column, int, str, Literal]) – Day column or integer (1-31)

Return type:: ColumnOperation
Returns:: ColumnOperation representing the make_date function

Example

>>> df.select(F.make_date(F.lit(2024), F.lit(3), F.lit(15)))

static date_trunc(format, timestamp)[source]

Truncate timestamp to specified unit (year, month, day, hour, etc.).

Parameters:

format (str) – Truncation unit ('year', 'month', 'day', 'hour', 'minute', 'second')
timestamp (Union[Column, str]) – Timestamp column to truncate

Return type:: ColumnOperation
Returns:: ColumnOperation representing the date_trunc function

Example

>>> df.select(F.date_trunc('month', F.col('timestamp')))
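
Truncation keeps the named unit and resets every smaller field to its minimum. The following plain-Python sketch illustrates that rule for the six units listed above; `truncate_sketch` is a hypothetical helper, not part of Sparkless.

```python
from datetime import datetime


def truncate_sketch(unit: str, ts: datetime) -> datetime:
    """Reset every field smaller than `unit` to its minimum value."""
    if unit == "year":
        return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    if unit == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if unit == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if unit == "minute":
        return ts.replace(second=0, microsecond=0)
    if unit == "second":
        return ts.replace(microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")


sample = datetime(2024, 3, 15, 13, 45, 30)
month_start = truncate_sketch("month", sample)
hour_start = truncate_sketch("hour", sample)
```

With the sample value, `month_start` is 2024-03-01 00:00:00 and `hour_start` is 2024-03-15 13:00:00.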

static datediff(end, start)[source]

Return the number of days from start to end (negative when end is earlier than start).

Parameters:

end (Union[Column, str, Literal]) – End date column or literal
start (Union[Column, str, Literal]) – Start date column or literal

Return type:: ColumnOperation
Returns:: ColumnOperation representing the datediff function

Example

>>> df.select(F.datediff(F.col('end_date'), F.lit('2024-01-01')))
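
Note the argument order: the end date comes first. A small plain-Python sketch of the same arithmetic, using the hypothetical helper `days_between`, makes the sign convention visible.

```python
from datetime import date


def days_between(end: date, start: date) -> int:
    """Whole days from `start` to `end`; negative if `end` precedes `start`."""
    return (end - start).days


forward = days_between(date(2024, 3, 1), date(2024, 1, 1))
backward = days_between(date(2024, 1, 1), date(2024, 3, 1))
```

Because 2024 is a leap year, `forward` is 60 (31 days of January plus 29 of February) and `backward` is -60.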

static date_diff(end, start)[source]

Alias for datediff - Returns number of days between two dates.

Parameters:

end (Union[Column, str]) – End date column
start (Union[Column, str]) – Start date column

Return type:: ColumnOperation
Returns:: ColumnOperation representing the date_diff function

Example

>>> df.select(F.date_diff(F.col('end_date'), F.col('start_date')))

static unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')[source]

Convert timestamp string to Unix timestamp (seconds since epoch).

Parameters:

timestamp (Union[Column, str, None]) – Timestamp column (optional, defaults to current timestamp)
format (str) – Date/time format string

Return type:: ColumnOperation
Returns:: ColumnOperation representing the unix_timestamp function

Example

>>> df.select(F.unix_timestamp(F.col('timestamp'), 'yyyy-MM-dd'))

static last_day(date)[source]

Returns the last day of the month for a given date.

Parameters:: date (Union[Column, str]) – Date column
Return type:: ColumnOperation
Returns:: ColumnOperation representing the last_day function

Example

>>> df.select(F.last_day(F.col('date')))
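
The month-end calculation can be pictured with the standard library's calendar module. This is an illustrative sketch with a hypothetical helper name, not the Sparkless implementation.

```python
import calendar
from datetime import date


def last_day_of_month(d: date) -> date:
    """Return the final calendar day of the month containing `d`."""
    _, days_in_month = calendar.monthrange(d.year, d.month)
    return d.replace(day=days_in_month)


leap_feb = last_day_of_month(date(2024, 2, 10))
plain_feb = last_day_of_month(date(2023, 2, 10))
```

`leap_feb` is 2024-02-29 and `plain_feb` is 2023-02-28, so leap years are handled by `monthrange` itself.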

static next_day(date, dayOfWeek)[source]

Return the first date after the given date that falls on the specified day of the week.

Parameters:

date (Union[Column, str]) – Date column
dayOfWeek (str) – Day of week string (e.g., 'Mon', 'Monday')

Return type:: ColumnOperation
Returns:: ColumnOperation representing the next_day function

Example

>>> df.select(F.next_day(F.col('date'), 'Monday'))
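
The result is always strictly later than the input, so asking for "Monday" on a Monday yields the Monday one week ahead. The plain-Python sketch below illustrates that rule; `next_weekday` and `_DAYS` are hypothetical names, and matching on the first three letters is an assumption made here so that both 'Mon' and 'Monday' are accepted.

```python
from datetime import date, timedelta

# Three-letter prefixes in Python's weekday order (Monday = 0).
_DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def next_weekday(d: date, day_of_week: str) -> date:
    """First date strictly after `d` that falls on `day_of_week`."""
    target = _DAYS.index(day_of_week[:3].lower())
    # Offset is always in 1..7, never 0, so the result is strictly later.
    delta = (target - d.weekday() - 1) % 7 + 1
    return d + timedelta(days=delta)


monday = date(2024, 1, 1)  # a Monday
same_weekday = next_weekday(monday, "Monday")
following_day = next_weekday(monday, "Tue")
```

`same_weekday` is 2024-01-08 and `following_day` is 2024-01-02.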

static trunc(date, format)[source]

Truncate date to specified unit (year, month, etc.).

Parameters:

date (Union[Column, str]) – Date column
format (str) – Truncation format ('year', 'yyyy', 'yy', 'month', 'mon', 'mm')

Return type:: ColumnOperation
Returns:: ColumnOperation representing the trunc function

Example

>>> df.select(F.trunc(F.col('date'), 'year'))

static timestamp_seconds(col)[source]

Convert seconds since epoch to timestamp (PySpark 3.1+).

Parameters:: col (Union[Column, str, int]) – Column or integer representing seconds since epoch
Return type:: ColumnOperation
Returns:: ColumnOperation representing the timestamp

Example

>>> df.select(F.timestamp_seconds(F.col("seconds")))

static weekday(col)[source]

Get the day of week as an integer (0 = Monday, 6 = Sunday) (PySpark 3.5+).

Parameters:: col (Union[Column, str]) – Column or column name containing date/timestamp values.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the weekday function.

Note

Returns 0 for Monday through 6 for Sunday.
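
This is the same numbering that Python's `date.weekday()` uses. It differs from PySpark's `dayofweek`, which counts 1 for Sunday through 7 for Saturday; the sketch below assumes Sparkless mirrors that PySpark convention, and both helper names are hypothetical.

```python
from datetime import date


def weekday_index(d: date) -> int:
    """0 = Monday ... 6 = Sunday, the convention described for weekday()."""
    return d.weekday()


def dayofweek_index(d: date) -> int:
    """1 = Sunday ... 7 = Saturday, the PySpark dayofweek convention."""
    return (d.weekday() + 1) % 7 + 1


monday = date(2024, 1, 1)
sunday = date(2024, 1, 7)
pairs = [(weekday_index(x), dayofweek_index(x)) for x in (monday, sunday)]
```

For the Monday the pair is (0, 2); for the Sunday it is (6, 1).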

static extract(field, source)[source]

Extract a field from a date/timestamp column (PySpark 3.5+).

Parameters:

field (str) – The field to extract (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, etc.)
source (Union[Column, str]) – Column or column name containing date/timestamp values.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the extract function.

Example

>>> df.select(F.extract("YEAR", F.col("date")))
>>> df.select(F.extract("MONTH", F.col("timestamp")))

static date_from_unix_date(days)[source]

Convert unix date (days since epoch) to date (PySpark 3.5+).

Parameters:: days (Union[Column, str, int]) – Column or integer representing days since epoch (1970-01-01).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the date_from_unix_date function.

Example

>>> df.select(F.date_from_unix_date(F.col("days")))
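
A unix date is a day count relative to 1970-01-01, so the conversion is a single offset. A plain-Python sketch, with the hypothetical names `EPOCH` and `from_unix_days`:

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def from_unix_days(days: int) -> date:
    """Convert a day count since 1970-01-01 into a calendar date."""
    return EPOCH + timedelta(days=days)


epoch_day = from_unix_days(0)
day_before = from_unix_days(-1)
new_year_2024 = from_unix_days(19723)
```

`epoch_day` is 1970-01-01, `day_before` is 1969-12-31, and `new_year_2024` is 2024-01-01.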

static to_timestamp_ltz(timestamp_str, format=None)[source]

Convert string to timestamp with local timezone (PySpark 3.5+).

Parameters:

timestamp_str (Union[Column, str]) – Column or string containing timestamp.
format (Optional[str]) – Optional format string for parsing.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the to_timestamp_ltz function.

Example

>>> df.select(F.to_timestamp_ltz(F.col("ts_str"), "yyyy-MM-dd HH:mm:ss"))

static to_timestamp_ntz(timestamp_str, format=None)[source]

Convert string to timestamp with no timezone (PySpark 3.5+).

Parameters:

timestamp_str (Union[Column, str]) – Column or string containing timestamp.
format (Optional[str]) – Optional format string for parsing.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the to_timestamp_ntz function.

Example

>>> df.select(F.to_timestamp_ntz(F.col("ts_str"), "yyyy-MM-dd HH:mm:ss"))

String Functions

String functions for Sparkless.

This module provides comprehensive string manipulation functions that match PySpark’s string function API. Includes case conversion, trimming, pattern matching, and string transformation operations for text processing in DataFrames.

Key Features:

Complete PySpark string function API compatibility
Case conversion (upper, lower)
Length and trimming operations (length, trim, ltrim, rtrim)
Pattern matching and replacement (regexp_replace, split)
String manipulation (substring, concat)
Type-safe operations with proper return types
Support for both column references and string literals

Example

>>> from sparkless.sql import SparkSession, functions as F
>>> spark = SparkSession("test")
>>> data = [{"name": "  Alice  ", "email": "alice@example.com"}]
>>> df = spark.createDataFrame(data)
>>> df.select(
...     F.upper(F.trim(F.col("name"))),
...     F.regexp_replace(F.col("email"), "@.*", "@company.com")
... ).show()
DataFrame[1 rows, 2 columns]

upper(trim(name)) regexp_replace(email, @.*, @company.com, 1)
ALICE               alice@company.com

class sparkless.functions.string.StringFunctions[source]

Bases: object

Collection of string manipulation functions.

static upper(column)[source]

Convert string to uppercase.

Parameters:: column (Union[Column, str]) – The column to convert.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the upper function.

static lower(column)[source]

Convert string to lowercase.

Parameters:: column (Union[Column, str]) – The column to convert.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the lower function.

static length(column)[source]

Get the length of a string.

Parameters:: column (Union[Column, str]) – The column to get length of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the length function.

static char_length(column)[source]

Alias for length() - Get the character length of a string (PySpark 3.5+).

Parameters:: column (Union[Column, str]) – The column to get length of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the char_length function.

static character_length(column)[source]

Alias for length() - Get the character length of a string (PySpark 3.5+).

Parameters:: column (Union[Column, str]) – The column to get length of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the character_length function.

static trim(column)[source]

Trim whitespace from string.

Parameters:: column (Union[Column, str]) – The column to trim.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the trim function.

static ltrim(column)[source]

Trim whitespace from left side of string.

Parameters:: column (Union[Column, str]) – The column to trim.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the ltrim function.

static rtrim(column)[source]

Trim whitespace from right side of string.

Parameters:: column (Union[Column, str]) – The column to trim.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the rtrim function.

static btrim(column, trim_string=None)[source]

Trim characters from both ends of string.

Parameters:

column (Union[Column, str]) – The column to trim.
trim_string (Optional[str]) – Optional string of characters to trim (default: whitespace).

Return type:: ColumnOperation
Returns:: ColumnOperation representing the btrim function.

static contains(column, substring)[source]

Check if string contains substring.

Parameters:

column (Union[Column, str]) – The column to check.
substring (str) – The substring to search for.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the contains function.

static left(column, length)[source]

Extract left N characters from string.

Parameters:

column (Union[Column, str]) – The column to extract from.
length (int) – Number of characters to extract from the left.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the left function.

static right(column, length)[source]

Extract right N characters from string.

Parameters:

column (Union[Column, str]) – The column to extract from.
length (int) – Number of characters to extract from the right.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the right function.

static bit_length(column)[source]

Get bit length of string.

Parameters:: column (Union[Column, str]) – The column to get bit length of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the bit_length function.

static startswith(column, substring)[source]

Check if string starts with substring.

Parameters:

column (Union[Column, str]) – The column to check.
substring (str) – The substring to check for.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the startswith function.

static endswith(column, substring)[source]

Check if string ends with substring.

Parameters:

column (Union[Column, str]) – The column to check.
substring (str) – The substring to check for.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the endswith function.

static like(column, pattern)[source]

SQL LIKE pattern matching.

Parameters:

column (Union[Column, str]) – The column to match.
pattern (str) – The LIKE pattern (supports % and _ wildcards).

Return type:: ColumnOperation
Returns:: ColumnOperation representing the like function.

static rlike(column, pattern)[source]

Regular expression pattern matching.

Parameters:

column (Union[Column, str]) – The column to match.
pattern (str) – The regular expression pattern.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the rlike function.

static replace(column, old, new)[source]

Replace occurrences of substring in string.

Parameters:

column (Union[Column, str]) – The column to replace in.
old (str) – The substring to replace.
new (str) – The replacement substring.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the replace function.

static substr(column, start, length=None)[source]

Alias for substring - Extract substring from string.

Parameters:

column (Union[Column, str]) – The column to extract from.
start (int) – Starting position (1-indexed).
length (Optional[int]) – Optional length of substring.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the substr function.

static split_part(column, delimiter, part)[source]

Extract part of string split by delimiter.

Parameters:

column (Union[Column, str]) – The column to split.
delimiter (str) – The delimiter to split on.
part (int) – The part number to extract (1-indexed).

Return type:: ColumnOperation
Returns:: ColumnOperation representing the split_part function.

static position(substring, column)[source]

Find position of substring in string (1-indexed).

Parameters:

substring (Union[Column, str]) – The substring to search for.
column (Union[Column, str]) – The column to search in.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the position function.

static octet_length(column)[source]

Get byte length (octet length) of string.

Parameters:: column (Union[Column, str]) – The column to get byte length of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the octet_length function.

static char(column)[source]

Convert integer to character.

Parameters:: column (Union[Column, str]) – The column containing integer values.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the char function.

static ucase(column)[source]

Alias for upper - Convert string to uppercase.

Parameters:: column (Union[Column, str]) – The column to convert.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the ucase function.

static lcase(column)[source]

Alias for lower - Convert string to lowercase.

Parameters:: column (Union[Column, str]) – The column to convert.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the lcase function.

static elt(n, *columns)[source]

Return element at index from list of columns.

Parameters:

n (Union[Column, int]) – The index (1-indexed).
*columns (Union[Column, str]) – The columns to choose from.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the elt function.

static regexp_replace(column, pattern, replacement)[source]

Replace regex pattern in string.

Parameters:

column (Union[Column, str]) – The column to replace in.
pattern (str) – The regex pattern to match.
replacement (str) – The replacement string.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the regexp_replace function.

static split(column, delimiter, limit=None)[source]

Split string by delimiter.

Parameters:

column (Union[Column, str]) – The column to split.
delimiter (str) – The delimiter to split on.
limit (Optional[int]) – Optional limit on the number of times the pattern is applied. If None or -1, no limit (default PySpark behavior).

Return type:: ColumnOperation
Returns:: ColumnOperation representing the split function.

static substring(column, start, length=None)[source]

Extract substring from string.

Parameters:

column (Union[Column, str]) – The column to extract from.
start (int) – Starting position (1-indexed).
length (Optional[int]) – Optional length of substring.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the substring function.

static concat(*columns)[source]

Concatenate multiple strings.

Parameters:: *columns (Union[Column, str]) – Columns or strings to concatenate.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the concat function.

static format_string(format_str, *columns)[source]

Format string using printf-style format string.

Parameters:

format_str (str) – The format string (e.g., "Hello %s, you are %d years old").
*columns (Union[Column, str]) – Columns to use as format arguments.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the format_string function.

static translate(column, matching_string, replace_string)[source]

Translate characters in string using character mapping.

Parameters:

column (Union[Column, str]) – The column to translate.
matching_string (str) – Characters to match.
replace_string (str) – Characters to replace with (must be same length as matching_string).

Return type:: ColumnOperation
Returns:: ColumnOperation representing the translate function.

static ascii(column)[source]

Get ASCII value of first character in string.

Parameters:: column (Union[Column, str]) – The column to get ASCII value of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the ascii function.

static base64(column)[source]

Encode string to base64.

Parameters:: column (Union[Column, str]) – The column to encode.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the base64 function.

static unbase64(column)[source]

Decode base64 string.

Parameters:: column (Union[Column, str]) – The column to decode.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the unbase64 function.

static regexp_extract_all(column, pattern, idx=0)[source]

Extract all matches of a regex pattern.

Parameters:

column (Union[Column, str]) – The column to extract from.
pattern (str) – The regex pattern to match.
idx (int) – Group index to extract (default: 0 for entire match).

Return type:: ColumnOperation
Returns:: ColumnOperation representing the regexp_extract_all function.

Example

>>> df.select(F.regexp_extract_all(F.col("text"), r"\d+", 0))

static array_join(column, delimiter, null_replacement=None)[source]

Join array elements with a delimiter.

Parameters:

column (Union[Column, str]) – The array column to join.
delimiter (str) – The delimiter to use for joining.
null_replacement (Optional[str]) – Optional string to replace nulls with.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the array_join function.

Example

>>> df.select(F.array_join(F.col("tags"), ", "))
>>> df.select(F.array_join(F.col("tags"), "|", "N/A"))
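
The two calls differ in how null elements are treated. The sketch below models the PySpark rule in plain Python, assuming Sparkless follows it: nulls are dropped when no replacement is given and substituted otherwise. `array_join_sketch` is a hypothetical name.

```python
from typing import List, Optional


def array_join_sketch(
    items: List[Optional[str]],
    delimiter: str,
    null_replacement: Optional[str] = None,
) -> str:
    """Join `items`, dropping None values unless a replacement is supplied."""
    if null_replacement is None:
        kept = [x for x in items if x is not None]
    else:
        kept = [null_replacement if x is None else x for x in items]
    return delimiter.join(kept)


tags = ["a", None, "b"]
dropped = array_join_sketch(tags, ", ")
replaced = array_join_sketch(tags, "|", "N/A")
```

`dropped` is "a, b" and `replaced` is "a|N/A|b".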

static reverse(column)[source]

Reverse a string column.

Parameters:: column (Union[Column, str]) – The string column to reverse.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the reverse function.

Example

>>> df.select(F.reverse(F.col("name")))

static repeat(column, n)[source]

Repeat a string N times.

Parameters:

column (Union[Column, str]) – The column to repeat.
n (int) – Number of times to repeat.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the repeat function.

Example

>>> df.select(F.repeat(F.col("text"), 3))

static initcap(column)[source]

Capitalize first letter of each word.

Parameters:: column (Union[Column, str]) – The column to capitalize.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the initcap function.

Example

>>> df.select(F.initcap(F.col("name")))

static soundex(column)[source]

Soundex encoding for phonetic matching.

Parameters:: column (Union[Column, str]) – The column to encode.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the soundex function.

Example

>>> df.select(F.soundex(F.col("name")))

static parse_url(url, part)[source]

Extract a part from a URL.

Parameters:

url (Union[Column, str]) – URL column or string.
part (str) – Part to extract (HOST, PATH, QUERY, REF, PROTOCOL, FILE, AUTHORITY, USERINFO).

Return type:: ColumnOperation
Returns:: ColumnOperation representing the parse_url function.

Example

>>> df.select(F.parse_url(F.col("url"), "HOST"))
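
To see what each part name refers to, the following sketch maps the eight names onto fields of `urllib.parse.urlsplit`. It is an approximation for illustration only: `parse_url_sketch` is a hypothetical helper, and returning None for an absent part is an assumption rather than documented Sparkless behaviour.

```python
from typing import Optional
from urllib.parse import urlsplit


def parse_url_sketch(url: str, part: str) -> Optional[str]:
    """Return one named component of `url`, or None when it is absent."""
    u = urlsplit(url)
    userinfo, at, _ = u.netloc.rpartition("@")
    parts = {
        "HOST": u.hostname,
        "PATH": u.path,
        "QUERY": u.query or None,
        "REF": u.fragment or None,
        "PROTOCOL": u.scheme,
        "FILE": u.path + ("?" + u.query if u.query else ""),
        "AUTHORITY": u.netloc,
        "USERINFO": userinfo if at else None,
    }
    return parts[part]


url = "https://user@example.com:8080/a/b?x=1#top"
host = parse_url_sketch(url, "HOST")
file_part = parse_url_sketch(url, "FILE")
```

For the sample URL, `host` is "example.com" and `file_part` is "/a/b?x=1" (path plus query).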

static url_encode(url)[source]

URL-encode a string.

Parameters:: url (Union[Column, str]) – String column to encode.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the url_encode function.

Example

>>> df.select(F.url_encode(F.col("text")))

static url_decode(url)[source]

URL-decode a string.

Parameters:: url (Union[Column, str]) – String column to decode.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the url_decode function.

Example

>>> df.select(F.url_decode(F.col("encoded")))

static concat_ws(sep, *cols)[source]

Concatenate multiple columns with a separator.

Parameters:

sep (str) – Separator string
*cols (Union[Column, str]) – Columns to concatenate

Return type:: ColumnOperation
Returns:: ColumnOperation representing concat_ws

Example

>>> df.select(F.concat_ws("-", F.col("first"), F.col("last")))

static regexp_extract(column, pattern, idx=0)[source]

Extract a specific group matched by a regex pattern.

Parameters:

column (Union[Column, str]) – Input column
pattern (str) – Regular expression pattern. Supports lookahead (?=…) and lookbehind (?<=…) assertions via Python fallback when Polars native support is unavailable.
idx (int) – Group index to extract (default 0)

Return type:: ColumnOperation
Returns:: ColumnOperation representing regexp_extract

Example

>>> df.select(F.regexp_extract(F.col("email"), r"(.+)@(.+)", 1))
>>> df.select(F.regexp_extract(F.col("text"), r"(?<=prefix_)\w+", 0))

Note

Fixed in version 3.23.0 (Issue #228): Added fallback support for regex patterns with lookahead and lookbehind assertions using Python’s re module when Polars native support is unavailable.
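
The group index and the lookbehind case can both be tried directly with Python's re module, which is the fallback engine named above. This is a sketch of the per-value logic only; `regexp_extract_sketch` is a hypothetical helper, and returning an empty string on no match follows PySpark's convention, which Sparkless is assumed to share.

```python
import re


def regexp_extract_sketch(text: str, pattern: str, idx: int = 0) -> str:
    """Return group `idx` of the first match, or "" when nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return ""
    # A group that did not participate in the match yields None.
    return match.group(idx) or ""


local_part = regexp_extract_sketch("alice@example.com", r"(.+)@(.+)", 1)
after_prefix = regexp_extract_sketch("prefix_abc", r"(?<=prefix_)\w+", 0)
no_match = regexp_extract_sketch("no digits", r"\d+", 0)
```

`local_part` is "alice", `after_prefix` is "abc", and `no_match` is the empty string.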

static substring_index(column, delim, count)[source]

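Return the part of the string before the count-th delimiter when count is positive (counting from the left), or the part after the count-th delimiter from the right when count is negative.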

Parameters:

column (Union[Column, str]) – Input string column
delim (str) – Delimiter string
count (int) – Number of delimiters (positive for left, negative for right)

Return type:: ColumnOperation
Returns:: ColumnOperation representing substring_index

Example

>>> df.select(F.substring_index(F.col("path"), "/", 2))
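
The sign of count decides which side of the string is kept. A plain-Python sketch of that rule follows; `substring_index_sketch` is a hypothetical name, and returning an empty string for a count of 0 mirrors PySpark rather than anything stated here.

```python
def substring_index_sketch(text: str, delim: str, count: int) -> str:
    """Keep the pieces left of the count-th delimiter, or right of it if count < 0."""
    if count == 0 or delim == "":
        return ""
    pieces = text.split(delim)
    if count > 0:
        return delim.join(pieces[:count])
    return delim.join(pieces[count:])


path = "a/b/c/d"
left_two = substring_index_sketch(path, "/", 2)
right_one = substring_index_sketch(path, "/", -1)
everything = substring_index_sketch(path, "/", 10)
```

`left_two` is "a/b", `right_one` is "d", and `everything` is the unchanged "a/b/c/d" because the string holds fewer than ten delimiters.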

static format_number(column, d)[source]

Format number with d decimal places and thousands separator.

Parameters:

column (Union[Column, str]) – Numeric column
d (int) – Number of decimal places

Return type:: ColumnOperation
Returns:: ColumnOperation representing format_number

Example

>>> df.select(F.format_number(F.col("amount"), 2))
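
The output shape (comma as thousands separator, fixed number of decimals) matches Python's `,` format specifier, which gives a quick way to preview results. `format_number_sketch` is a hypothetical helper; rounding in edge cases may differ from the engine's.

```python
def format_number_sketch(value: float, d: int) -> str:
    """Format `value` with thousands separators and `d` decimal places."""
    return f"{value:,.{d}f}"


amount = format_number_sketch(1234567.891, 2)
whole = format_number_sketch(1000, 0)
```

`amount` is "1,234,567.89" and `whole` is "1,000".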

static instr(column, substr)[source]

Locate the position of the first occurrence of substr (1-indexed).

Parameters:

column (Union[Column, str]) – Input string column
substr (str) – Substring to locate

Return type:: ColumnOperation
Returns:: ColumnOperation representing instr

Example

>>> df.select(F.instr(F.col("text"), "spark"))

static locate(substr, column, pos=1)[source]

Locate the position of substr starting from pos (1-indexed).

Parameters:

substr (str) – Substring to locate
column (Union[Column, str]) – Input string column
pos (int) – Starting position (default 1)

Return type:: ColumnOperation
Returns:: ColumnOperation representing locate

Example

>>> df.select(F.locate("spark", F.col("text"), 1))
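
Positions are 1-indexed, which differs from Python's 0-indexed `str.find`. The sketch below shows the off-by-one translation; `locate_sketch` is a hypothetical name, and returning 0 when the substring is missing follows the PySpark convention.

```python
def locate_sketch(substr: str, text: str, pos: int = 1) -> int:
    """1-indexed position of `substr` at or after `pos`; 0 if not found."""
    if pos < 1:
        return 0
    # str.find is 0-indexed and returns -1 on failure, so +1 maps both cases.
    return text.find(substr, pos - 1) + 1


text = "pyspark and spark"
first = locate_sketch("spark", text)
later = locate_sketch("spark", text, 4)
missing = locate_sketch("flink", text)
```

`first` is 3, `later` is 13, and `missing` is 0.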

static lpad(column, len, pad)[source]

Left-pad string column to length len with pad string.

Parameters:

column (Union[Column, str]) – Input string column
len (int) – Target length
pad (str) – Padding string

Return type:: ColumnOperation
Returns:: ColumnOperation representing lpad

Example

>>> df.select(F.lpad(F.col("id"), 5, "0"))

static rpad(column, len, pad)[source]

Right-pad string column to length len with pad string.

Parameters:

column (Union[Column, str]) – Input string column
len (int) – Target length
pad (str) – Padding string

Return type:: ColumnOperation
Returns:: ColumnOperation representing rpad

Example

>>> df.select(F.rpad(F.col("id"), 5, "0"))
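
Two details of padding are easy to miss: a multi-character pad string is cycled, and in PySpark a value longer than len is truncated to len. The sketch below models both lpad and rpad under the assumption that Sparkless matches PySpark here; the helper names are hypothetical.

```python
def lpad_sketch(text: str, length: int, pad: str) -> str:
    """Left-pad `text` to `length`, truncating when it is already longer."""
    if length <= len(text) or not pad:
        return text[: max(length, 0)]
    fill = (pad * length)[: length - len(text)]
    return fill + text


def rpad_sketch(text: str, length: int, pad: str) -> str:
    """Right-pad `text` to `length`, truncating when it is already longer."""
    if length <= len(text) or not pad:
        return text[: max(length, 0)]
    fill = (pad * length)[: length - len(text)]
    return text + fill


padded_id = lpad_sketch("7", 5, "0")
cycled = rpad_sketch("ab", 5, "xy")
truncated = lpad_sketch("123456", 5, "0")
```

`padded_id` is "00007", `cycled` is "abxyx", and `truncated` is "12345".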

static levenshtein(left, right)[source]

Compute Levenshtein distance between two strings.

Parameters:

left (Union[Column, str]) – First string column
right (Union[Column, str]) – Second string column

Return type:: ColumnOperation
Returns:: ColumnOperation representing levenshtein

Example

>>> df.select(F.levenshtein(F.col("word1"), F.col("word2")))
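
Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. For reference, here is the textbook dynamic-programming formulation in plain Python; it is a generic sketch with a hypothetical name, not the code Sparkless runs.

```python
def levenshtein_sketch(a: str, b: str) -> int:
    """Edit distance between `a` and `b` using two rolling rows."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


distance = levenshtein_sketch("kitten", "sitting")
```

`distance` is 3: two substitutions (k to s, e to i) and one insertion (g).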

static overlay(src, replace, pos, len=-1)[source]

Replace part of a string with another string starting at a position (PySpark 3.0+).

Parameters:

src (Union[Column, str]) – Source string column
replace (Union[Column, str]) – Replacement string
pos (Union[Column, int]) – Starting position (1-indexed)
len (Union[Column, int]) – Number of characters to replace (default -1, which in PySpark means the length of the replacement string)

Return type:: ColumnOperation
Returns:: ColumnOperation for overlay operation

Example

>>> df.select(F.overlay(F.col("text"), F.lit("NEW"), F.lit(5), F.lit(3)))

static bin(column)[source]

Convert to binary string representation.

Parameters:: column (Union[Column, str]) – Numeric column
Return type:: ColumnOperation
Returns:: ColumnOperation representing bin

static hex(column)[source]

Convert to hexadecimal string.

Parameters:: column (Union[Column, str]) – Column to convert
Return type:: ColumnOperation
Returns:: ColumnOperation representing hex

static unhex(column)[source]

Convert hex string to binary.

Parameters:: column (Union[Column, str]) – Hex string column
Return type:: ColumnOperation
Returns:: ColumnOperation representing unhex

static hash(*cols)[source]

Compute hash value of given columns.

Parameters:: *cols (Union[Column, str]) – Columns to hash
Return type:: ColumnOperation
Returns:: ColumnOperation representing hash

static xxhash64(*cols)[source]

Compute xxHash64 value of given columns (all PySpark versions).

Parameters:: *cols (Union[Column, str]) – Columns to hash
Return type:: ColumnOperation
Returns:: ColumnOperation representing xxhash64

static encode(column, charset)[source]

Encode string to binary using charset.

Parameters:

column (Union[Column, str]) – String column
charset (str) – Character set (e.g., 'UTF-8')

Return type:: ColumnOperation
Returns:: ColumnOperation representing encode

static decode(column, charset)[source]

Decode binary to string using charset.

Parameters:

column (Union[Column, str]) – Binary column
charset (str) – Character set (e.g., 'UTF-8')

Return type:: ColumnOperation
Returns:: ColumnOperation representing decode

static conv(column, from_base, to_base)[source]

Convert number from one base to another.

Parameters:

column (Union[Column, str]) – Number column
from_base (int) – Source base (2-36)
to_base (int) – Target base (2-36)

Return type:: ColumnOperation
Returns:: ColumnOperation representing conv

static md5(column)[source]

Calculate MD5 hash of string (PySpark 3.0+).

Parameters:: column (Union[Column, str]) – String column to hash
Return type:: ColumnOperation
Returns:: ColumnOperation representing md5 function (returns 32-char hex string)

Example

>>> df.select(F.md5(F.col("text")))

static sha1(column)[source]

Calculate SHA-1 hash of string (PySpark 3.0+).

Parameters:: column (Union[Column, str]) – String column to hash
Return type:: ColumnOperation
Returns:: ColumnOperation representing sha1 function (returns 40-char hex string)

Example

>>> df.select(F.sha1(F.col("text")))

static sha2(column, numBits)[source]

Calculate SHA-2 family hash (PySpark 3.0+).

Parameters:

column (Union[Column, str]) – String column to hash
numBits (int) – Bit length - 224, 256, 384, or 512

Return type:: ColumnOperation
Returns:: ColumnOperation representing sha2 function (returns hex string)

Example

>>> df.select(F.sha2(F.col("text"), 256))
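
The length of the returned hex string follows from numBits: each hex character carries 4 bits, so a 256-bit digest has 64 characters. The sketch below checks this with the standard hashlib module; `sha2_hex` and `_SHA2` are hypothetical names, and hashing the UTF-8 bytes of the string is an assumption about the encoding.

```python
import hashlib

# Map each supported bit length to the matching hashlib constructor.
_SHA2 = {
    224: hashlib.sha224,
    256: hashlib.sha256,
    384: hashlib.sha384,
    512: hashlib.sha512,
}


def sha2_hex(text: str, num_bits: int) -> str:
    """Hex digest of `text` for the requested SHA-2 variant."""
    if num_bits not in _SHA2:
        raise ValueError(f"numBits must be one of {sorted(_SHA2)}")
    return _SHA2[num_bits](text.encode("utf-8")).hexdigest()


lengths = {bits: len(sha2_hex("spark", bits)) for bits in _SHA2}
```

`lengths` maps 224, 256, 384, and 512 to 56, 64, 96, and 128 characters respectively.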

static crc32(column)[source]

Calculate CRC32 checksum (PySpark 3.0+).

Parameters:: column (Union[Column, str]) – String column to checksum
Return type:: ColumnOperation
Returns:: ColumnOperation representing crc32 function (returns signed 32-bit int)

Example

>>> df.select(F.crc32(F.col("text")))

static to_str(column)[source]

Convert column to string representation (all PySpark versions).

Parameters:: column (Union[Column, str]) – Column to convert to string
Return type:: ColumnOperation
Returns:: Column operation for string conversion

Example

>>> df.select(F.to_str(F.col("value")))

static ilike(column, pattern)[source]

Case-insensitive LIKE pattern matching.

Parameters:

column (Union[Column, str]) – The column to match against.
pattern (str) – The pattern to match (SQL LIKE pattern).

Return type:: ColumnOperation
Returns:: ColumnOperation representing the ilike function.

static find_in_set(column, str_list)[source]

Find position of value in comma-separated string list.

Parameters:

column (Union[Column, str]) – The value to find.
str_list (Union[Column, str]) – The comma-separated string list.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the find_in_set function.

static regexp_count(column, pattern)[source]

Count occurrences of regex pattern in string.

Parameters:

column (Union[Column, str]) – The column to search in.
pattern (str) – The regex pattern to count.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the regexp_count function.

static regexp_like(column, pattern)[source]

Regex pattern matching (similar to rlike).

Parameters:

column (Union[Column, str]) – The column to match against.
pattern (str) – The regex pattern to match.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the regexp_like function.

static regexp_substr(column, pattern, pos=1, occurrence=1)[source]

Extract substring matching regex pattern.

Parameters:

column (Union[Column, str]) – The column to extract from.
pattern (str) – The regex pattern to match.
pos (int) – Starting position (1-indexed).
occurrence (int) – Which occurrence to extract.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the regexp_substr function.

static regexp_instr(column, pattern, pos=1, occurrence=1)[source]

Find position of regex pattern match.

Parameters:

column (Union[Column, str]) – The column to search in.
pattern (str) – The regex pattern to find.
pos (int) – Starting position (1-indexed).
occurrence (int) – Which occurrence to find.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the regexp_instr function.
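The relationship between regexp_count, regexp_substr, and regexp_instr can be sketched with Python's re module. These helpers are hypothetical stand-ins: they ignore the pos and occurrence parameters, and Python's regex dialect is assumed to be close enough to the engine's for simple patterns.

```python
import re
from typing import Optional

def count_matches(text: str, pattern: str) -> int:
    """Number of non-overlapping matches of `pattern` in `text`."""
    return sum(1 for _ in re.finditer(pattern, text))

def first_substr(text: str, pattern: str) -> Optional[str]:
    """First substring matching `pattern`, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

def first_instr(text: str, pattern: str) -> int:
    """1-based position of the first match, or 0 when nothing matches."""
    m = re.search(pattern, text)
    return m.start() + 1 if m else 0

text = "a1b22c333"
print(count_matches(text, r"\d+"))  # 3
print(first_substr(text, r"\d+"))   # 1
print(first_instr(text, r"\d+"))    # 2
```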

static regexp(column, pattern)[source]

Alias for rlike - regex pattern matching.

Parameters:

column (Union[Column, str]) – The column to match against.
pattern (str) – The regex pattern to match.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the regexp function.

static sentences(column, language=None, country=None)[source]

Split text into sentences.

Parameters:

column (Union[Column, str]) – The column containing text.
language (Optional[str]) – Language code (optional).
country (Optional[str]) – Country code (optional).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the sentences function.

static printf(format_str, *columns)[source]

Formatted string (like sprintf).

Parameters:

format_str (str) – Format string with placeholders.
*columns (Union[Column, str]) – Columns to format.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the printf function.

static to_char(column, format=None)[source]

Convert number/date to character string.

Parameters:

column (Union[Column, str]) – The column to convert.
format (Optional[str]) – Optional format string.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the to_char function.

static to_varchar(column, length=None)[source]

Convert to varchar type.

Parameters:

column (Union[Column, str]) – The column to convert.
length (Optional[int]) – Optional length for varchar.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the to_varchar function.

static typeof(column)[source]

Get type of value as string.

Parameters:: column (Union[Column, str]) – The column to get type of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the typeof function.

static stack(n, *cols)[source]

Stack multiple columns into rows.

Parameters:

n (int) – Number of rows to create per input row.
*cols (Union[Column, str, Any]) – Columns to stack.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the stack function.
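As a rough illustration of what "stacking" means, the sketch below splits a flat list of values into n rows, assuming the Spark SQL convention that short rows are padded with nulls. stack_rows is a hypothetical helper operating on plain Python values, not on columns.

```python
from typing import Any, List, Tuple

def stack_rows(n: int, *values: Any) -> List[Tuple[Any, ...]]:
    """Split `values` into `n` rows of equal width, padding the tail with None."""
    if n <= 0:
        raise ValueError("n must be positive")
    width = -(-len(values) // n)  # ceiling division: columns per output row
    padded = list(values) + [None] * (n * width - len(values))
    return [tuple(padded[i * width:(i + 1) * width]) for i in range(n)]

print(stack_rows(2, "a", 1, "b", 2))  # [('a', 1), ('b', 2)]
print(stack_rows(2, 1, 2, 3))         # [(1, 2), (3, None)]
```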

static sha(column)[source]

Alias for sha1 - Calculate SHA-1 hash of string (PySpark 3.5+).

Parameters:: column (Union[Column, str]) – String column to hash.
Return type:: ColumnOperation
Returns:: ColumnOperation representing sha function (returns 40-char hex string).

Example

>>> df.select(F.sha(F.col("text")))

static mask(column, upperChar=None, lowerChar=None, digitChar=None, otherChar=None)[source]

Mask sensitive data in a string (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – String column to mask.
upperChar (Optional[str]) – Character to use for uppercase letters (default: ‘X’).
lowerChar (Optional[str]) – Character to use for lowercase letters (default: ‘x’).
digitChar (Optional[str]) – Character to use for digits (default: ‘n’).
otherChar (Optional[str]) – Character to use for other characters (default: ‘-’).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the mask function.

Example

>>> df.select(F.mask(F.col("email"), upperChar='U', lowerChar='l', digitChar='d'))
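The character classes can be illustrated with a small plain-Python sketch that uses the defaults listed in the parameter descriptions above. mask_text is a hypothetical stand-in; how the library treats nulls or non-ASCII characters is not covered here.

```python
def mask_text(text: str, upper_char: str = "X", lower_char: str = "x",
              digit_char: str = "n", other_char: str = "-") -> str:
    """Replace each character according to its class (defaults as documented above)."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append(upper_char)
        elif ch.islower():
            out.append(lower_char)
        elif ch.isdigit():
            out.append(digit_char)
        else:
            out.append(other_char)
    return "".join(out)

print(mask_text("Ab1"))                       # Xxn
print(mask_text("Bob42@x.io", "U", "l", "d"))  # Ulldd-l-ll
```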

static json_array_length(column, path=None)[source]

Get the length of a JSON array (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – JSON column to get array length from.
path (Optional[str]) – Optional JSONPath expression to specify array location.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the json_array_length function.

Example

>>> df.select(F.json_array_length(F.col("json_col"), "$.array"))

static json_object_keys(column, path=None)[source]

Get the keys of a JSON object (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – JSON column to get object keys from.
path (Optional[str]) – Optional JSONPath expression to specify object location.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the json_object_keys function.

Example

>>> df.select(F.json_object_keys(F.col("json_col"), "$.object"))
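For a single JSON string, the behavior of json_array_length and json_object_keys can be approximated with the standard json module. The helpers below are hypothetical and do not model the optional path argument; returning None for invalid or mismatched input is an assumption based on Spark's null-on-failure convention.

```python
import json
from typing import Any, List, Optional

def _parse(doc: str) -> Any:
    """Parse JSON text, returning None when the text is not valid JSON."""
    try:
        return json.loads(doc)
    except json.JSONDecodeError:
        return None

def array_length(doc: str) -> Optional[int]:
    """Length of the outermost JSON array, or None if `doc` is not an array."""
    value = _parse(doc)
    return len(value) if isinstance(value, list) else None

def object_keys(doc: str) -> Optional[List[str]]:
    """Keys of the outermost JSON object, or None if `doc` is not an object."""
    value = _parse(doc)
    return list(value) if isinstance(value, dict) else None

print(array_length("[1, 2, 3]"))         # 3
print(object_keys('{"a": 1, "b": 2}'))   # ['a', 'b']
```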

static xpath_number(column, path)[source]

Extract number from XML using XPath (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – XML column to extract from.
path (str) – XPath expression.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the xpath_number function.

Example

>>> df.select(F.xpath_number(F.col("xml_col"), "/root/value"))

static user()[source]

Get current user name (PySpark 3.5+).

Return type:: ColumnOperation
Returns:: ColumnOperation representing the user function.

Example

>>> df.select(F.user())

Math Functions

Mathematical functions for Sparkless.

This module provides comprehensive mathematical functions that match PySpark’s math function API. Includes arithmetic operations, rounding functions, trigonometric functions, and mathematical transformations for numerical processing in DataFrames.

Key Features:

Complete PySpark math function API compatibility
Arithmetic operations (abs, round, ceil, floor)
Advanced math functions (sqrt, exp, log, pow)
Trigonometric functions (sin, cos, tan)
Type-safe operations with proper return types
Support for both column references and numeric literals
Proper handling of edge cases and null values

Example

>>> from sparkless.sql import SparkSession, functions as F
>>> spark = SparkSession("test")
>>> data = [{"value": 3.7, "angle": 1.57}]
>>> df = spark.createDataFrame(data)
>>> df.select(
...     F.round(F.col("value"), 1),
...     F.ceil(F.col("value")),
...     F.sin(F.col("angle"))
... ).show()
DataFrame[1 rows, 3 columns]

round(value, 1) CEIL(value) SIN(angle)
3.7             4.0         0.9999996829318346

class sparkless.functions.math.MathFunctions[source]

Bases: object

Collection of mathematical functions.

static abs(column)[source]

Get absolute value.

Parameters:: column (Union[Column, str]) – The column to get absolute value of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the abs function.

static positive(column)[source]

Return positive value (identity function).

Parameters:: column (Union[Column, str]) – The column to return as positive.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the positive function.

static negative(column)[source]

Return negative value.

Parameters:: column (Union[Column, str]) – The column to negate.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the negative function.

static round(column, scale=0)[source]

Round to specified number of decimal places.

Parameters:

column (Union[Column, str]) – The column to round.
scale (int) – Number of decimal places (default: 0).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the round function.

static ceil(column)[source]

Round up to nearest integer.

Parameters:: column (Union[Column, str]) – The column to round up.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the ceil function.

static ceiling(column)[source]

Alias for ceil - Round up to nearest integer.

Parameters:: column (Union[Column, str]) – The column to round up.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the ceiling function.

static floor(column)[source]

Round down to nearest integer.

Parameters:: column (Union[Column, str]) – The column to round down.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the floor function.

static sqrt(column)[source]

Get square root.

Parameters:: column (Union[Column, str]) – The column to get square root of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the sqrt function.

static exp(column)[source]

Get exponential (e^x).

Parameters:: column (Union[Column, str]) – The column to get exponential of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the exp function.

static log(base, column=None)[source]

Get logarithm.

PySpark signature: log(base, column) or log(column) for natural log.

Parameters:

base (Union[Column, str, float, int, None]) – Base of the logarithm, given as a numeric constant or a Column. If column is None, this argument is treated as the input column and the natural logarithm is computed.
column (Union[Column, str, None]) – The column to take the logarithm of. If None, base is used as the input column (natural log).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the log function.
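The two call forms can be confusing because the first positional argument changes meaning. The sketch below mirrors that dispatch on plain floats; log_value is a hypothetical helper, not the library function.

```python
import math
from typing import Optional

def log_value(base: float, x: Optional[float] = None) -> float:
    """Mirror the two call forms: log(x) is natural log, log(base, x) uses `base`."""
    if x is None:
        # Single-argument form: `base` actually holds the value.
        return math.log(base)
    # Two-argument form: change of base.
    return math.log(x) / math.log(base)

print(log_value(math.e))   # natural log of e
print(log_value(2, 8))     # log base 2 of 8
```

Note the argument order in the two-argument form: the base comes first, the value second.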

static log10(column)[source]

Get base-10 logarithm (PySpark 3.0+).

Parameters:: column (Union[Column, str]) – The column to get log10 of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the log10 function.

Example

>>> df.select(F.log10(F.col("value")))

static log2(column)[source]

Get base-2 logarithm (PySpark 3.0+).

Parameters:: column (Union[Column, str]) – The column to get log2 of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the log2 function.

Example

>>> df.select(F.log2(F.col("value")))

static log1p(column)[source]

Get natural logarithm of (1 + x) (PySpark 3.0+).

Computes ln(1 + x) accurately for small values of x.

Parameters:: column (Union[Column, str]) – The column to compute log1p of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the log1p function.

Example

>>> df.select(F.log1p(F.col("value")))
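The accuracy claim is easy to check with Python's math module, which exposes the same pair of functions. This is a standalone numeric illustration, not Sparkless code.

```python
import math

x = 1e-10
naive = math.log(1 + x)    # 1 + x is rounded to a double before the log is taken
accurate = math.log1p(x)   # computed without forming 1 + x explicitly
reference = x - x**2 / 2   # leading terms of the Taylor series of ln(1 + x)

# The naive form loses most of its significant digits for tiny x.
print(abs(naive - reference) > abs(accurate - reference))  # True
```

The same reasoning applies to expm1 versus exp(x) - 1.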

static expm1(column)[source]

Get exp(x) - 1 (PySpark 3.0+).

Computes e^x - 1 accurately for small values of x.

Parameters:: column (Union[Column, str]) – The column to compute expm1 of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the expm1 function.

Example

>>> df.select(F.expm1(F.col("value")))

static pow(column, exponent)[source]

Raise to power.

Parameters:

column (Union[Column, str]) – The column to raise to power.
exponent (Union[Column, float, int]) – The exponent.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the pow function.

static power(column, exponent)[source]

Alias for pow - Raise to power.

Parameters:

column (Union[Column, str]) – The column to raise to power.
exponent (Union[Column, float, int]) – The exponent.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the power function.

static sin(column)[source]

Get sine.

Parameters:: column (Union[Column, str]) – The column to get sine of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the sin function.

static cos(column)[source]

Get cosine.

Parameters:: column (Union[Column, str]) – The column to get cosine of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the cos function.

static tan(column)[source]

Get tangent.

Parameters:: column (Union[Column, str]) – The column to get tangent of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the tan function.

static sign(column)[source]

Get sign of number (-1, 0, or 1).

Parameters:: column (Union[Column, str]) – The column to get sign of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the sign function.

static greatest(*columns)[source]

Get the greatest value among columns.

Parameters:: *columns (Union[Column, str]) – Columns to compare.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the greatest function.

static least(*columns)[source]

Get the least value among columns.

Parameters:: *columns (Union[Column, str]) – Columns to compare.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the least function.

static acosh(col)[source]

Compute inverse hyperbolic cosine (arc hyperbolic cosine).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the acosh function.

Note

Input must be >= 1. Returns NaN for invalid inputs.

static asinh(col)[source]

Compute inverse hyperbolic sine (arc hyperbolic sine).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the asinh function.

static atanh(col)[source]

Compute inverse hyperbolic tangent (arc hyperbolic tangent).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the atanh function.

Note

Input must be in range (-1, 1). Returns NaN for invalid inputs.

static acos(col)[source]

Compute inverse cosine (arc cosine).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the acos function.

static asin(col)[source]

Compute inverse sine (arc sine).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the asin function.

static atan(col)[source]

Compute inverse tangent (arc tangent).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the atan function.

static atan2(y, x)[source]

Compute 2-argument arctangent (PySpark 3.0+).

Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).

Parameters:

y (Union[Column, str, float, int]) – Y coordinate (column or numeric value).
x (Union[Column, str, float, int]) – X coordinate (column or numeric value).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the atan2 function.

Example

>>> df.select(F.atan2(F.col("y"), F.col("x")))
>>> df.select(F.atan2(F.lit(1.0), F.lit(1.0)))  # Returns π/4

static cosh(col)[source]

Compute hyperbolic cosine.

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the cosh function.

static sinh(col)[source]

Compute hyperbolic sine.

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the sinh function.

static tanh(col)[source]

Compute hyperbolic tangent.

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the tanh function.

static degrees(col)[source]

Convert radians to degrees.

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the degrees function.

static radians(col)[source]

Convert degrees to radians.

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the radians function.

static cbrt(col)[source]

Compute cube root.

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the cbrt function.

static factorial(col)[source]

Compute factorial.

Parameters:: col (Union[Column, str]) – Column or column name (non-negative integers).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the factorial function.

static rand(seed=None)[source]

Generate a random column with i.i.d. samples from U[0.0, 1.0].

Parameters:: seed (Optional[int]) – Random seed (optional).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the rand function.

static randn(seed=None)[source]

Generate a random column with i.i.d. samples from standard normal distribution.

Parameters:: seed (Optional[int]) – Random seed (optional).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the randn function.

static rint(col)[source]

Round to nearest integer using banker’s rounding (half to even).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the rint function.

static bround(col, scale=0)[source]

Round using HALF_EVEN rounding mode (banker’s rounding).

Parameters:

col (Union[Column, str]) – Column or column name.
scale (int) – Number of decimal places (default 0).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the bround function.
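HALF_EVEN differs from the more familiar HALF_UP only when a value sits exactly halfway between two candidates. The sketch below uses Python's decimal module to show the difference; round_decimal is a hypothetical helper. In PySpark, round uses HALF_UP and bround uses HALF_EVEN; that Sparkless's round matches is an assumption here.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def round_decimal(value: str, scale: int, mode: str) -> Decimal:
    """Round a decimal string to `scale` places using the given rounding mode."""
    quantum = Decimal(10) ** -scale  # scale=0 -> 1, scale=2 -> 0.01
    return Decimal(value).quantize(quantum, rounding=mode)

# Ties go up under HALF_UP and to the even neighbour under HALF_EVEN.
for v in ("0.5", "1.5", "2.5"):
    print(v, round_decimal(v, 0, ROUND_HALF_UP), round_decimal(v, 0, ROUND_HALF_EVEN))
```

Values are passed as strings so that the tie is exact; binary floats such as 2.675 are often not exactly halfway.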

static hypot(col1, col2)[source]

Compute sqrt(col1^2 + col2^2) (hypotenuse).

Parameters:

col1 (Union[Column, str]) – First column
col2 (Union[Column, str]) – Second column

Return type:

ColumnOperation

Returns:

ColumnOperation representing the hypot function.

static nanvl(col1, col2)[source]

Returns col1 if not NaN, or col2 if col1 is NaN.

Parameters:

col1 (Union[Column, str]) – First column
col2 (Union[Column, str, int, float]) – Second column or literal value (replacement for NaN)

Return type:

ColumnOperation

Returns:

ColumnOperation representing the nanvl function.

static signum(col)[source]

Compute the signum function (sign: -1, 0, or 1).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the signum function.

static cot(col)[source]

Compute cotangent (PySpark 3.3+).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the cot function.

static csc(col)[source]

Compute cosecant (PySpark 3.3+).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the csc function.

static sec(col)[source]

Compute secant (PySpark 3.3+).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the sec function.

static e()[source]

Return Euler’s number e (PySpark 3.5+).

Return type:: ColumnOperation
Returns:: ColumnOperation representing Euler’s number constant.

static pi()[source]

Return the value of pi (PySpark 3.5+).

Return type:: ColumnOperation
Returns:: ColumnOperation representing pi constant.

static ln(col)[source]

Compute natural logarithm (alias for log) (PySpark 3.5+).

Parameters:: col (Union[Column, str]) – Column or column name.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the ln function.

static toDegrees(column)[source]

Deprecated alias for degrees (all PySpark versions).

Use degrees instead.

Parameters:: column (Union[Column, str]) – Angle in radians.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the degrees conversion.

static toRadians(column)[source]

Deprecated alias for radians (all PySpark versions).

Use radians instead.

Parameters:: column (Union[Column, str]) – Angle in degrees.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the radians conversion.

static pmod(dividend, divisor)[source]

Positive modulo: the remainder is non-negative whenever the divisor is positive.

Parameters:

dividend (Union[Column, str, int]) – The dividend.
divisor (Union[Column, str, int]) – The divisor.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the pmod function.
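The difference from an ordinary remainder shows up for negative dividends. The sketch below follows the Spark SQL definition (truncated remainder, then a shift back into range); that Sparkless matches it exactly is an assumption, and both helper names are hypothetical. Division by zero raises here, whereas a DataFrame engine would typically produce null.

```python
def truncated_mod(a: int, n: int) -> int:
    """Remainder whose sign follows the dividend, as in Java or C."""
    r = abs(a) % abs(n)
    return -r if a < 0 else r

def pmod_py(dividend: int, divisor: int) -> int:
    """Spark-style positive modulo built on the truncated remainder."""
    r = truncated_mod(dividend, divisor)
    # A negative remainder is shifted by the divisor and reduced again.
    return truncated_mod(r + divisor, divisor) if r < 0 else r

print(truncated_mod(-7, 3))  # -1
print(pmod_py(-7, 3))        # 2
print(pmod_py(7, 3))         # 1
```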

static negate(column)[source]

Negate value (alias for negative).

Parameters:: column (Union[Column, str]) – The column to negate.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the negate function.

static getbit(column, bit)[source]

Get bit at specified position (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – The column containing the integer.
bit (Union[Column, str, int]) – The bit position (0-indexed from the least significant bit).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the getbit function.

Example

>>> df.select(F.getbit(F.col("value"), 3))
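On a plain integer, the same extraction is a shift followed by a mask. getbit_py below is a hypothetical stand-in that shows which bit a given position refers to.

```python
def getbit_py(value: int, bit: int) -> int:
    """Return the bit at position `bit`, counting from 0 at the least significant bit."""
    if bit < 0:
        raise ValueError("bit position must be non-negative")
    return (value >> bit) & 1

# 10 is 0b1010: bits 3 and 1 are set, bits 2 and 0 are clear.
print([getbit_py(0b1010, i) for i in (3, 2, 1, 0)])  # [1, 0, 1, 0]
```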

static width_bucket(value, min_value, max_value, num_buckets)[source]

Compute histogram bucket number for value (PySpark 3.5+).

Parameters:

value (Union[Column, str]) – The value to compute bucket for.
min_value (Union[Column, str, float]) – Minimum value of the range.
max_value (Union[Column, str, float]) – Maximum value of the range.
num_buckets (Union[Column, str, int]) – Number of buckets.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the width_bucket function.

Example

>>> df.select(F.width_bucket(F.col("value"), 0.0, 100.0, 10))
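The bucket numbering follows the SQL WIDTH_BUCKET convention: buckets are numbered from 1, with 0 and num_buckets + 1 reserved for values below and at or above the range. The sketch below assumes min_value < max_value and that Sparkless follows this convention; width_bucket_py is a hypothetical helper.

```python
def width_bucket_py(value: float, min_value: float, max_value: float,
                    num_buckets: int) -> int:
    """Equal-width histogram bucket for `value` over [min_value, max_value)."""
    if num_buckets <= 0 or min_value >= max_value:
        raise ValueError("need num_buckets > 0 and min_value < max_value")
    if value < min_value:
        return 0                   # underflow bucket
    if value >= max_value:
        return num_buckets + 1     # overflow bucket
    fraction = (value - min_value) / (max_value - min_value)
    return int(num_buckets * fraction) + 1

print(width_bucket_py(35.0, 0.0, 100.0, 10))   # 4
print(width_bucket_py(-1.0, 0.0, 100.0, 10))   # 0
print(width_bucket_py(100.0, 0.0, 100.0, 10))  # 11
```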

DateTime Functions

Datetime functions for Sparkless.

This module provides comprehensive datetime functions that match PySpark’s datetime function API. Includes date/time conversion, extraction, and manipulation operations for temporal data processing in DataFrames.

Key Features:

Complete PySpark datetime function API compatibility
Current date/time functions (current_timestamp, current_date)
Date conversion (to_date, to_timestamp)
Date extraction (year, month, day, hour, minute, second)
Date manipulation (dayofweek, dayofyear, weekofyear, quarter)
Type-safe operations with proper return types
Support for various date formats and time zones
Proper handling of date parsing and validation

Example

>>> from sparkless.sql import SparkSession, functions as F
>>> spark = SparkSession("test")
>>> data = [{"timestamp": "2024-01-15 10:30:00", "date_str": "2024-01-15"}]
>>> df = spark.createDataFrame(data)
>>> df.select(
...     F.year(F.col("timestamp")),
...     F.month(F.col("timestamp")),
...     F.to_date(F.col("date_str"))
... ).show()
DataFrame[1 rows, 3 columns]
year(timestamp) month(timestamp) to_date(date_str)
2024            1                2024-01-15

class sparkless.functions.datetime.DateTimeFunctions[source]

Bases: object

Collection of datetime functions.

static current_timestamp()[source]

Get current timestamp.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the current_timestamp function.
Raises:: RuntimeError – If no active SparkSession is available

static current_date()[source]

Get current date.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the current_date function.
Raises:: RuntimeError – If no active SparkSession is available

static now()[source]

Alias for current_timestamp - Get current timestamp.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the now function.

static curdate()[source]

Alias for current_date - Get current date.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the curdate function.

static days(column)[source]

Convert number to days interval.

Parameters:: column (Union[Column, str, int]) – The number of days (can be column or literal).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the days function.

static hours(column)[source]

Convert number to hours interval.

Parameters:: column (Union[Column, str, int]) – The number of hours (can be column or literal).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the hours function.

static months(column)[source]

Convert number to months interval.

Parameters:: column (Union[Column, str, int]) – The number of months (can be column or literal).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the months function.

static years(column)[source]

Convert number to years interval.

Parameters:: column (Union[Column, str, int]) – The number of years (can be column or literal).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the years function.

static localtimestamp()[source]

Get local timestamp (without timezone).

Return type:: ColumnOperation
Returns:: ColumnOperation representing the localtimestamp function.

static dateadd(date_part, value, date)[source]

SQL Server style date addition.

Parameters:

date_part (str) – The date part to add (year, month, day, etc.).
value (Union[Column, str, int]) – The value to add.
date (Union[Column, str]) – The date column.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the dateadd function.

static datepart(date_part, date)[source]

SQL Server style date part extraction.

Parameters:

date_part (str) – The date part to extract (year, month, day, etc.).
date (Union[Column, str]) – The date column.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the datepart function.

static make_timestamp(year, month, day, hour=0, minute=0, second=0)[source]

Create timestamp from components.

Parameters:

year (Union[Column, str, int]) – Year component.
month (Union[Column, str, int]) – Month component.
day (Union[Column, str, int]) – Day component.
hour (Union[Column, str, int]) – Hour component (default 0).
minute (Union[Column, str, int]) – Minute component (default 0).
second (Union[Column, str, int]) – Second component (default 0).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the make_timestamp function.

static make_timestamp_ltz(year, month, day, hour=0, minute=0, second=0, timezone=None)[source]

Create timestamp with local timezone.

Parameters:

year (Union[Column, str, int]) – Year component.
month (Union[Column, str, int]) – Month component.
day (Union[Column, str, int]) – Day component.
hour (Union[Column, str, int]) – Hour component (default 0).
minute (Union[Column, str, int]) – Minute component (default 0).
second (Union[Column, str, int]) – Second component (default 0).
timezone (Optional[str]) – Optional timezone string.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the make_timestamp_ltz function.

static make_timestamp_ntz(year, month, day, hour=0, minute=0, second=0)[source]

Create timestamp with no timezone.

Parameters:

year (Union[Column, str, int]) – Year component.
month (Union[Column, str, int]) – Month component.
day (Union[Column, str, int]) – Day component.
hour (Union[Column, str, int]) – Hour component (default 0).
minute (Union[Column, str, int]) – Minute component (default 0).
second (Union[Column, str, int]) – Second component (default 0).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the make_timestamp_ntz function.

static make_interval(years=0, months=0, weeks=0, days=0, hours=0, mins=0, secs=0)[source]

Create interval from components.

Parameters:

years (Union[Column, str, int]) – Years component (default 0).
months (Union[Column, str, int]) – Months component (default 0).
weeks (Union[Column, str, int]) – Weeks component (default 0).
days (Union[Column, str, int]) – Days component (default 0).
hours (Union[Column, str, int]) – Hours component (default 0).
mins (Union[Column, str, int]) – Minutes component (default 0).
secs (Union[Column, str, int]) – Seconds component (default 0).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the make_interval function.

static make_dt_interval(days=0, hours=0, mins=0, secs=0)[source]

Create day-time interval.

Parameters:

days (Union[Column, str, int]) – Days component (default 0).
hours (Union[Column, str, int]) – Hours component (default 0).
mins (Union[Column, str, int]) – Minutes component (default 0).
secs (Union[Column, str, int]) – Seconds component (default 0).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the make_dt_interval function.

static make_ym_interval(years=0, months=0)[source]

Create year-month interval.

Parameters:

years (Union[Column, str, int]) – Years component (default 0).
months (Union[Column, str, int]) – Months component (default 0).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the make_ym_interval function.

static to_number(column, format=None)[source]

Convert string to number.

Parameters:

column (Union[Column, str]) – The column to convert.
format (Optional[str]) – Optional format string.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the to_number function.

static to_binary(column, format=None)[source]

Convert to binary format.

Parameters:

column (Union[Column, str]) – The column to convert.
format (Optional[str]) – Optional format string.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the to_binary function.

static to_unix_timestamp(column, format=None)[source]

Convert to unix timestamp.

Parameters:

column (Union[Column, str]) – The column to convert.
format (Optional[str]) – Optional format string.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the to_unix_timestamp function.

static unix_date(column)[source]

Convert unix timestamp to date.

Parameters:: column (Union[Column, str]) – The unix timestamp column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the unix_date function.

static unix_seconds(column)[source]

Convert timestamp to unix seconds.

Parameters:: column (Union[Column, str]) – The timestamp column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the unix_seconds function.

static unix_millis(column)[source]

Convert timestamp to unix milliseconds.

Parameters:: column (Union[Column, str]) – The timestamp column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the unix_millis function.

static unix_micros(column)[source]

Convert timestamp to unix microseconds.

Parameters:: column (Union[Column, str]) – The timestamp column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the unix_micros function.

static timestamp_millis(column)[source]

Create timestamp from unix milliseconds.

Parameters:: column (Union[Column, str]) – The unix milliseconds column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the timestamp_millis function.

static timestamp_micros(column)[source]

Create timestamp from unix microseconds.

Parameters:: column (Union[Column, str]) – The unix microseconds column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the timestamp_micros function.
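The unix_* and timestamp_* functions are inverses of one another at different resolutions. The sketch below shows the arithmetic on timezone-aware Python datetimes; the helper names are hypothetical, and truncation toward the lower unit (floor) is an assumption.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_unix_micros(ts: datetime) -> int:
    """Whole microseconds since the epoch (expects a timezone-aware datetime)."""
    return (ts - EPOCH) // timedelta(microseconds=1)

def to_unix_millis(ts: datetime) -> int:
    return to_unix_micros(ts) // 1_000

def to_unix_seconds(ts: datetime) -> int:
    return to_unix_micros(ts) // 1_000_000

def from_unix_micros(micros: int) -> datetime:
    return EPOCH + timedelta(microseconds=micros)

def from_unix_millis(millis: int) -> datetime:
    return EPOCH + timedelta(milliseconds=millis)

ts = datetime(2024, 1, 15, 10, 30, 0, 123456, tzinfo=timezone.utc)
# Converting to microseconds and back is lossless.
print(from_unix_micros(to_unix_micros(ts)) == ts)  # True
```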

static to_date(column, format=None)[source]

Convert string, timestamp, or date to date.

Parameters:

column (Union[Column, str]) – The column to convert (StringType, TimestampType, or DateType).
format (Optional[str]) – Optional date format string (only used for StringType input).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the to_date function.

Raises:

TypeError – If input column type is not StringType, TimestampType, or DateType

static to_timestamp(column, format=None)[source]

Convert to timestamp.

Parameters:

column (Union[Column, str]) – The column to convert. Accepts StringType, TimestampType, IntegerType, LongType, DateType, or DoubleType (matching PySpark behavior).
format (Optional[str]) – Optional timestamp format string (used for StringType input).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the to_timestamp function.

Raises:

TypeError – If input column type is not one of the supported types.

static hour(column)[source]

Extract hour from timestamp.

Parameters:: column (Union[Column, str]) – The column to extract hour from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the hour function.

static day(column)[source]

Extract day from date/timestamp.

Parameters:: column (Union[Column, str]) – The column to extract day from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the day function.

static dayofmonth(column)[source]

Extract day of month from date/timestamp (alias for day).

Parameters:: column (Union[Column, str]) – The column to extract day from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the dayofmonth function.

static month(column)[source]

Extract month from date/timestamp.

Parameters:: column (Union[Column, str]) – The column to extract month from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the month function.

static year(column)[source]

Extract year from date/timestamp.

Parameters:: column (Union[Column, str]) – The column to extract year from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the year function.

static dayofweek(column)[source]

Extract day of week from date/timestamp.

Parameters:: column (Union[Column, str]) – The column to extract day of week from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the dayofweek function.

static dayofyear(column)[source]

Extract day of year from date/timestamp.

Parameters:: column (Union[Column, str]) – The column to extract day of year from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the dayofyear function.

static weekofyear(column)[source]

Extract week of year from date/timestamp.

Parameters:: column (Union[Column, str]) – The column to extract week of year from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the weekofyear function.

static quarter(column)[source]

Extract quarter from date/timestamp.

Parameters:: column (Union[Column, str]) – The column to extract quarter from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the quarter function.

static minute(column)[source]

Extract minute from timestamp.

Parameters:: column (Union[Column, str]) – The column to extract minute from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the minute function.

static second(column)[source]

Extract second from timestamp.

Parameters:: column (Union[Column, str]) – The column to extract second from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the second function.

static add_months(column, num_months)[source]

Add months to date/timestamp.

Parameters:

column (Union[Column, str]) – The column to add months to.
num_months (int) – Number of months to add.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the add_months function.
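
The end-of-month case is the part of month arithmetic that most often surprises, so here is a plain-Python sketch of it. The helper name `add_months_py` is hypothetical, and clamping the day to the length of the target month follows PySpark's behaviour; that Sparkless does the same is an assumption, not something this reference states.

```python
import calendar
from datetime import date


def add_months_py(d: date, num_months: int) -> date:
    """Hypothetical helper: shift d by num_months calendar months."""
    # A zero-based month count lets divmod handle year rollover in both directions.
    year, month0 = divmod(d.year * 12 + (d.month - 1) + num_months, 12)
    # Assumption: clamp the day to the last valid day of the target month.
    last = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last))
```

Under that assumption, 31 January 2024 plus one month lands on 29 February 2024 rather than raising an error.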

static months_between(column1, column2)[source]

Calculate months between two dates.

Parameters:

column1 (Union[Column, str]) – The first date column.
column2 (Union[Column, str]) – The second date column.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the months_between function.

static date_add(column, days)[source]

Add days to date.

Parameters:

column (Union[Column, str]) – The column to add days to.
days (int) – Number of days to add.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the date_add function.

static date_sub(column, days)[source]

Subtract days from date.

Parameters:

column (Union[Column, str]) – The column to subtract days from.
days (int) – Number of days to subtract.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the date_sub function.

static date_format(column, format)[source]

Format date/timestamp as string.

Parameters:

column (Union[Column, str]) – The column to format.
format (str) – Date format string (e.g., 'yyyy-MM-dd').

Return type:

ColumnOperation

Returns:

ColumnOperation representing the date_format function.

static from_unixtime(column, format='yyyy-MM-dd HH:mm:ss')[source]

Convert unix timestamp to string.

Parameters:

column (Union[Column, str]) – The column with unix timestamp.
format (str) – Date format string (default: 'yyyy-MM-dd HH:mm:ss').

Return type:

ColumnOperation

Returns:

ColumnOperation representing the from_unixtime function.

static timestampadd(unit, quantity, timestamp)[source]

Add time units to a timestamp.

Parameters:

unit (str) – Time unit (YEAR, QUARTER, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND).
quantity (Union[int, Column]) – Number of units to add (can be column or integer).
timestamp (Union[str, Column]) – Timestamp column or literal.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the timestampadd function.

Example

>>> df.select(F.timestampadd("DAY", 7, F.col("created_at")))
>>> df.select(F.timestampadd("HOUR", F.col("offset"), "2024-01-01"))

static timestampdiff(unit, start, end)[source]

Calculate difference between two timestamps.

Parameters:

unit (str) – Time unit (YEAR, QUARTER, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND).
start (Union[str, Column]) – Start timestamp column or literal.
end (Union[str, Column]) – End timestamp column or literal.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the timestampdiff function.

Example

>>> df.select(F.timestampdiff("DAY", F.col("start_date"), F.col("end_date")))
>>> df.select(F.timestampdiff("HOUR", "2024-01-01", F.col("end_time")))

static convert_timezone(sourceTz, targetTz, sourceTs)[source]

Convert timestamp from source to target timezone.

Parameters:

sourceTz (str)
targetTz (str)
sourceTs (Union[Column, str])

Return type:

static current_timezone()[source]

Get current timezone.

Raises:: RuntimeError – If no active SparkSession is available
Return type:: ColumnOperation

static from_utc_timestamp(ts, tz)[source]

Convert UTC timestamp to given timezone.

Parameters:

ts (Union[Column, str])
tz (str)

Return type:

static to_utc_timestamp(ts, tz)[source]

Convert timestamp from given timezone to UTC.

Parameters:

ts (Union[Column, str])
tz (str)

Return type:

static date_part(field, source)[source]

Extract a field from a date/timestamp.

Parameters:

field (str) – Field to extract (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, etc.).
source (Union[Column, str]) – Date/timestamp column.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the date_part function.

Example

>>> df.select(F.date_part("YEAR", F.col("date")))

static dayname(date)[source]

Get the name of the day of the week.

Parameters:: date (Union[Column, str]) – Date column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the dayname function.

Example

>>> df.select(F.dayname(F.col("date")))

static make_date(year, month, day)[source]

Construct a date from year, month, day integers (PySpark 3.0+).

Parameters:

year (Union[Column, int, str, Literal]) – Year column or integer
month (Union[Column, int, str, Literal]) – Month column or integer (1-12)
day (Union[Column, int, str, Literal]) – Day column or integer (1-31)

Return type:

ColumnOperation

Returns:

ColumnOperation representing the make_date function

Example

>>> df.select(F.make_date(F.lit(2024), F.lit(3), F.lit(15)))

static date_trunc(format, timestamp)[source]

Truncate timestamp to specified unit (year, month, day, hour, etc.).

Parameters:

format (str) – Truncation unit ('year', 'month', 'day', 'hour', 'minute', 'second')
timestamp (Union[Column, str]) – Timestamp column to truncate

Return type:

ColumnOperation

Returns:

ColumnOperation representing the date_trunc function

Example

>>> df.select(F.date_trunc('month', F.col('timestamp')))
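
To make "truncate to a unit" concrete, the sketch below resets every field finer than the requested unit on a standard-library `datetime`. It illustrates the semantics only; `date_trunc_py` and the two lookup tables are hypothetical names, and it covers just the six units listed in the parameter description.

```python
from datetime import datetime

# Units from the parameter description, coarsest first.
_FIELDS = ["year", "month", "day", "hour", "minute", "second"]
# Value each field takes once it has been truncated away.
_RESET = {"month": 1, "day": 1, "hour": 0, "minute": 0, "second": 0}


def date_trunc_py(unit: str, ts: datetime) -> datetime:
    """Hypothetical helper: reset every field finer than unit."""
    keep = _FIELDS.index(unit.lower()) + 1
    finer = {name: _RESET[name] for name in _FIELDS[keep:]}
    return ts.replace(microsecond=0, **finer)
```

For example, truncating 2024-03-15 10:30:45 to 'month' gives 2024-03-01 00:00:00.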

static datediff(end, start)[source]

Returns number of days between two dates.

Parameters:

end (Union[Column, str, Literal]) – End date column or literal
start (Union[Column, str, Literal]) – Start date column or literal

Return type:

ColumnOperation

Returns:

ColumnOperation representing the datediff function

Example

>>> df.select(F.datediff(F.col('end_date'), F.lit('2024-01-01')))
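
Because the end date comes first in the signature, the argument order is easy to get backwards. A minimal plain-Python sketch of the arithmetic follows; `datediff_py` is a hypothetical name, and the negative result when end precedes start follows PySpark's convention, which is assumed here.

```python
from datetime import date


def datediff_py(end: date, start: date) -> int:
    """Hypothetical helper: whole days from start to end (end is the first argument)."""
    return (end - start).days
```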

static date_diff(end, start)[source]

Alias for datediff: returns the number of days between two dates.

Parameters:

end (Union[Column, str]) – End date column
start (Union[Column, str]) – Start date column

Return type:

ColumnOperation

Returns:

ColumnOperation representing the date_diff function

Example

>>> df.select(F.date_diff(F.col('end_date'), F.col('start_date')))

static unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')[source]

Convert timestamp string to Unix timestamp (seconds since epoch).

Parameters:

timestamp (Union[Column, str, None]) – Timestamp column (optional, defaults to current timestamp)
format (str) – Date/time format string

Return type:

ColumnOperation

Returns:

ColumnOperation representing the unix_timestamp function

Example

>>> df.select(F.unix_timestamp(F.col('timestamp'), 'yyyy-MM-dd'))

static last_day(date)[source]

Returns the last day of the month for a given date.

Parameters:: date (Union[Column, str]) – Date column
Return type:: ColumnOperation
Returns:: ColumnOperation representing the last_day function

Example

>>> df.select(F.last_day(F.col('date')))
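
As a point of comparison, the same result can be computed for a single standard-library date with `calendar.monthrange`. This is a sketch of the semantics, not Sparkless's implementation, and `last_day_py` is a hypothetical name.

```python
import calendar
from datetime import date


def last_day_py(d: date) -> date:
    """Hypothetical helper: last calendar day of d's month."""
    # monthrange returns (weekday of the 1st, number of days in the month).
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])
```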

static next_day(date, dayOfWeek)[source]

Returns the first date after the given date that falls on the specified day of the week.

Parameters:

date (Union[Column, str]) – Date column
dayOfWeek (str) – Day of week string (e.g., 'Mon', 'Monday')

Return type:

ColumnOperation

Returns:

ColumnOperation representing the next_day function

Example

>>> df.select(F.next_day(F.col('date'), 'Monday'))
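
The phrase "later than" means the input date itself never qualifies, even when it already falls on the requested weekday. The plain-Python sketch below shows that rule. `next_day_py` and `_DAYS` are hypothetical names, and matching on the first three letters is an assumption made so that both 'Mon' and 'Monday' work.

```python
from datetime import date, timedelta

_DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def next_day_py(d: date, day_of_week: str) -> date:
    """Hypothetical helper: first date strictly after d on the given weekday."""
    target = _DAYS.index(day_of_week[:3].lower())
    # Always move forward by 1 to 7 days, never 0.
    days_ahead = (target - d.weekday() - 1) % 7 + 1
    return d + timedelta(days=days_ahead)
```

Since 1 January 2024 is a Monday, asking for the next Monday from that date yields 8 January 2024.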

static trunc(date, format)[source]

Truncate date to specified unit (year, month, etc.).

Parameters:

date (Union[Column, str]) – Date column
format (str) – Truncation format ('year', 'yyyy', 'yy', 'month', 'mon', 'mm')

Return type:

ColumnOperation

Returns:

ColumnOperation representing the trunc function

Example

>>> df.select(F.trunc(F.col('date'), 'year'))

static timestamp_seconds(col)[source]

Convert seconds since epoch to timestamp (PySpark 3.1+).

Parameters:: col (Union[Column, str, int]) – Column or integer representing seconds since epoch
Return type:: ColumnOperation
Returns:: ColumnOperation representing the timestamp

Example

>>> df.select(F.timestamp_seconds(F.col("seconds")))

static weekday(col)[source]

Get the day of week as an integer (0 = Monday, 6 = Sunday) (PySpark 3.5+).

Parameters:: col (Union[Column, str]) – Column or column name containing date/timestamp values.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the weekday function.

Note

Returns 0 for Monday through 6 for Sunday.
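
The numbering described here happens to match Python's own `date.weekday()`, which gives a quick way to check expected values in tests. The wrapper name `weekday_py` is hypothetical.

```python
from datetime import date


def weekday_py(d: date) -> int:
    """Hypothetical helper: 0 = Monday through 6 = Sunday."""
    return d.weekday()
```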

static extract(field, source)[source]

Extract a field from a date/timestamp column (PySpark 3.5+).

Parameters:

field (str) – The field to extract (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, etc.)
source (Union[Column, str]) – Column or column name containing date/timestamp values.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the extract function.

Example

>>> df.select(F.extract("YEAR", F.col("date")))
>>> df.select(F.extract("MONTH", F.col("timestamp")))

static date_from_unix_date(days)[source]

Convert unix date (days since epoch) to date (PySpark 3.5+).

Parameters:: days (Union[Column, str, int]) – Column or integer representing days since epoch (1970-01-01).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the date_from_unix_date function.

Example

>>> df.select(F.date_from_unix_date(F.col("days")))
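
The conversion is plain day arithmetic from the epoch date given in the parameter description. A minimal standard-library sketch, with the hypothetical names `EPOCH` and `date_from_unix_date_py`:

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def date_from_unix_date_py(days: int) -> date:
    """Hypothetical helper: calendar date that lies `days` days after the epoch."""
    return EPOCH + timedelta(days=days)
```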

static to_timestamp_ltz(timestamp_str, format=None)[source]

Convert string to timestamp with local timezone (PySpark 3.5+).

Parameters:

timestamp_str (Union[Column, str]) – Column or string containing timestamp.
format (Optional[str]) – Optional format string for parsing.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the to_timestamp_ltz function.

Example

>>> df.select(F.to_timestamp_ltz(F.col("ts_str"), "yyyy-MM-dd HH:mm:ss"))

static to_timestamp_ntz(timestamp_str, format=None)[source]

Convert string to timestamp with no timezone (PySpark 3.5+).

Parameters:

timestamp_str (Union[Column, str]) – Column or string containing timestamp.
format (Optional[str]) – Optional format string for parsing.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the to_timestamp_ntz function.

Example

>>> df.select(F.to_timestamp_ntz(F.col("ts_str"), "yyyy-MM-dd HH:mm:ss"))

Array Functions

Array functions for Sparkless.

This module provides comprehensive array manipulation functions that match PySpark’s array function API. It includes set operations (distinct, intersect, union, except) and element operations for working with array columns in DataFrames.

Key Features:

Complete PySpark array function API compatibility
Array set operations (distinct, intersect, union, except)
Element operations (position, remove)
Type-safe operations with proper return types
Support for both column references and array literals

Example

>>> from sparkless.sql import SparkSession, functions as F
>>> spark = SparkSession("test")
>>> data = [{"tags": ["a", "b", "c", "a"]}, {"tags": ["d", "e", "f"]}]
>>> df = spark.createDataFrame(data)
>>> df.select(F.array_distinct(F.col("tags"))).show()
DataFrame[2 rows, 1 columns]
array_distinct(tags)
['a', 'c', 'b']
['e', 'f', 'd']

class sparkless.functions.array.ArrayFunctions[source]

Bases: object

Collection of array manipulation functions.

static array_distinct(column)[source]

Remove duplicate elements from an array, preserving original element type.

Parameters:: column (Union[Column, str]) – The array column to process.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the array_distinct function.

Example

>>> df.select(F.array_distinct(F.col("tags")))

static array_intersect(column1, column2)[source]

Return the intersection of two arrays.

Parameters:

column1 (Union[Column, str]) – First array column.
column2 (Union[Column, str]) – Second array column.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the array_intersect function.

Example

>>> df.select(F.array_intersect(F.col("tags1"), F.col("tags2")))

static array_union(column1, column2)[source]

Return the union of two arrays (with duplicates removed).

Parameters:

column1 (Union[Column, str]) – First array column.
column2 (Union[Column, str]) – Second array column.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the array_union function.

Example

>>> df.select(F.array_union(F.col("tags1"), F.col("tags2")))

static array_except(column1, column2)[source]

Return elements in first array but not in second.

Parameters:

column1 (Union[Column, str]) – First array column.
column2 (Union[Column, str]) – Second array column.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the array_except function.

Example

>>> df.select(F.array_except(F.col("tags1"), F.col("tags2")))
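
The four set-style operations above (distinct, intersect, union, except) can be pictured with ordinary Python lists. The sketch below is illustrative only and all helper names are hypothetical. It keeps first-seen order and drops duplicates from every result, both of which are choices of this sketch; this reference does not specify the element order the real functions return.

```python
def array_distinct_py(xs):
    """Hypothetical helper: drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(xs))


def array_intersect_py(a, b):
    """Hypothetical helper: distinct elements of a that also occur in b."""
    return [x for x in dict.fromkeys(a) if x in b]


def array_union_py(a, b):
    """Hypothetical helper: distinct elements of a followed by new ones from b."""
    return list(dict.fromkeys([*a, *b]))


def array_except_py(a, b):
    """Hypothetical helper: distinct elements of a that do not occur in b."""
    return [x for x in dict.fromkeys(a) if x not in b]
```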

static array_position(column, value)[source]

Return the (1-based) index of the first occurrence of value in the array.

Parameters:

column (Union[Column, str]) – The array column.
value (Any) – The value to find.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the array_position function.

Example

>>> df.select(F.array_position(F.col("tags"), "target"))
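
Since Python lists are 0-based, the 1-based result is worth spelling out. In this plain-Python sketch `array_position_py` is a hypothetical name, and returning 0 when the value is absent follows PySpark's convention, which is assumed rather than taken from this reference.

```python
def array_position_py(xs, value):
    """Hypothetical helper: 1-based index of the first match."""
    for position, item in enumerate(xs, start=1):
        if item == value:
            return position
    return 0  # assumption: 0 signals "not found", as in PySpark
```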

static array_remove(column, value)[source]

Remove all occurrences of a value from the array.

Parameters:

column (Union[Column, str]) – The array column.
value (Any) – The value to remove.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the array_remove function.

Example

>>> df.select(F.array_remove(F.col("tags"), "unwanted"))

static transform(column, function)[source]

Apply a function to each element in the array.

This is a higher-order function that transforms each element of an array using the provided lambda function.

Parameters:

column (Union[Column, str]) – The array column to transform.
function (Callable[[Any], Any]) – Lambda function to apply to each element.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the transform function.

Example

>>> df.select(F.transform(F.col("numbers"), lambda x: x * 2))

static filter(column, function)[source]

Filter array elements based on a predicate function.

This is a higher-order function that filters array elements using the provided lambda function.

Parameters:

column (Union[Column, str]) – The array column to filter.
function (Callable[[Any], bool]) – Lambda function that returns True for elements to keep.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the filter function.

Example

>>> df.select(F.filter(F.col("numbers"), lambda x: x > 10))

static exists(column, function)[source]

Check if any element in the array satisfies the predicate.

This is a higher-order function that returns True if at least one element matches the condition.

Parameters:

column (Union[Column, str]) – The array column to check.
function (Callable[[Any], bool]) – Lambda function predicate.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the exists function.

Example

>>> df.select(F.exists(F.col("numbers"), lambda x: x > 100))

static forall(column, function)[source]

Check if all elements in the array satisfy the predicate.

This is a higher-order function that returns True only if all elements match the condition.

Parameters:

column (Union[Column, str]) – The array column to check.
function (Callable[[Any], bool]) – Lambda function predicate.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the forall function.

Example

>>> df.select(F.forall(F.col("numbers"), lambda x: x > 0))
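
The four higher-order functions above (transform, filter, exists, forall) correspond closely to familiar Python idioms. The sketch below applies them directly to lists to show the intended per-row result. The real functions return a ColumnOperation that is evaluated during DataFrame operations, so this is only an analogy, and the `_py` helper names are hypothetical.

```python
def transform_py(xs, function):
    """Hypothetical helper: apply function to every element."""
    return [function(x) for x in xs]


def filter_py(xs, function):
    """Hypothetical helper: keep the elements for which function returns True."""
    return [x for x in xs if function(x)]


def exists_py(xs, function):
    """Hypothetical helper: True if at least one element satisfies function."""
    return any(function(x) for x in xs)


def forall_py(xs, function):
    """Hypothetical helper: True only if every element satisfies function."""
    return all(function(x) for x in xs)
```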

static aggregate(column, initial_value, merge, finish=None)[source]

Reduce array elements to a single value.

This is a higher-order function that aggregates array elements using an accumulator pattern.

Parameters:

column (Union[Column, str]) – The array column to aggregate.
initial_value (Any) – Starting value for the accumulator.
merge (Callable[[Any, Any], Any]) – Lambda function (acc, x) -> acc that combines accumulator and element.
finish (Optional[Callable[[Any], Any]]) – Optional lambda to transform final accumulator value.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the aggregate function.

Example

>>> df.select(F.aggregate(F.col("nums"), F.lit(0), lambda acc, x: acc + x))
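
The accumulator pattern described above is the same fold that `functools.reduce` performs, with `finish` as an optional post-processing step. A plain-Python sketch of the per-row computation follows; `aggregate_py` is a hypothetical name, not part of Sparkless.

```python
from functools import reduce


def aggregate_py(xs, initial_value, merge, finish=None):
    """Hypothetical helper: fold xs into one value, then optionally post-process it."""
    # merge receives (accumulator, element) and returns the new accumulator.
    acc = reduce(merge, xs, initial_value)
    return finish(acc) if finish is not None else acc
```

With an initial value of 0 and `lambda acc, x: acc + x`, the list [1, 2, 3] folds to 6.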

static zip_with(left, right, function)[source]

Merge two arrays element-wise using a function.

This is a higher-order function that combines elements from two arrays using the provided lambda function.

Parameters:

left (Union[Column, str]) – First array column.
right (Union[Column, str]) – Second array column.
function (Callable[[Any, Any], Any]) – Lambda function (x, y) -> result for combining elements.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the zip_with function.

Example

>>> df.select(F.zip_with(F.col("arr1"), F.col("arr2"), lambda x, y: x + y))

static array_compact(column)[source]

Remove null values from an array.

Parameters:: column (Union[Column, str]) – The array column to compact.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the array_compact function.

Example

>>> df.select(F.array_compact(F.col("nums")))

static slice(column, start, length)[source]

Extract array slice starting at position for given length.

Parameters:

column (Union[Column, str]) – The array column.
start (int) – Starting position (1-based).
length (int) – Number of elements to extract.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the slice function.

Example

>>> df.select(F.slice(F.col("nums"), 2, 3))
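
The 1-based start maps onto Python slicing by subtracting one. The sketch below covers positive start positions only; `slice_py` is a hypothetical name, and rejecting a start below 1 is a limitation of the sketch rather than documented Sparkless behaviour.

```python
def slice_py(xs, start, length):
    """Hypothetical helper: `length` elements beginning at 1-based position `start`."""
    if start < 1:
        raise ValueError("this sketch only handles start >= 1")
    begin = start - 1  # convert to Python's 0-based index
    return xs[begin:begin + length]
```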

static element_at(column, index)[source]

Get element at index (1-based, negative for reverse indexing).

Parameters:

column (Union[Column, str]) – The array column.
index (int) – Position to extract (1-based, negative counts from end).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the element_at function.

Example

>>> df.select(F.element_at(F.col("nums"), 1))  # First element
>>> df.select(F.element_at(F.col("nums"), -1))  # Last element
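
The indexing rule can be sketched on a plain list as follows. `element_at_py` is a hypothetical name. Returning None for an out-of-range index and rejecting index 0 mirror PySpark's default behaviour and are assumptions here, since this reference does not describe those cases.

```python
def element_at_py(xs, index):
    """Hypothetical helper: 1-based lookup; negative indexes count from the end."""
    if index == 0:
        raise ValueError("index 0 is not valid for 1-based lookup")  # assumption
    position = index - 1 if index > 0 else len(xs) + index
    # Assumption: an out-of-range index yields None instead of raising.
    return xs[position] if 0 <= position < len(xs) else None
```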

static array_append(column, element)[source]

Append element to end of array.

Parameters:

column (Union[Column, str]) – The array column.
element (Any) – Element to append.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the array_append function.

Example

>>> df.select(F.array_append(F.col("nums"), 10))

static array_prepend(column, element)[source]

Prepend element to start of array.

Parameters:

column (Union[Column, str]) – The array column.
element (Any) – Element to prepend.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the array_prepend function.

Example

>>> df.select(F.array_prepend(F.col("nums"), 0))

static array_insert(column, pos, value)[source]

Insert element at position in array.

Parameters:

column (Union[Column, str]) – The array column.
pos (int) – Position to insert at (1-based).
value (Any) – Value to insert.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the array_insert function.

Example

>>> df.select(F.array_insert(F.col("nums"), 2, 99))

static array_size(column)[source]

Get array length.

Parameters:: column (Union[Column, str]) – The array column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the array_size function.

Example

>>> df.select(F.array_size(F.col("nums")))

static array_sort(column)[source]

Sort array elements in ascending order.

Parameters:: column (Union[Column, str]) – The array column to sort.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the array_sort function.

Example

>>> df.select(F.array_sort(F.col("nums")))

static array_contains(column, value)[source]

Check if array contains a specific value.

Parameters:

column (Union[Column, str]) – The array column to search.
value (Any) – The value to search for.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the array_contains function.

Example

>>> df.select(F.array_contains(F.col("tags"), "spark"))

static array_max(column)[source]

Return maximum value from array.

Parameters:: column (Union[Column, str]) – The array column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the array_max function.

Example

>>> df.select(F.array_max(F.col("nums")))

static array_min(column)[source]

Return minimum value from array.

Parameters:: column (Union[Column, str]) – The array column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the array_min function.

Example

>>> df.select(F.array_min(F.col("nums")))

static explode(column)[source]

Returns a new row for each element in the given array or map.

Parameters:: column (Union[Column, str]) – The array or map column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the explode function.

Example

>>> df.select(F.explode(F.col("tags")))

static size(column)[source]

Return the size (length) of an array or map.

Parameters:: column (Union[Column, str]) – The array or map column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the size function.

Example

>>> df.select(F.size(F.col("tags")))

static flatten(column)[source]

Flatten array of arrays into a single array.

Parameters:: column (Union[Column, str]) – The array column containing nested arrays.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the flatten function.

Example

>>> df.select(F.flatten(F.col("nested_arrays")))
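
A short plain-Python sketch of what flattening means for one row. `flatten_py` is a hypothetical name, and removing exactly one level of nesting (so deeper levels survive) is PySpark's behaviour, assumed here to apply to Sparkless as well.

```python
def flatten_py(nested):
    """Hypothetical helper: concatenate the inner arrays, removing one nesting level."""
    return [item for inner in nested for item in inner]
```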

static reverse(column)[source]

Reverse the elements of an array.

Parameters:: column (Union[Column, str]) – The array column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the reverse function.

Example

>>> df.select(F.reverse(F.col("nums")))

static arrays_overlap(column1, column2)[source]

Check if two arrays have any common elements.

Parameters:

column1 (Union[Column, str]) – First array column.
column2 (Union[Column, str]) – Second array column.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the arrays_overlap function.

Example

>>> df.select(F.arrays_overlap(F.col("arr1"), F.col("arr2")))

static explode_outer(column)[source]

Returns a new row for each element, including rows with null/empty arrays.

Parameters:: column (Union[Column, str]) – The array or map column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the explode_outer function.

Example

>>> df.select(F.explode_outer(F.col("tags")))
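
The difference between explode and explode_outer only shows up for rows whose array is empty or null. The sketch below models rows as plain dictionaries to make that visible. Both helper names are hypothetical, and emitting a single row with None for the exploded column is how this sketch interprets "including rows with null/empty arrays".

```python
def explode_py(rows, column):
    """Hypothetical helper: one output row per element; empty or null arrays vanish."""
    out = []
    for row in rows:
        for item in row.get(column) or []:
            out.append({**row, column: item})
    return out


def explode_outer_py(rows, column):
    """Hypothetical helper: like explode_py, but keeps empty/null rows with None."""
    out = []
    for row in rows:
        for item in row.get(column) or [None]:
            out.append({**row, column: item})
    return out
```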

static posexplode(column)[source]

Returns a new row for each element with position in array.

Parameters:: column (Union[Column, str]) – The array column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the posexplode function.

static posexplode_outer(column)[source]

Returns a new row for each element with position, including null/empty arrays.

Parameters:: column (Union[Column, str]) – The array column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the posexplode_outer function.

static arrays_zip(*columns)[source]

Merge arrays into array of structs (alias for array_zip).

Parameters:: *columns (Union[Column, str]) – Array columns to zip together.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the arrays_zip function.

static sequence(start, stop, step=1)[source]

Generate array of integers from start to stop by step.

Parameters:

start (Union[Column, str, int]) – Starting value
stop (Union[Column, str, int]) – Ending value
step (Union[Column, str, int]) – Step size (default 1)

Return type:

ColumnOperation

Returns:

ColumnOperation representing the sequence function.

static shuffle(column)[source]

Randomly shuffle array elements.

Parameters:: column (Union[Column, str]) – The array column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the shuffle function.

static array(*cols)[source]

Create array from multiple columns (PySpark 3.0+).

Parameters:: *cols (Union[Column, str, List[Union[Column, str]]]) – Variable number of columns to combine into an array. Supported call forms:
F.array("Name", "Type") – string column names
F.array(["Name", "Type"]) – list of string column names
F.array(F.col("Name"), F.col("Type")) – Column objects
F.array([F.col("Name"), F.col("Type")]) – list of Column objects
Return type:: ColumnOperation
Returns:: ColumnOperation representing the array function.

Example

>>> df.select(F.array(F.col("a"), F.col("b"), F.col("c")))
>>> df.select(F.array(["a", "b", "c"]))  # List format
>>> df.select(F.array())  # Returns empty array [] (Issue #367)
>>> df.select(F.array([]))  # Returns empty array [] (Issue #367)

static array_repeat(col, count)[source]

Create array by repeating value N times (PySpark 3.0+).

Parameters:

col (Union[Column, str]) – Value to repeat
count (int) – Number of repetitions

Return type:

ColumnOperation

Returns:

ColumnOperation representing the array_repeat function.

Example

>>> df.select(F.array_repeat(F.col("value"), 3))

static sort_array(col, asc=True)[source]

Sort array elements (PySpark 3.0+).

Parameters:

col (Union[Column, str]) – Array column to sort
asc (bool) – Sort ascending if True, descending if False

Return type:

ColumnOperation

Returns:

ColumnOperation representing the sort_array function.

Example

>>> df.select(F.sort_array(F.col("values"), asc=False))

static array_agg(col)[source]

Aggregate function to collect values into an array (PySpark 3.5+).

Parameters:: col (Union[Column, str]) – Column to aggregate into an array
Return type:: AggregateFunction
Returns:: AggregateFunction representing the array_agg function.

Example

>>> df.groupBy("dept").agg(F.array_agg("name"))

static cardinality(col)[source]

Return the size of an array or map (PySpark 3.5+).

Parameters:: col (Union[Column, str]) – Array or map column
Return type:: ColumnOperation
Returns:: ColumnOperation representing the cardinality function.

Example

>>> df.select(F.cardinality(F.col("array_col")))

static get(col, key)[source]

Get element from array by index or map by key.

Parameters:

col (Union[Column, str]) – Array or map column.
key (Union[Column, str, int, Any]) – Index (for arrays) or key (for maps).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the get function.

static inline(col)[source]

Explode array of structs into rows.

Parameters:: col (Union[Column, str]) – Array of structs column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the inline function.

static inline_outer(col)[source]

Explode array of structs into rows (outer join style - preserves nulls).

Parameters:: col (Union[Column, str]) – Array of structs column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the inline_outer function.

Map Functions

Map functions for Sparkless.

This module provides comprehensive map manipulation functions that match PySpark’s map function API. It includes operations for extracting keys, values, and entries, and for combining maps, when working with map columns in DataFrames.

Key Features:

Complete PySpark map function API compatibility
Key/value extraction (map_keys, map_values)
Entry operations (map_entries)
Map combination (map_concat, map_from_arrays)
Type-safe operations with proper return types
Support for both column references and map literals

Example

>>> from sparkless.sql import SparkSession, functions as F
>>> spark = SparkSession("test")
>>> data = [{"properties": {"key1": "val1", "key2": "val2"}}]
>>> df = spark.createDataFrame(data)
>>> df.select(F.map_keys(F.col("properties"))).show()
DataFrame[1 rows, 1 columns]
map_keys(properties)
['key1', 'key2']

class sparkless.functions.map.MapFunctions[source]

Bases: object

Collection of map manipulation functions.

static map_keys(column)[source]

Return an array of all keys in the map.

Parameters:: column (Union[Column, str]) – The map column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the map_keys function.

Example

>>> df.select(F.map_keys(F.col("properties")))

static map_values(column)[source]

Return an array of all values in the map.

Parameters:: column (Union[Column, str]) – The map column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the map_values function.

Example

>>> df.select(F.map_values(F.col("properties")))

static map_entries(column)[source]

Return an array of structs with key-value pairs.

Parameters:: column (Union[Column, str]) – The map column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the map_entries function.

Example

>>> df.select(F.map_entries(F.col("properties")))

static map_concat(*columns)[source]

Concatenate multiple maps into a single map.

Parameters:: *columns (Union[Column, str]) – Map columns to concatenate.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the map_concat function.

Example

>>> df.select(F.map_concat(F.col("map1"), F.col("map2"), F.col("map3")))

static map_from_arrays(keys, values)[source]

Create a map from two arrays (keys and values).

Parameters:

keys (Union[Column, str]) – Array column containing keys.
values (Union[Column, str]) – Array column containing values.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the map_from_arrays function.

Example

>>> df.select(F.map_from_arrays(F.col("keys"), F.col("values")))
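
For a single row, the pairing is positional: the first key goes with the first value, and so on. A plain-Python sketch follows, with the hypothetical name `map_from_arrays_py`. Raising on arrays of different lengths is an assumption of the sketch; this reference does not say how Sparkless handles that case.

```python
def map_from_arrays_py(keys, values):
    """Hypothetical helper: pair keys and values by position."""
    if len(keys) != len(values):
        # Assumption: mismatched lengths are treated as an error.
        raise ValueError("keys and values must have the same length")
    return dict(zip(keys, values))
```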

static create_map(*cols)[source]

Create a map from key-value pairs.

Parameters:: *cols (Union[Column, str, Any]) – Alternating key-value columns/literals. If no arguments are provided, returns an empty map {}.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the create_map function.

Example

>>> df.select(F.create_map(F.col("k1"), F.col("v1"), F.col("k2"), F.col("v2")))
>>> df.select(F.create_map())  # Returns empty map {}

static map_contains_key(column, key)[source]

Check if map contains a specific key.

Parameters:

column (Union[Column, str]) – The map column.
key (Any) – The key to check for.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the map_contains_key function.

Example

>>> df.select(F.map_contains_key(F.col("map"), "key"))

static map_from_entries(column)[source]

Convert array of key-value structs to map.

Parameters:: column (Union[Column, str]) – Array column containing structs with 'key' and 'value' fields.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the map_from_entries function.

Example

>>> df.select(F.map_from_entries(F.col("entries")))

static map_filter(column, function)[source]

Filter map entries based on key-value predicate.

This is a higher-order function that filters map entries using the provided lambda function.

Parameters:

column (Union[Column, str]) – The map column to filter.
function (Callable[[Any, Any], bool]) – Lambda function (key, value) -> bool that returns True for entries to keep.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the map_filter function.

Example

>>> df.select(F.map_filter(F.col("map"), lambda k, v: v > 10))

static transform_keys(column, function)[source]

Transform map keys using a function.

This is a higher-order function that transforms map keys using the provided lambda function.

Parameters:

column (Union[Column, str]) – The map column.
function (Callable[[Any, Any], Any]) – Lambda function (key, value) -> new_key to transform keys.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the transform_keys function.

Example

>>> df.select(F.transform_keys(F.col("map"), lambda k, v: F.upper(k)))

static transform_values(column, function)[source]

Transform map values using a function.

This is a higher-order function that transforms map values using the provided lambda function.

Parameters:

column (Union[Column, str]) – The map column.
function (Callable[[Any, Any], Any]) – Lambda function (key, value) -> new_value to transform values.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the transform_values function.

Example

>>> df.select(F.transform_values(F.col("map"), lambda k, v: v * 2))
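
The semantics of the three higher-order map functions above (map_filter, transform_keys, transform_values) can be pictured with ordinary Python dictionaries. The following is a minimal sketch, not the Sparkless implementation: the helper names are hypothetical stand-ins that operate on a single dict, and the lambdas take plain Python values rather than Column expressions.

```python
def map_filter(m, predicate):
    """Keep only the entries for which predicate(key, value) is true."""
    return {k: v for k, v in m.items() if predicate(k, v)}


def transform_keys(m, fn):
    """Replace each key with fn(key, value); values are unchanged."""
    return {fn(k, v): v for k, v in m.items()}


def transform_values(m, fn):
    """Replace each value with fn(key, value); keys are unchanged."""
    return {k: fn(k, v) for k, v in m.items()}


scores = {"a": 5, "b": 20, "c": 30}
print(map_filter(scores, lambda k, v: v > 10))
print(transform_keys(scores, lambda k, v: k.upper()))
print(transform_values(scores, lambda k, v: v * 2))
```

In each case the lambda receives both the key and the value, even when only one of them is used.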

static map_zip_with(col1, col2, function)[source]

Merge two maps into a single map using a function (PySpark 3.1+).

This is a higher-order function that combines two maps by applying the provided lambda function to matching keys.

Parameters:

col1 (Union[Column, str]) – The first map column.
col2 (Union[Column, str]) – The second map column.
function (Callable[[Any, Any, Any], Any]) – Lambda function (key, value1, value2) -> new_value to combine values.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the map_zip_with function.

Example

>>> df.select(F.map_zip_with(F.col("map1"), F.col("map2"), lambda k, v1, v2: v1 + v2))
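
As a rough illustration of the merge semantics, the sketch below combines two plain Python dicts. It assumes the PySpark behaviour in which a key present in only one map is passed to the function with a null (None) value for the other side; whether Sparkless matches this in every edge case is an assumption, and the helper name is hypothetical.

```python
def map_zip_with(m1, m2, fn):
    """Merge two dicts over the union of their keys using fn(key, v1, v2)."""
    keys = list(m1) + [k for k in m2 if k not in m1]
    # dict.get returns None for a key that is missing on one side.
    return {k: fn(k, m1.get(k), m2.get(k)) for k in keys}


def add_treating_null_as_zero(key, v1, v2):
    return (v1 or 0) + (v2 or 0)


merged = map_zip_with({"x": 1, "y": 2}, {"y": 10, "z": 20}, add_treating_null_as_zero)
print(merged)
```

Because missing values arrive as None, a combining function such as v1 + v2 needs explicit null handling when the two maps do not share all keys.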

static str_to_map(column, pair_delim=',', key_value_delim=':')[source]

Convert string to map using delimiters.

Parameters:

column (Union[Column, str]) – The string column to convert.
pair_delim (Optional[str]) – Delimiter between key-value pairs (default ‘,’).
key_value_delim (Optional[str]) – Delimiter between key and value (default ‘:’).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the str_to_map function.
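
The parsing that str_to_map performs can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library code: delimiters are treated as literal strings (PySpark interprets them as regular expressions), and a pair with no key-value delimiter maps its key to None.

```python
def str_to_map(text, pair_delim=",", key_value_delim=":"):
    """Split text into pairs, then split each pair into a key and a value."""
    result = {}
    for pair in text.split(pair_delim):
        key, sep, value = pair.partition(key_value_delim)
        # No key-value delimiter found: keep the key with a null value.
        result[key] = value if sep else None
    return result


print(str_to_map("a:1,b:2"))
print(str_to_map("k1=v1;k2=v2", pair_delim=";", key_value_delim="="))
```

Both keys and values stay strings; cast them afterwards if numeric values are needed.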

Aggregate Functions

Aggregate functions for Sparkless.

This module provides aggregate functions that match PySpark’s aggregate function API, including statistical operations, counting functions, and data summarization operations for grouped data in DataFrames.

Key Features:

Complete PySpark aggregate function API compatibility
Basic aggregates (count, sum, avg, max, min)
Statistical functions (stddev, variance, skewness, kurtosis)
Collection aggregates (collect_list, collect_set, first, last)
Distinct counting (countDistinct)
Type-safe operations with proper return types
Support for both column references and expressions
Proper handling of null values and edge cases

Example

>>> from sparkless.sql import SparkSession, functions as F
>>> spark = SparkSession("test")
>>> data = [{"dept": "IT", "salary": 50000}, {"dept": "IT", "salary": 60000}]
>>> df = spark.createDataFrame(data)
>>> grouped = df.groupBy("dept")
>>> result = grouped.agg(
...     F.count("*").alias("count"),
...     F.avg("salary").alias("avg_salary"),
...     F.max("salary").alias("max_salary")
... )
>>> result.show()
DataFrame[1 rows, 4 columns]
dept count avg_salary max_salary
IT 2 55000.0 60000
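
To make the output above concrete, the same grouped aggregation can be written in plain Python. This is only an illustrative sketch of what count, avg and max compute per group, not how Sparkless evaluates aggregates; the helper name aggregate_by_dept is hypothetical.

```python
from collections import defaultdict


def aggregate_by_dept(rows):
    """Group rows by "dept" and compute count, average and max of "salary"."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["dept"]].append(row["salary"])
    return {
        dept: {
            "count": len(salaries),
            "avg_salary": sum(salaries) / len(salaries),
            "max_salary": max(salaries),
        }
        for dept, salaries in groups.items()
    }


data = [{"dept": "IT", "salary": 50000}, {"dept": "IT", "salary": 60000}]
print(aggregate_by_dept(data))
```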

class sparkless.functions.aggregate.AggregateFunctions[source]

Bases: object

Collection of aggregate functions.

static count(column=None)[source]

Count non-null values.

Parameters:: column (Union[Column, str, None]) – The column to count (None for count(*)).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the count function (PySpark-compatible).
Raises:: RuntimeError – If no active SparkSession is available

static sum(column)[source]

Sum values.

Parameters:: column (Union[Column, str]) – The column to sum.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the sum function (PySpark-compatible).
Raises:: RuntimeError – If no active SparkSession is available

static avg(column)[source]

Average values.

Parameters:: column (Union[Column, str]) – The column to average.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the avg function (PySpark-compatible).
Raises:: RuntimeError – If no active SparkSession is available

static max(column)[source]

Maximum value.

Parameters:: column (Union[Column, str]) – The column to get max of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the max function (PySpark-compatible).
Raises:: RuntimeError – If no active SparkSession is available

static min(column)[source]

Minimum value.

Parameters:: column (Union[Column, str]) – The column to get min of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the min function (PySpark-compatible).
Raises:: RuntimeError – If no active SparkSession is available

static first(column, ignorenulls=False)[source]

First value.

Parameters:

column (Union[Column, str]) – The column to get first value of.
ignorenulls (bool) – If True, ignore null values and return first non-null value. If False (default), return first value even if it’s null.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the first function.

Raises:

RuntimeError – If no active SparkSession is available
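
The effect of ignorenulls is easiest to see on a small list. The sketch below is a plain-Python stand-in for the behaviour described above (the function operates on a list, not a column, and is not the Sparkless implementation).

```python
def first(values, ignorenulls=False):
    """Return the first value, optionally skipping None entries."""
    if ignorenulls:
        return next((v for v in values if v is not None), None)
    return values[0] if values else None


column = [None, 3, 4]
print(first(column))                    # the leading null is returned as-is
print(first(column, ignorenulls=True))  # the first non-null value
```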

static last(column)[source]

Last value.

Parameters:: column (Union[Column, str]) – The column to get last value of.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the last function.
Raises:: RuntimeError – If no active SparkSession is available

static collect_list(column)[source]

Collect values into a list.

Parameters:: column (Union[Column, str]) – The column to collect.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the collect_list function.
Raises:: RuntimeError – If no active SparkSession is available

static collect_set(column)[source]

Collect unique values into a set.

Parameters:: column (Union[Column, str]) – The column to collect.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the collect_set function.
Raises:: RuntimeError – If no active SparkSession is available

static stddev(column)[source]

Standard deviation.

Parameters:: column (Union[Column, str]) – The column to get stddev of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the stddev function (PySpark-compatible).
Raises:: RuntimeError – If no active SparkSession is available

static std(column)[source]

Alias for stddev - Standard deviation.

Parameters:: column (Union[Column, str]) – The column to get stddev of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the std function.
Raises:: RuntimeError – If no active SparkSession is available

static product(column)[source]

Multiply all values in the column.

Parameters:: column (Union[Column, str]) – The column to multiply values of.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the product function.
Raises:: RuntimeError – If no active SparkSession is available

static sum_distinct(column)[source]

Sum of distinct values.

Parameters:: column (Union[Column, str]) – The column to sum distinct values of.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the sum_distinct function.
Raises:: RuntimeError – If no active SparkSession is available

static variance(column)[source]

Variance.

Parameters:: column (Union[Column, str]) – The column to get variance of.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the variance function (PySpark-compatible).
Raises:: RuntimeError – If no active SparkSession is available

static skewness(column)[source]

Skewness.

Parameters:: column (Union[Column, str]) – The column to get skewness of.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the skewness function.
Raises:: RuntimeError – If no active SparkSession is available

static kurtosis(column)[source]

Kurtosis.

Parameters:: column (Union[Column, str]) – The column to get kurtosis of.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the kurtosis function.
Raises:: RuntimeError – If no active SparkSession is available

static countDistinct(column)[source]

Count distinct values.

Parameters:: column (Union[Column, str]) – The column to count distinct values of.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the countDistinct function.
Raises:: RuntimeError – If no active SparkSession is available

static count_distinct(column)[source]

Alias for countDistinct - Count distinct values.

Parameters:: column (Union[Column, str]) – The column to count distinct values of.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the count_distinct function.
Raises:: RuntimeError – If no active SparkSession is available

static percentile_approx(column, percentage, accuracy=10000)[source]

Approximate percentile.

Parameters:

column (Union[Column, str]) – The column to get percentile of.
percentage (float) – The percentage (0.0 to 1.0).
accuracy (int) – The accuracy parameter.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the percentile_approx function.

Raises:

RuntimeError – If no active SparkSession is available

static corr(column1, column2)[source]

Correlation between two columns.

Parameters:

column1 (Union[Column, str]) – The first column.
column2 (Union[Column, str]) – The second column.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the corr function (PySpark-compatible).

Raises:

RuntimeError – If no active SparkSession is available

static covar_samp(column1, column2)[source]

Sample covariance between two columns.

Parameters:

column1 (Union[Column, str]) – The first column.
column2 (Union[Column, str]) – The second column.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the covar_samp function (PySpark-compatible).

Raises:

RuntimeError – If no active SparkSession is available
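
For reference, corr and covar_samp correspond to the textbook Pearson correlation and sample covariance. The following plain-Python sketch shows the arithmetic on two lists; it assumes equal-length inputs with no nulls and is not the Sparkless implementation.

```python
import math


def covar_samp(xs, ys):
    """Sample covariance: sum of co-deviations divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def corr(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]  # ys is exactly 2 * xs, so the correlation is 1
print(covar_samp(xs, ys), corr(xs, ys))
```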

static bool_and(column)[source]

Aggregate AND - returns true if all values are true (PySpark 3.1+).

Parameters:: column (Union[Column, str]) – Column containing boolean values.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the bool_and function.
Raises:: RuntimeError – If no active SparkSession is available

static bool_or(column)[source]

Aggregate OR - returns true if any value is true (PySpark 3.1+).

Parameters:: column (Union[Column, str]) – Column containing boolean values.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the bool_or function.
Raises:: RuntimeError – If no active SparkSession is available

static every(column)[source]

Alias for bool_and (PySpark 3.1+).

Parameters:: column (Union[Column, str]) – Column containing boolean values.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the every function.
Raises:: RuntimeError – If no active SparkSession is available

static some(column)[source]

Alias for bool_or (PySpark 3.1+).

Parameters:: column (Union[Column, str]) – Column containing boolean values.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the some function.
Raises:: RuntimeError – If no active SparkSession is available

static max_by(column, ord)[source]

Return the value of column from the row in which ord is at its maximum (PySpark 3.1+).

Parameters:

column (Union[Column, str]) – Column to return value from.
ord (Union[Column, str]) – Column to find maximum of.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the max_by function.

Raises:

RuntimeError – If no active SparkSession is available

static min_by(column, ord)[source]

Return the value of column from the row in which ord is at its minimum (PySpark 3.1+).

Parameters:

column (Union[Column, str]) – Column to return value from.
ord (Union[Column, str]) – Column to find minimum of.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the min_by function.

Raises:

RuntimeError – If no active SparkSession is available
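
max_by and min_by return a value from one column, selected by the extreme of another. A minimal plain-Python sketch of that idea follows; the helper signatures are hypothetical, and rows whose ordering value is None are skipped here as an assumption.

```python
def max_by(rows, column, ord_column):
    """Value of `column` in the row where `ord_column` is largest."""
    candidates = [r for r in rows if r[ord_column] is not None]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r[ord_column])[column]


def min_by(rows, column, ord_column):
    """Value of `column` in the row where `ord_column` is smallest."""
    candidates = [r for r in rows if r[ord_column] is not None]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r[ord_column])[column]


rows = [
    {"name": "Ann", "salary": 50000},
    {"name": "Bob", "salary": 70000},
    {"name": "Cy", "salary": 60000},
]
print(max_by(rows, "name", "salary"), min_by(rows, "name", "salary"))
```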

static count_if(column)[source]

Count rows where condition is true (PySpark 3.1+).

Parameters:: column (Union[Column, str]) – Boolean column or condition.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the count_if function.
Raises:: RuntimeError – If no active SparkSession is available

static any_value(column)[source]

Return any non-null value (non-deterministic) (PySpark 3.1+).

Parameters:: column (Union[Column, str]) – Column to return value from.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the any_value function.
Raises:: RuntimeError – If no active SparkSession is available

static mean(column)[source]

Aggregate function: returns the mean of the values (alias for avg).

Parameters:: column (Union[Column, str]) – Numeric column.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the mean function.
Raises:: RuntimeError – If no active SparkSession is available

static approx_count_distinct(column, rsd=None)[source]

Returns the approximate count of distinct elements. approxCountDistinct is the deprecated alias for this function.

Parameters:

column (Union[Column, str]) – Column to count distinct values.
rsd (Optional[float]) – Optional relative standard deviation (default: None, which uses PySpark’s default of 0.05). Controls the approximation accuracy. Lower values provide better accuracy but use more memory. Typical values range from 0.01 (1% error) to 0.1 (10% error).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the approx_count_distinct function (PySpark-compatible).

Raises:

RuntimeError – If no active SparkSession is available

static stddev_pop(column)[source]

Returns population standard deviation.

Parameters:: column (Union[Column, str]) – Numeric column.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the stddev_pop function.
Raises:: RuntimeError – If no active SparkSession is available

static stddev_samp(column)[source]

Returns sample standard deviation.

Parameters:: column (Union[Column, str]) – Numeric column.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the stddev_samp function.
Raises:: RuntimeError – If no active SparkSession is available

static var_pop(column)[source]

Returns population variance.

Parameters:: column (Union[Column, str]) – Numeric column.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the var_pop function.
Raises:: RuntimeError – If no active SparkSession is available

static var_samp(column)[source]

Returns sample variance.

Parameters:: column (Union[Column, str]) – Numeric column.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the var_samp function.
Raises:: RuntimeError – If no active SparkSession is available
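
The difference between the population and sample variants is the divisor: n for the population forms, n - 1 for the sample forms. The sketch below uses Python's statistics module to show the four quantities on a small list. In PySpark, stddev and variance are aliases for the sample versions; that Sparkless does the same is an assumption here.

```python
import statistics

salaries = [50000, 60000, 70000]

var_samp = statistics.variance(salaries)   # divides by n - 1
var_pop = statistics.pvariance(salaries)   # divides by n
stddev_samp = statistics.stdev(salaries)   # square root of var_samp
stddev_pop = statistics.pstdev(salaries)   # square root of var_pop

print(var_samp, var_pop, stddev_samp, stddev_pop)
```

For any group with more than one distinct value, the sample results are larger than the population results.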

static covar_pop(column1, column2)[source]

Returns population covariance.

Parameters:

column1 (Union[Column, str]) – First numeric column.
column2 (Union[Column, str]) – Second numeric column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the covar_pop function.

static median(column)[source]

Returns the median value (PySpark 3.4+).

Parameters:: column (Union[Column, str]) – Numeric column.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the median function.
Raises:: RuntimeError – If no active SparkSession is available

static mode(column)[source]

Returns the most frequent value (mode) (PySpark 3.4+).

Parameters:: column (Union[Column, str]) – Column to find mode of.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the mode function.
Raises:: RuntimeError – If no active SparkSession is available

static percentile(column, percentage)[source]

Returns the exact percentile value (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – Numeric column.
percentage (float) – Percentile to compute (between 0.0 and 1.0).

Return type:

AggregateFunction

Returns:

AggregateFunction representing the percentile function.
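
An exact percentile can be computed by sorting the values and interpolating linearly between the two nearest ranks. The plain-Python sketch below shows that calculation; it assumes linear interpolation (as Spark's exact percentile uses) and skips nulls, and it is not taken from the Sparkless source.

```python
def percentile(values, percentage):
    """Exact percentile with linear interpolation between neighbouring ranks."""
    data = sorted(v for v in values if v is not None)
    if not data:
        return None
    position = percentage * (len(data) - 1)   # fractional index into data
    lower = int(position)
    upper = min(lower + 1, len(data) - 1)
    fraction = position - lower
    return data[lower] + (data[upper] - data[lower]) * fraction


print(percentile([10, 20, 30, 40], 0.5))   # halfway between 20 and 30
```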

static approxCountDistinct(*cols)[source]

Deprecated alias for approx_count_distinct (all PySpark versions).

Use approx_count_distinct instead.

Parameters:: cols (Union[Column, str]) – Columns to count distinct values for. Only the first column is used.
Return type:: AggregateFunction
Returns:: AggregateFunction for approximate distinct count.

static sumDistinct(column)[source]

Deprecated alias for sum_distinct (PySpark 3.2+).

Use sum_distinct instead (or sum(distinct(col)) for earlier versions).

Parameters:: column (Union[Column, str]) – Numeric column to sum.
Return type:: AggregateFunction
Returns:: AggregateFunction for distinct sum.

static regr_avgx(y, x)[source]

Linear regression average of x.

Parameters:

y (Union[Column, str]) – The y column.
x (Union[Column, str]) – The x column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the regr_avgx function.

Raises:

RuntimeError – If no active SparkSession is available

static regr_avgy(y, x)[source]

Linear regression average of y.

Parameters:

y (Union[Column, str]) – The y column.
x (Union[Column, str]) – The x column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the regr_avgy function.

Raises:

RuntimeError – If no active SparkSession is available

static regr_count(y, x)[source]

Linear regression count.

Parameters:

y (Union[Column, str]) – The y column.
x (Union[Column, str]) – The x column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the regr_count function.

Raises:

RuntimeError – If no active SparkSession is available

static regr_intercept(y, x)[source]

Linear regression intercept.

Parameters:

y (Union[Column, str]) – The y column.
x (Union[Column, str]) – The x column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the regr_intercept function.

Raises:

RuntimeError – If no active SparkSession is available

static regr_r2(y, x)[source]

Linear regression R-squared.

Parameters:

y (Union[Column, str]) – The y column.
x (Union[Column, str]) – The x column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the regr_r2 function.

Raises:

RuntimeError – If no active SparkSession is available

static regr_slope(y, x)[source]

Linear regression slope.

Parameters:

y (Union[Column, str]) – The y column.
x (Union[Column, str]) – The x column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the regr_slope function.

Raises:

RuntimeError – If no active SparkSession is available

static regr_sxx(y, x)[source]

Linear regression sum of squares of x.

Parameters:

y (Union[Column, str]) – The y column.
x (Union[Column, str]) – The x column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the regr_sxx function.

Raises:

RuntimeError – If no active SparkSession is available

static regr_sxy(y, x)[source]

Linear regression sum of products.

Parameters:

y (Union[Column, str]) – The y column.
x (Union[Column, str]) – The x column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the regr_sxy function.

Raises:

RuntimeError – If no active SparkSession is available

static regr_syy(y, x)[source]

Linear regression sum of squares of y.

Parameters:

y (Union[Column, str]) – The y column.
x (Union[Column, str]) – The x column.

Return type:

AggregateFunction

Returns:

AggregateFunction representing the regr_syy function.

Raises:

RuntimeError – If no active SparkSession is available
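
The regr_* functions above all derive from the same least-squares quantities, with y as the dependent variable (first argument) and x as the independent variable (second argument). The following plain-Python sketch computes them together for illustration. It assumes that pairs with a null on either side are ignored, as in SQL; the helper name regr_stats is hypothetical.

```python
def regr_stats(ys, xs):
    """Least-squares regression statistics for y as a function of x."""
    pairs = [(y, x) for y, x in zip(ys, xs) if y is not None and x is not None]
    n = len(pairs)
    if n == 0:
        return {"count": 0}
    avgy = sum(y for y, _ in pairs) / n
    avgx = sum(x for _, x in pairs) / n
    sxx = sum((x - avgx) ** 2 for _, x in pairs)
    syy = sum((y - avgy) ** 2 for y, _ in pairs)
    sxy = sum((x - avgx) * (y - avgy) for y, x in pairs)
    slope = sxy / sxx if sxx else None
    intercept = avgy - slope * avgx if slope is not None else None
    r2 = sxy ** 2 / (sxx * syy) if sxx and syy else None
    return {
        "count": n, "avgx": avgx, "avgy": avgy,
        "sxx": sxx, "syy": syy, "sxy": sxy,
        "slope": slope, "intercept": intercept, "r2": r2,
    }


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # y = 2x + 1 exactly
stats = regr_stats(ys, xs)
print(stats["slope"], stats["intercept"], stats["r2"])
```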

static approx_percentile(column, percentage, accuracy=10000)[source]

Compute approximate percentile (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – The column to compute percentile for.
percentage (Union[float, Column, str]) – The percentage (0.0 to 1.0) or array of percentages.
accuracy (Union[int, Column, str]) – The accuracy parameter (default: 10000).

Return type:

AggregateFunction

Returns:

AggregateFunction representing the approx_percentile function.

Example

>>> df.groupBy("dept").agg(F.approx_percentile(F.col("salary"), 0.5))

Conditional Functions

Conditional functions for Sparkless.

This module contains conditional functions including CASE WHEN expressions.

sparkless.functions.conditional.validate_rule(column, rule)[source]

Convert validation rule to column expression.

Parameters:

column (Union[Column, str]) – The column to validate.
rule (Union[str, List[Any]]) – Validation rule as string or list.

Return type:

Returns:

Column expression for the validation rule.

Raises:

ValueError – If rule is not recognized.

class sparkless.functions.conditional.CaseWhen(column=None, condition=None, value=None)[source]

Bases: object

Represents a CASE WHEN expression.

This class handles complex conditional logic with multiple conditions and default values, similar to SQL CASE WHEN statements.

Parameters:

column (Any)
condition (Any)
value (Any)

__init__(column=None, condition=None, value=None)[source]

Initialize CaseWhen.

Parameters:

column (Any) – The column or expression being evaluated.
condition (Any) – The condition for this case.
value (Any) – The value to return if condition is true.

property else_value: Any: Get the else value (alias for default_value for compatibility).

when(condition, value)[source]

Add another WHEN condition.

Parameters:

condition (Any) – The condition to check.
value (Any) – The value to return if condition is true.

Return type:

CaseWhen

Returns:

Self for method chaining.

otherwise(value)[source]

Set the default value for the CASE WHEN expression.

Parameters:: value (Any) – The default value to return if no conditions match.
Return type:: CaseWhen
Returns:: Self for method chaining.

alias(name)[source]

Create an alias for the CASE WHEN expression.

Parameters:: name (str) – The alias name.
Return type:: CaseWhen
Returns:: Self for method chaining.

cast(data_type)[source]

Cast the CASE WHEN expression to a different data type.

Parameters:: data_type (Any) – The target data type (DataType instance or string type name).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the cast operation.

Example

>>> F.when(F.col("value") == "A", F.lit(100)).otherwise(F.lit(200)).cast("long")

__add__(other)[source]

Addition operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__sub__(other)[source]

Subtraction operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__mul__(other)[source]

Multiplication operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__truediv__(other)[source]

Division operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__mod__(other)[source]

Modulo operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__radd__(other)[source]

Reverse addition operation (for 2 + case_when).

Parameters:: other (Any)
Return type:: ColumnOperation

__rsub__(other)[source]

Reverse subtraction operation (for 2 - case_when).

Parameters:: other (Any)
Return type:: ColumnOperation

__rmul__(other)[source]

Reverse multiplication operation (for 2 * case_when).

Parameters:: other (Any)
Return type:: ColumnOperation

__rtruediv__(other)[source]

Reverse division operation (for 2 / case_when).

Parameters:: other (Any)
Return type:: ColumnOperation

__rmod__(other)[source]

Reverse modulo operation (for 2 % case_when).

Parameters:: other (Any)
Return type:: ColumnOperation

__or__(other)[source]

Bitwise OR operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__and__(other)[source]

Bitwise AND operation (PySpark-compatible).

Parameters:: other (Any)
Return type:: ColumnOperation

__invert__()[source]

Bitwise NOT operation (unary ~, PySpark-compatible).

Return type:: ColumnOperation

evaluate(row)[source]

Evaluate the CASE WHEN expression for a given row.

Parameters:: row (Dict[str, Any]) – The data row to evaluate against.
Return type:: Any
Returns:: The evaluated result.

get_result_type()[source]

Infer the result type from condition values.

Return type:: DataType

class sparkless.functions.conditional.ConditionalFunctions[source]

Bases: object

Collection of conditional functions.

static coalesce(*columns)[source]

Return the first non-null value from a list of columns.

Parameters:: *columns (Union[Column, str, Any]) – Variable number of columns or values to check.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the coalesce function.

static isnull(column)[source]

Check if a column is null.

Parameters:: column (Union[Column, str]) – The column to check.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the isnull function.

static isnotnull(column)[source]

Check if a column is not null.

Parameters:: column (Union[Column, str]) – The column to check.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the isnotnull function.

static isnan(column)[source]

Check if a column is NaN (Not a Number).

Parameters:: column (Union[Column, str]) – The column to check.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the isnan function.

static when(condition, value=None)[source]

Start a CASE WHEN expression.

Parameters:

condition (Any) – The initial condition.
value (Any) – Optional value for the condition.

Return type:

CaseWhen

Returns:

CaseWhen object for chaining.

static assert_true(condition)[source]

Assert that a condition is true; raise an error if it is false.

Parameters:: condition (Union[Column, ColumnOperation, str]) – Boolean condition to assert.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the assert_true function.

Example

>>> df.select(F.assert_true(F.col("value") > 0))

static ifnull(col1, col2)[source]

Equivalent to coalesce(col1, col2): returns col1 unless it is null, in which case col2 is returned (PySpark 3.5+).

Parameters:

col1 (Union[Column, str]) – First column.
col2 (Union[Column, str]) – Second column (replacement for null).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the ifnull function.

static equal_null(col1, col2)[source]

Null-safe equality check: two NULL values compare as equal, and a NULL compared with a non-NULL value compares as not equal.

Parameters:

col1 (Union[Column, str]) – First column or value.
col2 (Union[Column, str, Any]) – Second column or value.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the equal_null function.

static nullif(col1, col2)[source]

Returns null if col1 equals col2, otherwise returns col1 (PySpark 3.5+).

Parameters:

col1 (Union[Column, str]) – First column.
col2 (Any) – Column, column name, or literal value to compare.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the nullif function.
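
The null-handling functions above (coalesce, ifnull, equal_null, nullif) can be summarised with a plain-Python sketch in which None stands for NULL. These helpers operate on single values and only illustrate the semantics; they are not the Sparkless implementation.

```python
def coalesce(*values):
    """First value that is not None, or None if all are None."""
    return next((v for v in values if v is not None), None)


def ifnull(value, replacement):
    """Two-argument form of coalesce."""
    return coalesce(value, replacement)


def equal_null(a, b):
    """Equality in which None equals None and never equals a real value."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def nullif(a, b):
    """None when a equals b, otherwise a."""
    return None if a == b else a


print(coalesce(None, None, 7), ifnull(None, "n/a"), equal_null(None, None), nullif(5, 5))
```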

static case_when(*conditions, else_value=None)[source]

Create CASE WHEN expression with multiple conditions.

Parameters:

*conditions (Tuple[Any, Any]) – Variable number of (condition, value) tuples.
else_value (Any) – Default value if no conditions match.

Return type:

CaseWhen

Returns:

CaseWhen object representing the CASE WHEN expression.

Example

>>> F.case_when(
...     (F.col("age") > 18, "adult"),
...     (F.col("age") > 12, "teen"),
...     else_value="child"
... )
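
In a CASE WHEN expression the conditions are checked in order and the first match wins, which is why the broader age > 12 test is listed after age > 18 in the example above. The sketch below mimics that evaluation for a single row in plain Python; the function name and the use of Python callables as conditions are illustrative assumptions, not the Sparkless API.

```python
def evaluate_case_when(row, *conditions, else_value=None):
    """Return the value of the first (predicate, value) pair whose predicate holds."""
    for predicate, value in conditions:
        if predicate(row):
            return value
    return else_value


rules = (
    (lambda r: r["age"] > 18, "adult"),
    (lambda r: r["age"] > 12, "teen"),
)

for age in (30, 15, 8):
    print(age, evaluate_case_when({"age": age}, *rules, else_value="child"))
```

Swapping the two rules would label every adult as "teen", because age > 12 would match first.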

static try_add(left, right)[source]

Null-safe addition - returns NULL on error (PySpark 3.5+).

Parameters:

left (Union[Column, str, int, float]) – Left operand (column or literal).
right (Union[Column, str, int, float]) – Right operand (column or literal).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the try_add function.

static try_subtract(left, right)[source]

Null-safe subtraction - returns NULL on error (PySpark 3.5+).

Parameters:

left (Union[Column, str, int, float]) – Left operand (column or literal).
right (Union[Column, str, int, float]) – Right operand (column or literal).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the try_subtract function.

static try_multiply(left, right)[source]

Null-safe multiplication - returns NULL on error (PySpark 3.5+).

Parameters:

left (Union[Column, str, int, float]) – Left operand (column or literal).
right (Union[Column, str, int, float]) – Right operand (column or literal).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the try_multiply function.

static try_divide(left, right)[source]

Null-safe division - returns NULL on error (PySpark 3.5+).

Parameters:

left (Union[Column, str, int, float]) – Left operand (column or literal).
right (Union[Column, str, int, float]) – Right operand (column or literal).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the try_divide function.
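
The try_* arithmetic functions return NULL where the plain operation would fail. The plain-Python sketch below illustrates the idea for try_divide and try_add. Which conditions count as errors is an assumption here: division by zero for try_divide, and a result outside the signed 64-bit range for try_add.

```python
INT64_MIN, INT64_MAX = -(2 ** 63), 2 ** 63 - 1


def try_divide(left, right):
    """Division that yields None for null inputs or a zero divisor."""
    if left is None or right is None or right == 0:
        return None
    return left / right


def try_add(left, right):
    """Addition that yields None for null inputs or 64-bit integer overflow."""
    if left is None or right is None:
        return None
    result = left + right
    # Assumption: an integer result outside the signed 64-bit range is an error.
    if isinstance(result, int) and not INT64_MIN <= result <= INT64_MAX:
        return None
    return result


print(try_divide(10, 4), try_divide(10, 0), try_add(1, 2), try_add(INT64_MAX, 1))
```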

static try_sum(column)[source]

Null-safe sum aggregate - returns NULL on error (PySpark 3.5+).

Parameters:: column (Union[Column, str]) – The column to sum.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the try_sum function.

static try_avg(column)[source]

Null-safe average aggregate - returns NULL on error (PySpark 3.5+).

Parameters:: column (Union[Column, str]) – The column to average.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the try_avg function.

static try_element_at(column, index)[source]

Null-safe element_at - returns NULL on error (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – The column containing array or map.
index (Union[Column, str, int]) – The index or key to access.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the try_element_at function.

static try_to_binary(column, format=None)[source]

Null-safe to_binary - returns NULL on error (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – The column to convert to binary.
format (Optional[str]) – Optional format (‘hex’, ‘base64’, ‘utf-8’).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the try_to_binary function.

static try_to_number(column, format=None)[source]

Null-safe to_number - returns NULL on error (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – The column to convert to number.
format (Optional[str]) – Optional format string.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the try_to_number function.

static try_to_timestamp(column, format=None)[source]

Null-safe to_timestamp - returns NULL on error (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – The column to convert to timestamp.
format (Optional[str]) – Optional format string.

Return type:

ColumnOperation

Returns:

ColumnOperation representing the try_to_timestamp function.

Bitwise Functions

Bitwise functions for Sparkless (PySpark 3.2+).

This module provides bitwise operations on integer columns.

class sparkless.functions.bitwise.BitwiseFunctions[source]

Bases: object

Collection of bitwise manipulation functions.

static bit_count(column)[source]

Count the number of set bits (population count).

Parameters:: column (Union[Column, str]) – Integer column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the bit_count function.

Example

>>> df.select(F.bit_count(F.col("value")))

static bit_get(column, pos)[source]

Get bit value at position.

Parameters:

column (Union[Column, str]) – Integer column.
pos (int) – Bit position (0-based, from right).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the bit_get function.

Example

>>> df.select(F.bit_get(F.col("value"), 0))

static getbit(column, pos)[source]

Get bit value at position (alias for bit_get) (PySpark 3.5+).

Parameters:

column (Union[Column, str]) – Integer column.
pos (int) – Bit position (0-based, from right).

Return type:

ColumnOperation

Returns:

ColumnOperation representing the getbit function.

Example

>>> df.select(F.getbit(F.col("value"), 0))
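
On a single integer, bit_count and bit_get (and its alias getbit) reduce to a few bitwise operations. The sketch below shows them in plain Python. Treating negative numbers as 64-bit two's-complement values in bit_count is an assumption, since Python integers have no fixed width.

```python
MASK64 = (1 << 64) - 1


def bit_count(value):
    """Number of set bits, viewing the value as a 64-bit two's-complement integer."""
    return bin(value & MASK64).count("1")


def bit_get(value, pos):
    """Bit at position pos, where position 0 is the least significant bit."""
    return (value >> pos) & 1


value = 11  # binary 1011
print(bit_count(value), [bit_get(value, p) for p in range(4)])
```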

static bitwise_not(column)[source]

Perform bitwise NOT operation.

Parameters:: column (Union[Column, str]) – Integer column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the bitwise_not function.

Example

>>> df.select(F.bitwise_not(F.col("value")))

static bit_and(column)[source]

Aggregate function - bitwise AND of all values (PySpark 3.5+).

Parameters:: column (Union[Column, str]) – Integer column.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the bit_and aggregate function.

Example

>>> df.groupBy("dept").agg(F.bit_and("flags"))

static bit_or(column)[source]

Aggregate function - bitwise OR of all values (PySpark 3.5+).

Parameters:: column (Union[Column, str]) – Integer column.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the bit_or aggregate function.

Example

>>> df.groupBy("dept").agg(F.bit_or("flags"))

static bit_xor(column)[source]

Aggregate function - bitwise XOR of all values (PySpark 3.5+).

Parameters:: column (Union[Column, str]) – Integer column.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the bit_xor aggregate function.

Example

>>> df.groupBy("dept").agg(F.bit_xor("flags"))
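The three aggregates above each fold one bitwise operator over the values in a group. A minimal plain-Python sketch of that folding, assuming non-null integers (null handling is not modeled, and `bit_agg` is a hypothetical helper, not part of sparkless):

```python
from functools import reduce
from operator import and_, or_, xor

# Hypothetical sample data: integer flags grouped by department.
flags_by_dept = {"eng": [0b0111, 0b0101], "ops": [0b0011, 0b0110, 0b0100]}


def bit_agg(op, values):
    """Fold a binary bitwise operator over a non-empty list of integers."""
    return reduce(op, values)


summary = {
    dept: {"and": bit_agg(and_, v), "or": bit_agg(or_, v), "xor": bit_agg(xor, v)}
    for dept, v in flags_by_dept.items()
}
print(summary["eng"])  # {'and': 5, 'or': 7, 'xor': 2}
```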

static bitwiseNOT(column)[source]

Deprecated alias for bitwise_not (all PySpark versions).

Use bitwise_not instead.

Parameters:: column (Union[Column, str]) – Integer column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing bitwise NOT.

static shiftleft(column, num_bits)[source]

Bitwise left shift.

Parameters:

column (Union[Column, str]) – Integer column.
num_bits (Union[Column, str, int]) – Number of bits to shift left.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the shiftleft function.

static shiftright(column, num_bits)[source]

Bitwise right shift (signed).

Parameters:

column (Union[Column, str]) – Integer column.
num_bits (Union[Column, str, int]) – Number of bits to shift right.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the shiftright function.

static shiftrightunsigned(column, num_bits)[source]

Bitwise unsigned right shift.

Parameters:

column (Union[Column, str]) – Integer column.
num_bits (Union[Column, str, int]) – Number of bits to shift right.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the shiftrightunsigned function.
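The difference between the signed and unsigned right shifts only shows up for negative inputs. The sketch below illustrates it in plain Python, assuming a 32-bit integer column (a 64-bit column would need a 64-bit mask); both helper names are hypothetical.

```python
def shift_right_signed(value: int, num_bits: int) -> int:
    """Arithmetic shift: the sign is preserved, so negatives stay negative."""
    return value >> num_bits


def shift_right_unsigned_32(value: int, num_bits: int) -> int:
    """Logical shift on a 32-bit pattern: vacated high bits are filled with zeros."""
    # Assumption: 32-bit width. Masking reinterprets a negative int as its
    # two's-complement bit pattern before shifting.
    return (value & 0xFFFFFFFF) >> num_bits


print(shift_right_signed(-8, 1))       # -4
print(shift_right_unsigned_32(-8, 1))  # 2147483644
print(shift_right_unsigned_32(8, 1))   # 4
```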

static shiftLeft(column, num_bits)[source]

Deprecated alias for shiftleft (PySpark 3.0-3.1).

Use shiftleft instead.

Parameters:

column (Union[Column, str]) – Integer column.
num_bits (Union[Column, str, int]) – Number of bits to shift left.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the shiftLeft function.

static shiftRight(column, num_bits)[source]

Deprecated alias for shiftright (PySpark 3.0-3.1).

Use shiftright instead.

Parameters:

column (Union[Column, str]) – Integer column.
num_bits (Union[Column, str, int]) – Number of bits to shift right.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the shiftRight function.

static shiftRightUnsigned(column, num_bits)[source]

Deprecated alias for shiftrightunsigned (PySpark 3.0-3.1).

Use shiftrightunsigned instead.

Parameters:

column (Union[Column, str]) – Integer column.
num_bits (Union[Column, str, int]) – Number of bits to shift right.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the shiftRightUnsigned function.

static bitmap_bit_position(column)[source]

Get the bit position in a bitmap (PySpark 3.5+).

Parameters:: column (Union[Column, str]) – Bitmap column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the bitmap_bit_position function.

Example

>>> df.select(F.bitmap_bit_position(F.col("bitmap")))

static bitmap_bucket_number(column)[source]

Get the bucket number in a bitmap (PySpark 3.5+).

Parameters:: column (Union[Column, str]) – Bitmap column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the bitmap_bucket_number function.

Example

>>> df.select(F.bitmap_bucket_number(F.col("bitmap")))

static bitmap_construct_agg(column)[source]

Aggregate function - construct bitmap from values (PySpark 3.5+).

Parameters:: column (Union[Column, str]) – Integer column to construct bitmap from.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the bitmap_construct_agg function.

Example

>>> df.groupBy("dept").agg(F.bitmap_construct_agg("id"))

static bitmap_count(column)[source]

Count the number of set bits in a bitmap (PySpark 3.5+).

Parameters:: column (Union[Column, str]) – Bitmap column.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the bitmap_count function.

Example

>>> df.select(F.bitmap_count(F.col("bitmap")))

static bitmap_or_agg(column)[source]

Aggregate function - bitwise OR of bitmaps (PySpark 3.5+).

Parameters:: column (Union[Column, str]) – Bitmap column.
Return type:: AggregateFunction
Returns:: AggregateFunction representing the bitmap_or_agg function.

Example

>>> df.groupBy("dept").agg(F.bitmap_or_agg("bitmap"))

Window Functions

Window functions for Sparkless.

This module contains window function implementations including row_number, rank, etc.

class sparkless.functions.window_execution.WindowFunction(function, window_spec)[source]

Bases: object

Represents a window function.

This class handles window functions like row_number(), rank(), etc. that operate over a window specification.

__init__(function, window_spec)[source]

Initialize WindowFunction.

Parameters:

function (Any) – The window function (e.g., row_number(), rank()).
window_spec (WindowSpec) – The window specification.

alias(name)[source]

Create an alias for this window function.

Parameters:: name (str) – The alias name.
Return type:: WindowFunction
Returns:: Self for method chaining.

cast(data_type)[source]

Cast the window function result to a different data type.

Parameters:: data_type (Any) – The target data type (DataType instance or string type name).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the cast operation.

Example

>>> F.row_number().over(window_spec).cast("long")

__mul__(other)[source]

Multiply window function result by a value.

Parameters:: other (Any) – The value to multiply by.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the multiplication.

Example

>>> F.percent_rank().over(window) * 100

__rmul__(other)[source]

Reverse multiply (e.g., 100 * window_func).

Parameters:: other (Any) – The value to multiply.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the multiplication.

Example

>>> 100 * F.percent_rank().over(window)

__add__(other)[source]

Add a value to window function result.

Parameters:: other (Any) – The value to add.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the addition.

Example

>>> F.row_number().over(window) + 1

__radd__(other)[source]

Reverse add (e.g., 1 + window_func).

Parameters:: other (Any) – The value to add.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the addition.

Example

>>> 1 + F.row_number().over(window)

__sub__(other)[source]

Subtract a value from window function result.

Parameters:: other (Any) – The value to subtract.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the subtraction.

Example

>>> F.row_number().over(window) - 1

__rsub__(other)[source]

Reverse subtract (e.g., 10 - window_func).

Parameters:: other (Any) – The value to subtract from.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the subtraction.

Example

>>> 10 - F.row_number().over(window)

__truediv__(other)[source]

Divide window function result by a value.

Parameters:: other (Any) – The value to divide by.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the division.

Example

>>> F.row_number().over(window) / 10

__rtruediv__(other)[source]

Reverse divide (e.g., 100 / window_func).

Parameters:: other (Any) – The value to divide.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the division.

Example

>>> 100 / F.row_number().over(window)

__neg__()[source]

Negate window function result.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the negation.

Example

>>> -F.row_number().over(window)

__eq__(other)[source]

Equality comparison.

Parameters:: other (Any) – The value to compare with.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the equality comparison.

Example

>>> F.row_number().over(window) == 1

__ne__(other)[source]

Inequality comparison.

Parameters:: other (Any) – The value to compare with.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the inequality comparison.

Example

>>> F.row_number().over(window) != 0

__lt__(other)[source]

Less than comparison.

Parameters:: other (Any) – The value to compare with.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the less than comparison.

Example

>>> F.row_number().over(window) < 5

__le__(other)[source]

Less than or equal comparison.

Parameters:: other (Any) – The value to compare with.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the less than or equal comparison.

Example

>>> F.row_number().over(window) <= 10

__gt__(other)[source]

Greater than comparison.

Parameters:: other (Any) – The value to compare with.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the greater than comparison.

Example

>>> F.row_number().over(window) > 0

__ge__(other)[source]

Greater than or equal comparison.

Parameters:: other (Any) – The value to compare with.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the greater than or equal comparison.

Example

>>> F.row_number().over(window) >= 1

isnull()[source]

Check if window function result is null.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the isnull check.

Example

>>> F.lag("value", 1).over(window).isnull()

isnotnull()[source]

Check if window function result is not null.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the isnotnull check.

Example

>>> F.lag("value", 1).over(window).isnotnull()

eqNullSafe(other)[source]

Null-safe equality comparison (PySpark eqNullSafe).

Parameters:: other (Any) – The value to compare with.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the null-safe equality comparison.

Example

>>> F.row_number().over(window).eqNullSafe(1)

evaluate(data)[source]

Evaluate the window function over the data.

Parameters:: data (List[Dict[str, Any]]) – List of data rows.
Return type:: List[Any]
Returns:: List of window function results.
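To make the input and output shapes concrete, here is a plain-Python sketch of what evaluating row_number() over a partitioned, ordered window could look like for a list of row dictionaries. It is an illustration under simplifying assumptions (one partition column, one ascending order column, no nulls); `row_number_over` is a hypothetical helper, not the library's evaluate() implementation.

```python
from collections import defaultdict
from typing import Any, Dict, List


def row_number_over(
    data: List[Dict[str, Any]], partition_by: str, order_by: str
) -> List[int]:
    """Return a 1-based row number per input row, restarting in each partition."""
    # Group row indices by partition key so results can be written back in input order.
    partitions = defaultdict(list)
    for idx, row in enumerate(data):
        partitions[row[partition_by]].append(idx)

    result = [0] * len(data)
    for indices in partitions.values():
        ordered = sorted(indices, key=lambda i: data[i][order_by])
        for number, i in enumerate(ordered, start=1):
            result[i] = number
    return result


rows = [
    {"dept": "eng", "salary": 120},
    {"dept": "ops", "salary": 90},
    {"dept": "eng", "salary": 100},
]
print(row_number_over(rows, "dept", "salary"))  # [2, 1, 1]
```

The result list is aligned with the input rows, which matches the "list of results" shape described for evaluate().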

XML Functions

XML functions for PySpark 3.2+ compatibility.

class sparkless.functions.xml.XMLFunctions[source]

Bases: object

XML parsing and manipulation functions.

static from_xml(col, schema)[source]

Parse XML string to struct based on schema.

Parameters:

col (Union[Column, str]) – Column containing XML strings.
schema (str) – Schema definition string.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the from_xml function.

Example

>>> df.select(F.from_xml(F.col("xml"), "name STRING, age INT"))

static to_xml(col)[source]

Convert struct column to XML string.

Parameters:: col (Union[Column, ColumnOperation, str]) – Struct column to convert.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the to_xml function.

Example

>>> df.select(F.to_xml(F.struct(F.col("name"), F.col("age"))))

static schema_of_xml(col)[source]

Infer schema from XML string.

Parameters:: col (Union[Column, str]) – Column containing XML strings.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the schema_of_xml function.

Example

>>> df.select(F.schema_of_xml(F.col("xml")))

static xpath(xml, path)[source]

Extract array of values from XML using XPath.

Parameters:

xml (Union[Column, str]) – Column containing XML strings.
path (str) – XPath expression.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the xpath function.

Example

>>> df.select(F.xpath(F.col("xml"), "/root/item"))
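For intuition about what an XPath extraction yields per row, the sketch below uses Python's standard xml.etree.ElementTree on a single XML string. ElementTree supports only a subset of XPath and resolves paths relative to the root element, so the absolute path "/root/item" corresponds to "item" here. The helper `xpath_texts` is hypothetical and does not reflect how sparkless evaluates xpath.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def xpath_texts(xml: Optional[str], relative_path: str) -> Optional[List[Optional[str]]]:
    """Return the text of every element matching `relative_path` under the root."""
    if xml is None:
        return None  # null in, null out
    root = ET.fromstring(xml)
    return [node.text for node in root.findall(relative_path)]


doc = "<root><item>a</item><item>b</item></root>"
print(xpath_texts(doc, "item"))  # ['a', 'b']
```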

static xpath_boolean(xml, path)[source]

Evaluate XPath expression to boolean.

Parameters:

xml (Union[Column, str]) – Column containing XML strings.
path (str) – XPath expression.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the xpath_boolean function.

Example

>>> df.select(F.xpath_boolean(F.col("xml"), "/root/active='true'"))

static xpath_double(xml, path)[source]

Extract double value from XML using XPath.

Parameters:

xml (Union[Column, str]) – Column containing XML strings.
path (str) – XPath expression.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the xpath_double function.

Example

>>> df.select(F.xpath_double(F.col("xml"), "/root/value"))

static xpath_float(xml, path)[source]

Extract float value from XML using XPath.

Parameters:

xml (Union[Column, str]) – Column containing XML strings.
path (str) – XPath expression.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the xpath_float function.

Example

>>> df.select(F.xpath_float(F.col("xml"), "/root/price"))

static xpath_int(xml, path)[source]

Extract integer value from XML using XPath.

Parameters:

xml (Union[Column, str]) – Column containing XML strings.
path (str) – XPath expression.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the xpath_int function.

Example

>>> df.select(F.xpath_int(F.col("xml"), "/root/age"))

static xpath_long(xml, path)[source]

Extract long value from XML using XPath.

Parameters:

xml (Union[Column, str]) – Column containing XML strings.
path (str) – XPath expression.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the xpath_long function.

Example

>>> df.select(F.xpath_long(F.col("xml"), "/root/value"))

static xpath_short(xml, path)[source]

Extract short value from XML using XPath.

Parameters:

xml (Union[Column, str]) – Column containing XML strings.
path (str) – XPath expression.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the xpath_short function.

Example

>>> df.select(F.xpath_short(F.col("xml"), "/root/count"))

static xpath_string(xml, path)[source]

Extract string value from XML using XPath.

Parameters:

xml (Union[Column, str]) – Column containing XML strings.
path (str) – XPath expression.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the xpath_string function.

Example

>>> df.select(F.xpath_string(F.col("xml"), "/root/name"))

Crypto Functions

Cryptographic functions for Sparkless.

This module provides cryptographic functions that match PySpark’s crypto function API. Includes encryption and decryption operations for secure data processing in DataFrames.

Key Features:

AES encryption and decryption
Null-safe cryptographic operations
Type-safe operations with proper return types
Support for both column references and string literals

Example

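>>> from sparkless.sql import SparkSession, functions as F
>>> spark = SparkSession("test")
>>> # PySpark's AES functions expect a 16-, 24-, or 32-byte key.
>>> data = [{"data": "sensitive information", "key": "0123456789abcdef"}]
>>> df = spark.createDataFrame(data)
>>> # Encrypt first so that the "encrypted" column exists before decrypting it.
>>> encrypted_df = df.withColumn("encrypted", F.aes_encrypt(F.col("data"), F.col("key")))
>>> encrypted_df.select(
...     F.aes_decrypt(F.col("encrypted"), F.col("key"))
... ).show()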

class sparkless.functions.crypto.CryptoFunctions[source]

Bases: object

Collection of cryptographic functions.

static aes_encrypt(data, key, mode=None, padding=None)[source]

Encrypt data using AES encryption.

Parameters:

data (Union[Column, str]) – The column containing data to encrypt.
key (Union[Column, str]) – The column containing the encryption key.
mode (Optional[str]) – Encryption mode (optional, defaults to GCM).
padding (Optional[str]) – Padding scheme (optional, defaults to PKCS5).

Return type:: ColumnOperation
Returns:: ColumnOperation representing the aes_encrypt function.

static aes_decrypt(data, key, mode=None, padding=None)[source]

Decrypt data using AES decryption.

Parameters:

data (Union[Column, str]) – The column containing encrypted data.
key (Union[Column, str]) – The column containing the decryption key.
mode (Optional[str]) – Decryption mode (optional, defaults to GCM).
padding (Optional[str]) – Padding scheme (optional, defaults to PKCS5).

Return type:: ColumnOperation
Returns:: ColumnOperation representing the aes_decrypt function.

static try_aes_decrypt(data, key, mode=None, padding=None)[source]

Error-tolerant AES decryption: returns NULL on error instead of raising an exception.

Parameters:

data (Union[Column, str]) – The column containing encrypted data.
key (Union[Column, str]) – The column containing the decryption key.
mode (Optional[str]) – Decryption mode (optional, defaults to GCM).
padding (Optional[str]) – Padding scheme (optional, defaults to PKCS5).

Return type:: ColumnOperation
Returns:: ColumnOperation representing the try_aes_decrypt function.
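The try_ variant differs from aes_decrypt only in its error handling. The pattern can be sketched in plain Python with a stand-in operation; here Base64 decoding plays the role of decryption purely for illustration (it is not encryption), and both function names are hypothetical.

```python
import base64
import binascii
from typing import Optional


def decode_strict(data: str) -> bytes:
    """Stand-in for a decrypt step: raises on malformed input."""
    return base64.b64decode(data, validate=True)


def try_decode(data: Optional[str]) -> Optional[bytes]:
    """Same operation, but failures and null inputs produce None instead of an error."""
    if data is None:
        return None
    try:
        return decode_strict(data)
    except (binascii.Error, ValueError):
        return None


print(try_decode("aGVsbG8="))     # b'hello'
print(try_decode("not base64!"))  # None
```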

JSON/CSV Functions

JSON and CSV functions for Sparkless.

This module provides JSON and CSV processing functions that match PySpark’s API. Includes parsing, generation, and schema inference for JSON and CSV data.

class sparkless.functions.json_csv.JSONCSVFunctions[source]

Bases: object

Collection of JSON and CSV manipulation functions.

static from_json(column, schema, options=None)[source]

Parse JSON string column into struct/array column.

Parameters:

column (Union[Column, str]) – JSON string column
schema (Any) – Target schema
options (Optional[Dict[str, Any]]) – Optional parsing options

Return type:: ColumnOperation
Returns:: ColumnOperation representing from_json

static to_json(column)[source]

Convert struct/array column to JSON string.

Parameters:: column (Union[Column, str]) – Struct or array column
Return type:: ColumnOperation
Returns:: ColumnOperation representing to_json

static get_json_object(column, path)[source]

Extract JSON object at specified path.

Parameters:

column (Union[Column, str]) – JSON string column
path (str) – JSON path (e.g., ‘$.field’)

Return type:: ColumnOperation
Returns:: ColumnOperation representing get_json_object
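As a sketch of what a '$.field' lookup means for a single value, the plain-Python helper below walks a dotted path through nested JSON objects. It is deliberately simplified: array indexes are not supported, and it returns Python values rather than the string results a Spark-style get_json_object produces. The name `json_path_get` is hypothetical.

```python
import json
from typing import Any, Optional


def json_path_get(json_string: Optional[str], path: str) -> Any:
    """Follow a simple '$.a.b' path through nested JSON objects; None if absent."""
    if json_string is None or not path.startswith("$"):
        return None
    try:
        current = json.loads(json_string)
    except json.JSONDecodeError:
        return None  # malformed JSON yields None rather than an error
    # "$.user.name" -> ["user", "name"]; a bare "$" selects the whole document.
    for key in filter(None, path[1:].split(".")):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


sample = '{"user": {"name": "Ann", "age": 31}}'
print(json_path_get(sample, "$.user.name"))  # Ann
```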

static json_tuple(column, *fields)[source]

Extract multiple fields from JSON string.

Parameters:

column (Union[Column, str]) – JSON string column
*fields (str) – Field names to extract

Return type:: ColumnOperation
Returns:: ColumnOperation representing json_tuple

static schema_of_json(json_string)[source]

Infer schema from JSON string.

Parameters:: json_string (str) – Sample JSON string
Return type:: ColumnOperation
Returns:: ColumnOperation representing schema_of_json

static from_csv(column, schema, options=None)[source]

Parse CSV string column into struct column.

Parameters:

column (Union[Column, str]) – CSV string column
schema (Any) – Target schema
options (Optional[Dict[str, Any]]) – Optional parsing options

Return type:: ColumnOperation
Returns:: ColumnOperation representing from_csv

static to_csv(column)[source]

Convert struct column to CSV string.

Parameters:: column (Union[Column, str]) – Struct column
Return type:: ColumnOperation
Returns:: ColumnOperation representing to_csv

static schema_of_csv(csv_string)[source]

Infer schema from CSV string.

Parameters:: csv_string (str) – Sample CSV string
Return type:: ColumnOperation
Returns:: ColumnOperation representing schema_of_csv

Column Operations

Column implementation for Sparkless.

This module provides the Column class for DataFrame column operations, maintaining compatibility with PySpark’s Column interface.

class sparkless.functions.core.column.ColumnOperatorMixin[source]

Bases: object

Mixin providing common operator methods for Column and ColumnOperation.

__eq__(other)[source]

Equality comparison.

Parameters:: other (Any)
Return type:: ColumnOperation

eqNullSafe(other)[source]

Null-safe equality comparison (PySpark eqNullSafe).

This behaves like PySpark’s eqNullSafe:

If both sides are null, the comparison is True.
If exactly one side is null, the comparison is False.
Otherwise, it behaves like standard equality, including any backend-specific type coercion rules.

Parameters:: other (Any)
Return type:: ColumnOperation
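The rules for null-safe equality can be written down directly for a pair of scalar values. This is a plain-Python sketch with None standing in for null; `eq_null_safe` is a hypothetical helper, and backend-specific type coercion is not modeled.

```python
from typing import Any


def eq_null_safe(left: Any, right: Any) -> bool:
    """Null-safe equality: always returns True or False, never null."""
    if left is None and right is None:
        return True   # both sides null
    if left is None or right is None:
        return False  # exactly one side null
    return left == right


print(eq_null_safe(None, None))  # True
print(eq_null_safe(1, None))     # False
print(eq_null_safe(1, 1))        # True
```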

__ne__(other)[source]

Inequality comparison.

Parameters:: other (Any)
Return type:: ColumnOperation

__lt__(other)[source]

Less than comparison.

Parameters:: other (Any)
Return type:: ColumnOperation

__le__(other)[source]

Less than or equal comparison.

Parameters:: other (Any)
Return type:: ColumnOperation

__gt__(other)[source]

Greater than comparison.

Parameters:: other (Any)
Return type:: ColumnOperation

__ge__(other)[source]

Greater than or equal comparison.

Parameters:: other (Any)
Return type:: ColumnOperation

__add__(other)[source]

Addition operation.

Parameters:: other (Any)
Return type:: ColumnOperation

__sub__(other)[source]

Subtraction operation.

Parameters:: other (Any)
Return type:: ColumnOperation

__mul__(other)[source]

Multiplication operation.

Parameters:: other (Any)
Return type:: ColumnOperation

__truediv__(other)[source]

Division operation.

Parameters:: other (Any)
Return type:: ColumnOperation

__mod__(other)[source]

Modulo operation.

Parameters:: other (Any)
Return type:: ColumnOperation

__pow__(other)[source]

Power operation (for col ** 2).

Parameters:: other (Any)
Return type:: ColumnOperation

__radd__(other)[source]

Reverse addition operation (for 2 + col).

Parameters:: other (Any)
Return type:: ColumnOperation

__rsub__(other)[source]

Reverse subtraction operation (for 2 - col).

Parameters:: other (Any)
Return type:: ColumnOperation

__rmul__(other)[source]

Reverse multiplication operation (for 2 * col).

Parameters:: other (Any)
Return type:: ColumnOperation

__rtruediv__(other)[source]

Reverse division operation (for 2 / col).

Parameters:: other (Any)
Return type:: ColumnOperation

__rmod__(other)[source]

Reverse modulo operation (for 2 % col).

Parameters:: other (Any)
Return type:: ColumnOperation

__rpow__(other)[source]

Reverse power operation (for 2 ** col or 3.0 ** col).

Parameters:: other (Any)
Return type:: ColumnOperation
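The reflected methods exist because Python falls back to the right-hand operand when the left-hand one (for example an int) does not know how to handle the expression. The toy class below sketches that mechanism by building a textual expression; `Expr` is a hypothetical stand-in and is not how ColumnOperation is implemented.

```python
from typing import Any


class Expr:
    """Hypothetical expression node that records operations as text."""

    def __init__(self, text: str) -> None:
        self.text = text

    @staticmethod
    def _fmt(operand: Any) -> str:
        return operand.text if isinstance(operand, Expr) else repr(operand)

    def __sub__(self, other: Any) -> "Expr":
        return Expr(f"({self.text} - {self._fmt(other)})")

    def __rsub__(self, other: Any) -> "Expr":
        # Called for `2 - expr`: int.__sub__ returns NotImplemented, so Python
        # tries the right operand. The operand order must be preserved.
        return Expr(f"({self._fmt(other)} - {self.text})")

    def __pow__(self, other: Any) -> "Expr":
        return Expr(f"({self.text} ** {self._fmt(other)})")

    def __rpow__(self, other: Any) -> "Expr":
        return Expr(f"({self._fmt(other)} ** {self.text})")


col = Expr("x")
print((col - 2).text)   # (x - 2)
print((2 - col).text)   # (2 - x)
print((2 ** col).text)  # (2 ** x)
```

Keeping the operands in their written order matters for the non-commutative operators (subtraction, division, modulo, power), which is why each has its own reflected method.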

__and__(other)[source]

Logical AND operation.

Parameters:: other (Any)
Return type:: ColumnOperation

__or__(other)[source]

Logical OR operation.

Parameters:: other (Any)
Return type:: ColumnOperation

__invert__()[source]

Logical NOT operation.

Return type:: ColumnOperation

__neg__()[source]

Unary minus operation (-column).

Return type:: ColumnOperation

isnull()[source]

Check if column value is null.

Return type:: ColumnOperation

isnotnull()[source]

Check if column value is not null.

Return type:: ColumnOperation

isNull()[source]

Check if column value is null (PySpark compatibility).

Return type:: ColumnOperation

isNotNull()[source]

Check if column value is not null (PySpark compatibility).

Return type:: ColumnOperation

isin(*values)[source]

Check if column value is in list of values.

Parameters:: *values (Any) – Variable number of values to check against. Can be passed as individual arguments (e.g., col.isin(1, 2, 3)) or as a single list (e.g., col.isin([1, 2, 3])) for backward compatibility. Supports automatic type coercion for mixed types (e.g., checking integers in a string column will convert values to strings).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the isin check.

Example

>>> df.filter(F.col("value").isin(1, 2, 3))
>>> df.filter(F.col("value").isin([1, 2, 3]))  # Also supported
>>> df.filter(F.col("str_col").isin(1, 2, 3))  # Auto-converts to strings

Note

Fixed in version 3.23.0 (Issue #226): Added support for *values arguments and automatic type coercion for mixed types to match PySpark behavior.
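The two calling conventions and the string coercion described above can be sketched for a single value in plain Python. This is an illustration of the documented behavior under simplifying assumptions (null semantics are not modeled); `is_in` is a hypothetical helper, not the library's code.

```python
from typing import Any


def is_in(value: Any, *values: Any) -> bool:
    """Membership test accepting either separate values or one list of values."""
    # Accept both is_in(v, 1, 2, 3) and is_in(v, [1, 2, 3]).
    if len(values) == 1 and isinstance(values[0], (list, tuple, set)):
        values = tuple(values[0])
    if isinstance(value, str):
        # String column: compare against the string form of each candidate.
        return value in {str(v) for v in values}
    return value in values


print(is_in(2, 1, 2, 3))    # True
print(is_in(2, [1, 2, 3]))  # True
print(is_in("2", 1, 2, 3))  # True
```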

between(lower, upper)[source]

Check if column value is between lower and upper bounds.

Parameters:

lower (Any)
upper (Any)

Return type:

like(pattern)[source]

SQL LIKE pattern matching.

Parameters:: pattern (str)
Return type:: ColumnOperation

rlike(pattern)[source]

Regular expression pattern matching.

Parameters:: pattern (str)
Return type:: ColumnOperation

contains(literal)[source]

Check if column contains the literal string.

Parameters:: literal (str)
Return type:: ColumnOperation

startswith(literal)[source]

Check if column starts with the literal string.

Parameters:: literal (str)
Return type:: ColumnOperation

endswith(literal)[source]

Check if column ends with the literal string.

Parameters:: literal (str)
Return type:: ColumnOperation

substr(start, length)[source]

Extract substring from string column.

Parameters:

start (int) – Starting position (1-indexed, can be negative for reverse indexing).
length (int) – Length of substring (required).

Return type:: ColumnOperation
Returns:: ColumnOperation representing the substr operation.

Example

>>> df.select(F.col("name").substr(1, 2))
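Because the start position is 1-indexed and may be negative, it can help to see the index arithmetic spelled out. The plain-Python sketch below covers the common cases only; treating a start of 0 like 1 is an assumption, edge cases such as a negative start that reaches past the beginning of the string are not modeled, and this top-level `substr` function is a hypothetical helper.

```python
from typing import Optional


def substr(text: Optional[str], start: int, length: int) -> Optional[str]:
    """1-indexed substring; a negative start counts back from the end."""
    if text is None:
        return None
    if start > 0:
        begin = start - 1  # convert 1-indexed to Python's 0-indexed
    elif start < 0:
        begin = max(len(text) + start, 0)
    else:
        begin = 0  # assumption: treat 0 like 1
    return text[begin:begin + max(length, 0)]


print(substr("Alice", 1, 2))   # Al
print(substr("Alice", -3, 2))  # ic
```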

asc()[source]

Ascending sort order.

Return type:: ColumnOperation

desc()[source]

Descending sort order.

Return type:: ColumnOperation

desc_nulls_last()[source]

Descending sort order with nulls last.

Return type:: ColumnOperation

desc_nulls_first()[source]

Descending sort order with nulls first.

Return type:: ColumnOperation

asc_nulls_last()[source]

Ascending sort order with nulls last.

Return type:: ColumnOperation

asc_nulls_first()[source]

Ascending sort order with nulls first.

Return type:: ColumnOperation

cast(data_type)[source]

Cast column to different data type.

Parameters:: data_type (DataType)
Return type:: ColumnOperation

astype(data_type)[source]

Cast column to different data type (alias for cast).

This method is an alias for cast() and matches PySpark’s API.

Parameters:: data_type (Union[DataType, str]) – The target data type (DataType object or string name like “date”, “string”, etc.).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the cast operation.

Example

>>> df.select(F.col("name").astype("string"))
>>> df.select(F.substring("date", 1, 10).astype("date"))

getItem(key)[source]

Get item from array by index or map by key.

Parameters:: key (Any) – Index (int) for array access or key (any) for map access.
Return type:: ColumnOperation
Returns:: ColumnOperation representing the getItem operation. Returns None for out-of-bounds array access (matching PySpark behavior).

Example

>>> df.select(F.col("array_col").getItem(0))
>>> df.select(F.col("map_col").getItem("key"))
>>> df.select(F.col("array_col").getItem(999))  # Returns None if out of bounds

Note

Fixed in version 3.23.0 (Issue #227): Out-of-bounds array access now returns None instead of raising errors, matching PySpark behavior.
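The lookup behavior described above, where a missing element yields None rather than an error, can be sketched for one value in plain Python. `get_item` is a hypothetical helper; returning None for negative indexes is an assumption of this sketch, not something the documentation states.

```python
from typing import Any


def get_item(container: Any, key: Any) -> Any:
    """Index into a list or look up a dict key, returning None when absent."""
    if container is None:
        return None
    if isinstance(container, dict):
        return container.get(key)
    if isinstance(container, (list, tuple)):
        # Assumption: only non-negative, in-range integer indexes resolve.
        if isinstance(key, int) and 0 <= key < len(container):
            return container[key]
        return None
    return None


print(get_item([10, 20, 30], 0))    # 10
print(get_item([10, 20, 30], 999))  # None
print(get_item({"key": "v"}, "key"))  # v
```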

withField(fieldName, col)[source]

Add or replace a field in a struct column.

Parameters:

fieldName (str) – Name of the field to add or replace
col (Union[Column, ColumnOperation, Literal, Any]) – Column expression for the new field value. Can be a Column, ColumnOperation, Literal, or any value that will be converted to a Literal.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the withField operation.

Example

>>> df.withColumn("my_struct", F.col("my_struct").withField("new_field", F.lit("value")))
>>> df.withColumn("my_struct", F.col("my_struct").withField("existing_field", F.col("other_col")))

Note

PySpark 3.1.0+ feature. Works only on struct columns. If field exists, it will be replaced. If it doesn’t exist, it will be added.
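Modeling a struct value as a dictionary makes the add-or-replace rule easy to see. This is a plain-Python sketch; `with_field` is a hypothetical helper, and passing a null struct through unchanged is an assumption of the sketch.

```python
from typing import Any, Dict, Optional


def with_field(
    struct: Optional[Dict[str, Any]], field_name: str, value: Any
) -> Optional[Dict[str, Any]]:
    """Return a copy of `struct` with `field_name` added or replaced."""
    if struct is None:
        return None  # assumption: a null struct stays null
    # Build a new dict so the input row is not mutated.
    return {**struct, field_name: value}


person = {"name": "Ann", "age": 31}
print(with_field(person, "city", "Oslo"))  # {'name': 'Ann', 'age': 31, 'city': 'Oslo'}
print(with_field(person, "age", 32))       # {'name': 'Ann', 'age': 32}
```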

class sparkless.functions.core.column.Column(name, column_type=None)[source]

Bases: ColumnOperatorMixin, IColumn

Mock column expression for DataFrame operations.

Provides a PySpark-compatible column expression that supports all comparison and logical operations. Used for creating complex DataFrame transformations and filtering conditions.

__init__(name, column_type=None)[source]

Initialize Column.

Parameters:

name (str) – Column name.
column_type (Optional[DataType]) – Optional data type. Defaults to StringType if not specified.

property name: str: Get the column name (alias if set, otherwise original name).

property original_column: Column: Get the original column (for aliased columns).

__eq__(other)[source]

Equality comparison.

Parameters:: other (Any)
Return type:: ColumnOperation

__hash__()[source]

Hash method to make Column hashable.

Return type:: int

__getitem__(key)[source]

Support subscript notation for struct field access and map lookup.

Parameters:: key (Any) – Field name (string) for struct field access, or Column for map lookup.
Returns:: For struct: new Column with the struct field path (e.g., “StructVal.E1”). For map: ColumnOperation (getItem) for the map[key_column] lookup.

Example

>>> F.col("StructVal")["E1"]  # Returns Column("StructVal.E1")
>>> F.col("map_col")[F.col("key_col")]  # Map lookup by column (Issue #440)

__str__()[source]

Return string representation of column for SQL generation.

Return type:: str

alias(name)[source]

Create an alias for the column.

Parameters:: name (str)
Return type:: IColumn

getField(index_or_name)[source]

Access array element by index or struct field by name (PySpark getField).

Parameters:: index_or_name (Union[int, str]) – int for array index (same as getItem), str for struct field.
Return type:: Union[Column, ColumnOperation]
Returns:: Column for struct field path, ColumnOperation for array/map access.

Example

>>> df.select(F.col("ArrayVal").getField(0))
>>> df.select(F.col("Person").getField("name"))

when(condition, value)[source]

Start a CASE WHEN expression.

Parameters:

condition (ColumnOperation)
value (Any)

Return type:

otherwise(value)[source]

End a CASE WHEN expression with default value.

Parameters:: value (Any)
Return type:: CaseWhen
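A CASE WHEN expression built with when() and closed with otherwise() picks the value of the first matching branch and falls back to the default. The plain-Python sketch below shows that evaluation order for a single row; `case_when` and the sample branches are hypothetical and do not reflect how CaseWhen is implemented.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Branch = Tuple[Callable[[Row], bool], Any]


def case_when(row: Row, branches: List[Branch], default: Any = None) -> Any:
    """Return the value of the first branch whose condition holds, else `default`."""
    for condition, value in branches:
        if condition(row):
            return value
    return default


branches = [
    (lambda r: r["age"] < 18, "minor"),
    (lambda r: r["age"] < 65, "adult"),
]
print(case_when({"age": 30}, branches, default="senior"))  # adult
```

Branch order matters: a row with age 10 satisfies both conditions but takes the first one.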

over(window_spec)[source]

Apply window function over window specification.

Parameters:: window_spec (WindowSpec)
Return type:: WindowFunction

count()[source]

Count non-null values in this column.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the count operation.

avg()[source]

Average values in this column.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the avg function (PySpark-compatible).

sum()[source]

Sum values in this column.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the sum function (PySpark-compatible).

max()[source]

Maximum value in this column.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the max function (PySpark-compatible).

min()[source]

Minimum value in this column.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the min function (PySpark-compatible).

stddev()[source]

Standard deviation of values in this column.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the stddev function (PySpark-compatible).

variance()[source]

Variance of values in this column.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the variance function (PySpark-compatible).

bitwise_not()[source]

Bitwise NOT operation on this column.

Return type:: ColumnOperation
Returns:: ColumnOperation representing the bitwise_not function.

class sparkless.functions.core.column.ColumnOperation(column, operation, value=None, name=None)[source]

Bases: Column

Represents a column operation (comparison, arithmetic, etc.).

This class encapsulates column operations and their operands for evaluation during DataFrame operations. Inherits from Column to ensure isinstance() checks pass for PySpark compatibility.

__init__(column, operation, value=None, name=None)[source]

Initialize ColumnOperation.

Parameters:

column (Any) – The column being operated on (can be None for some operations).
operation (str) – The operation being performed.
value (Any) – The value or operand for the operation.
name (Optional[str]) – Optional custom name for the operation.

property name: str: Get column name.

__str__()[source]

Generate SQL representation of this operation.

Return type:: str

alias(*alias_names)[source]

Create an alias for this operation. As in PySpark, one or more names may be given; multiple names apply to operations that produce several columns, such as posexplode.

Parameters:: alias_names (str)
Return type:: ColumnOperation

getField(index_or_name)[source]

Access array element by index or struct field by name (PySpark getField).

Parameters:: index_or_name (Union[int, str])
Return type:: ColumnOperation

over(window_spec)[source]

Apply window function over window specification.

Parameters:: window_spec (WindowSpec)
Return type:: WindowFunction

Literals

Literal values for Sparkless.

This module provides Literal class for representing literal values in column expressions and transformations.

class sparkless.functions.core.literals.Literal(value, data_type=None, resolver=None)[source]

Bases: IColumn

Literal value for DataFrame operations.

Represents a literal value that can be used in column expressions and transformations, maintaining compatibility with PySpark’s lit function.

Parameters:

value (Any)
data_type (Optional[DataType])
resolver (Optional[Callable[[], Any]])

__init__(value, data_type=None, resolver=None)[source]

Initialize Literal.

Parameters:

value (Any) – The literal value.
data_type (Optional[DataType]) – Optional data type. Inferred from value if not specified.
resolver (Optional[Callable[[], Any]]) – Optional callable that returns the resolved value at evaluation time. The resolver should handle session resolution internally.
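The resolver parameter describes a value that is looked up at evaluation time rather than fixed at construction time. The sketch below shows that general pattern in plain Python. LazyValue, its resolve() method, and the settings dictionary are hypothetical names chosen for illustration; they are not part of Sparkless.

```python
from typing import Any, Callable, Optional


class LazyValue:
    """Hypothetical holder that defers to a resolver when one is supplied."""

    def __init__(self, value: Any, resolver: Optional[Callable[[], Any]] = None):
        self.value = value
        self.resolver = resolver

    def resolve(self) -> Any:
        # Call the resolver only when the value is actually needed.
        if self.resolver is not None:
            return self.resolver()
        return self.value


# Stand-in for session state that may change after the literal is created.
settings = {"current_db": "default"}

lazy = LazyValue(None, resolver=lambda: settings["current_db"])
settings["current_db"] = "analytics"

resolved = lazy.resolve()            # reflects the state at evaluation time
fixed = LazyValue(42).resolve()      # no resolver: the stored value is used
```

Because the callable takes no arguments, any state it needs (such as the active session) has to be captured or looked up inside it, which matches the note that the resolver handles session resolution internally.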

property name: str: Get literal name.

__eq__(other)[source]

Equality comparison.

Note: Returns ColumnOperation instead of bool for PySpark compatibility.

Parameters:: other (Any)
Return type:: ColumnOperation

__ne__(other)[source]

Inequality comparison.

Note: Returns ColumnOperation instead of bool for PySpark compatibility.

Parameters:: other (Any)
Return type:: ColumnOperation
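The notes on __eq__ and __ne__ mean that a comparison builds an expression object instead of producing True or False on the spot. A minimal sketch of that idea, using a hypothetical ToyLiteral class and plain tuples in place of ColumnOperation:

```python
class ToyLiteral:
    """Hypothetical stand-in used only to illustrate deferred comparison."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        # Describe the comparison instead of evaluating it.
        return ("==", self.value, other)

    def __ne__(self, other):
        return ("!=", self.value, other)


node = ToyLiteral(5) == 5         # an expression description, not a bool
other_node = ToyLiteral(5) != 5
```

A result like this is meant to be handed to an engine that evaluates it per row, so it should not be used directly as the condition of a Python if statement.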

__lt__(other)[source]

Less than comparison.

Parameters:: other (Any)
Return type:: IColumn

__le__(other)[source]

Less than or equal comparison.

Parameters:: other (Any)
Return type:: IColumn

__gt__(other)[source]

Greater than comparison.

Parameters:: other (Any)
Return type:: IColumn

__ge__(other)[source]

Greater than or equal comparison.

Parameters:: other (Any)
Return type:: IColumn

__add__(other)[source]

Addition operation.

Parameters:: other (Any)
Return type:: IColumn

__sub__(other)[source]

Subtraction operation.

Parameters:: other (Any)
Return type:: IColumn

__mul__(other)[source]

Multiplication operation.

Parameters:: other (Any)
Return type:: IColumn

__truediv__(other)[source]

Division operation.

Parameters:: other (Any)
Return type:: IColumn

__mod__(other)[source]

Modulo operation.

Parameters:: other (Any)
Return type:: IColumn

__and__(other)[source]

Logical AND operation.

Parameters:: other (Any)
Return type:: IColumn

__or__(other)[source]

Logical OR operation.

Parameters:: other (Any)
Return type:: IColumn

__invert__()[source]

Logical NOT operation.

Return type:: IColumn

__neg__()[source]

Unary minus operation (-literal).

Return type:: ColumnOperation

isnull()[source]

Check if literal value is null.

Return type:: ColumnOperation

isnotnull()[source]

Check if literal value is not null.

Return type:: ColumnOperation

isNull()[source]

Check if literal value is null (PySpark compatibility).

Return type:: ColumnOperation

isNotNull()[source]

Check if literal value is not null (PySpark compatibility).

Return type:: ColumnOperation

eqNullSafe(other)[source]

Null-safe equality comparison (PySpark eqNullSafe).

This behaves like PySpark’s eqNullSafe:

- If both sides are null, the comparison is True.
- If exactly one side is null, the comparison is False.
- Otherwise, it behaves like standard equality, including any backend-specific type coercion rules.

Parameters:: other (Any)
Return type:: ColumnOperation
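The three rules described for eqNullSafe can be sketched for a single pair of scalar values as follows. The helper null_safe_eq is a hypothetical name, None stands in for null, and backend-specific type coercion is not modelled.

```python
from typing import Any


def null_safe_eq(left: Any, right: Any) -> bool:
    """Scalar sketch of null-safe equality; None plays the role of null."""
    if left is None and right is None:
        return True       # both sides null
    if left is None or right is None:
        return False      # exactly one side null
    return left == right  # ordinary equality otherwise


both_null = null_safe_eq(None, None)   # True
one_null = null_safe_eq(None, 1)       # False
equal = null_safe_eq(1, 1)             # True
```

Unlike ordinary SQL equality, which yields null when either operand is null, this comparison always yields a definite True or False.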

isin(*values)[source]

Check if literal value is in list of values.

Parameters:: values (Any)
Return type:: ColumnOperation

between(lower, upper)[source]

Check if literal value is between lower and upper bounds.

Parameters:

lower (Any)
upper (Any)

Return type:

like(pattern)[source]

SQL LIKE pattern matching.

Parameters:: pattern (str)
Return type:: ColumnOperation
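For readers unfamiliar with SQL LIKE, the standard wildcards are % (any run of characters, including none) and _ (exactly one character), and the pattern must match the whole string. The sketch below translates such a pattern to a Python regular expression. The helper names are hypothetical, escape characters are not handled, and this is not necessarily how Sparkless evaluates like().

```python
import re


def like_to_regex(pattern: str) -> "re.Pattern":
    """Translate a SQL LIKE pattern to a compiled regex (no escape support)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # any run of characters
        elif ch == "_":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def sql_like(text: str, pattern: str) -> bool:
    # LIKE must match the entire string, hence fullmatch.
    return like_to_regex(pattern).fullmatch(text) is not None


starts_with_a = sql_like("alice", "a%")      # True
one_char_gap = sql_like("alice", "a_ice")    # True
substring_only = sql_like("alice", "lic")    # False: no wildcards around it
```

Characters that are special in regular expressions, such as the dot, are treated literally in a LIKE pattern, which is why each one is escaped during translation.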

rlike(pattern)[source]

Regular expression pattern matching.

Parameters:: pattern (str)
Return type:: ColumnOperation

alias(name)[source]

Create an alias for the literal.

Parameters:: name (str)
Return type:: Literal

asc()[source]

Ascending sort order.

Return type:: ColumnOperation

desc()[source]

Descending sort order.

Return type:: ColumnOperation

cast(data_type)[source]

Cast literal to different data type.

Parameters:: data_type (Union[DataType, str])
Return type:: ColumnOperation

astype(data_type)[source]

Cast literal to different data type (alias for cast).

This method is an alias for cast() and matches PySpark’s API.

Parameters:: data_type (Union[DataType, str]) – The target data type (DataType object or string name).
Return type:: ColumnOperation
Returns:: ColumnOperation representing the cast operation.

Example

>>> F.lit(1).astype("string")

when(condition, value)[source]

Start a CASE WHEN expression.

Parameters:

condition (ColumnOperation)
value (Any)

Return type: