Functions Reference
The Functions module provides access to all PySpark-compatible functions through the F namespace.
Main Functions Module
Core functions module for Sparkless.
This module provides the main F namespace and re-exports all function classes for backward compatibility with the original functions.py structure. The Functions class serves as the primary interface for all PySpark-compatible functions.
Key Features:
- Complete PySpark F namespace compatibility
- Column functions (col, lit, when, coalesce, isnull)
- String functions (upper, lower, length, trim, regexp_replace, split)
- Math functions (abs, round, ceil, floor, sqrt, exp, log, pow, sin, cos, tan)
- Aggregate functions (count, sum, avg, max, min, stddev, variance)
- DateTime functions (current_timestamp, current_date, to_date, to_timestamp)
- Window functions (row_number, rank, dense_rank, lag, lead)
Example
>>> from sparkless.sql import SparkSession, functions as F
>>> spark = SparkSession("test")
>>> data = [{"name": "Alice", "age": 25}]
>>> df = spark.createDataFrame(data)
>>> df.select(F.upper(F.col("name")), F.col("age") * 2).show()
DataFrame[1 rows, 2 columns]
upper(name) (age * 2)
ALICE 50.0
- class sparkless.functions.functions.Column(name, column_type=None)[source]
Bases: ColumnOperatorMixin, IColumn
Mock column expression for DataFrame operations.
Provides a PySpark-compatible column expression that supports all comparison and logical operations. Used for creating complex DataFrame transformations and filtering conditions.
Initialize Column.
- Parameters:
- __getitem__(key)[source]
Support subscript notation for struct field access and map lookup.
- Parameters:
key (Any) – Field name (string) for struct field access, or Column for map lookup.
- Returns:
For struct: new Column with the struct field path (e.g., "StructVal.E1"). For map: ColumnOperation getItem for map[key_column] lookup.
Example
>>> F.col("StructVal")["E1"]  # Returns Column("StructVal.E1")
>>> F.col("map_col")[F.col("key_col")]  # Map lookup by column (Issue #440)
- getField(index_or_name)[source]
Access array element by index or struct field by name (PySpark getField).
- Parameters:
index_or_name (Union[int, str]) – int for array index (same as getItem), str for struct field.
- Return type:
Union[Column, ColumnOperation]
- Returns:
Column for struct field path, ColumnOperation for array/map access.
Example
>>> df.select(F.col("ArrayVal").getField(0))
>>> df.select(F.col("Person").getField("name"))
- when(condition, value)[source]
Start a CASE WHEN expression.
- Parameters:
condition (ColumnOperation)
value (Any)
- Return type:
- over(window_spec)[source]
Apply window function over window specification.
- Parameters:
window_spec (WindowSpec)
- Return type:
- count()[source]
Count non-null values in this column.
- Return type:
- Returns:
ColumnOperation representing the count operation.
- avg()[source]
Average values in this column.
- Return type:
- Returns:
ColumnOperation representing the avg function (PySpark-compatible).
- sum()[source]
Sum values in this column.
- Return type:
- Returns:
ColumnOperation representing the sum function (PySpark-compatible).
- max()[source]
Maximum value in this column.
- Return type:
- Returns:
ColumnOperation representing the max function (PySpark-compatible).
- min()[source]
Minimum value in this column.
- Return type:
- Returns:
ColumnOperation representing the min function (PySpark-compatible).
- stddev()[source]
Standard deviation of values in this column.
- Return type:
- Returns:
ColumnOperation representing the stddev function (PySpark-compatible).
- variance()[source]
Variance of values in this column.
- Return type:
- Returns:
ColumnOperation representing the variance function (PySpark-compatible).
- class sparkless.functions.functions.ColumnOperation(column, operation, value=None, name=None)[source]
Bases: Column
Represents a column operation (comparison, arithmetic, etc.).
This class encapsulates column operations and their operands for evaluation during DataFrame operations. Inherits from Column to ensure isinstance() checks pass for PySpark compatibility.
Initialize ColumnOperation.
- Parameters:
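The deferred-evaluation pattern described above can be sketched in plain Python. This is a toy model, not Sparkless's implementation: the names ToyColumn, ToyOperation, and evaluate are hypothetical, chosen only to show an operation node that records its operands for later evaluation and subclasses the column type so that isinstance() checks succeed.

```python
import operator
from typing import Any, Callable, Dict

# Hypothetical toy classes: they model the pattern only, not Sparkless internals.
_OPERATORS: Dict[str, Callable[[Any, Any], Any]] = {
    "+": operator.add,
    "*": operator.mul,
    ">": operator.gt,
}


class ToyColumn:
    """A named reference to one field of a row (a plain dict here)."""

    def __init__(self, name: str) -> None:
        self.name = name

    def evaluate(self, row: Dict[str, Any]) -> Any:
        return row[self.name]

    # Each operator returns a new expression node instead of a computed value.
    def __add__(self, other: Any) -> "ToyOperation":
        return ToyOperation(self, "+", other)

    def __mul__(self, other: Any) -> "ToyOperation":
        return ToyOperation(self, "*", other)

    def __gt__(self, other: Any) -> "ToyOperation":
        return ToyOperation(self, ">", other)


class ToyOperation(ToyColumn):
    """Stores a column, an operation name, and an operand for later evaluation."""

    def __init__(self, column: ToyColumn, operation: str, value: Any = None) -> None:
        label = value.name if isinstance(value, ToyColumn) else repr(value)
        super().__init__(f"({column.name} {operation} {label})")
        self.column = column
        self.operation = operation
        self.value = value

    def evaluate(self, row: Dict[str, Any]) -> Any:
        left = self.column.evaluate(row)
        value = self.value
        right = value.evaluate(row) if isinstance(value, ToyColumn) else value
        return _OPERATORS[self.operation](left, right)


# Building the expression performs no arithmetic; evaluation happens per row.
expr = (ToyColumn("age") * 2) > 40
print(expr.name, expr.evaluate({"age": 25}))
```

Because ToyOperation subclasses ToyColumn, an operation can be used anywhere a column is accepted, which is the property the isinstance() remark above relies on.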
- alias(*alias_names)[source]
Create an alias for this operation. As in PySpark, one or more names may be given (e.g. for posexplode, which produces more than one column).
- Parameters:
alias_names (str)
- Return type:
- getField(index_or_name)[source]
Access array element by index or struct field by name (PySpark getField).
- Parameters:
- Return type:
- class sparkless.functions.functions.Literal(value, data_type=None, resolver=None)[source]
Bases: IColumn
Literal value for DataFrame operations.
Represents a literal value that can be used in column expressions and transformations, maintaining compatibility with PySpark’s lit function.
Initialize Literal.
- Parameters:
- __eq__(other)[source]
Equality comparison.
Note: Returns ColumnOperation instead of bool for PySpark compatibility.
- Parameters:
other (Any)
- Return type:
- __ne__(other)[source]
Inequality comparison.
Note: Returns ColumnOperation instead of bool for PySpark compatibility.
- Parameters:
other (Any)
- Return type:
- __ge__(other)[source]
Greater than or equal comparison.
- Parameters:
other (Any)
- Return type:
IColumn
- eqNullSafe(other)[source]
Null-safe equality comparison (PySpark eqNullSafe).
This behaves like PySpark’s eqNullSafe:
- If both sides are null, the comparison is True.
- If exactly one side is null, the comparison is False.
- Otherwise, it behaves like standard equality, including any backend-specific type coercion rules.
- Parameters:
other (Any)
- Return type:
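The three null-safe rules can be written out as a small plain-Python model, with None standing in for null. The helper names eq_null_safe and sql_equals are hypothetical and exist only for this illustration; type coercion, which the description leaves to the backend, is not modeled.

```python
from typing import Any, Optional


def eq_null_safe(left: Any, right: Any) -> bool:
    """Null-safe equality: always returns True or False, never null."""
    if left is None and right is None:
        return True
    if left is None or right is None:
        return False
    return left == right


def sql_equals(left: Any, right: Any) -> Optional[bool]:
    """Ordinary SQL equality for contrast: null on either side yields null."""
    if left is None or right is None:
        return None
    return left == right


for pair in [(None, None), (None, 1), (1, 1), (1, 2)]:
    print(pair, eq_null_safe(*pair), sql_equals(*pair))
```

The contrast matters mostly in join and filter conditions, where a null result from ordinary equality would drop the row.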
- isin(*values)[source]
Check if literal value is in list of values.
- Parameters:
values (Any)
- Return type:
- between(lower, upper)[source]
Check if literal value is between lower and upper bounds.
- Parameters:
- Return type:
- astype(data_type)[source]
Cast literal to different data type (alias for cast).
This method is an alias for cast() and matches PySpark’s API.
- Parameters:
data_type (Union[DataType, str]) – The target data type (DataType object or string name).
- Return type:
- Returns:
ColumnOperation representing the cast operation.
Example
>>> F.lit(1).astype("string")
- when(condition, value)[source]
Start a CASE WHEN expression.
- Parameters:
condition (ColumnOperation)
value (Any)
- Return type:
- class sparkless.functions.functions.AggregateFunction(column, function_name, data_type=None, ignorenulls=None)[source]
Bases: object
Base class for aggregate functions.
This class provides the base functionality for all aggregate functions including count, sum, avg, max, min, etc.
- __init__(column, function_name, data_type=None, ignorenulls=None)[source]
Initialize AggregateFunction.
- over(window_spec)[source]
Apply window function over window specification.
- Parameters:
window_spec (Any)
- Return type:
- alias(name)[source]
Create an alias for this aggregate function.
- Parameters:
name (str) – The alias name.
- Return type:
- Returns:
Self for method chaining.
- cast(data_type)[source]
Cast the aggregate function result to a different data type.
- Parameters:
data_type (Union[DataType, str]) – The target data type (DataType instance or string type name).
- Return type:
- Returns:
ColumnOperation representing the cast operation.
Example
>>> F.mean(F.col("value")).cast("string")
- __add__(other)[source]
Addition operation (PySpark-compatible).
- Parameters:
other (Any)
- Return type:
- __sub__(other)[source]
Subtraction operation (PySpark-compatible).
- Parameters:
other (Any)
- Return type:
- __mul__(other)[source]
Multiplication operation (PySpark-compatible).
- Parameters:
other (Any)
- Return type:
- __truediv__(other)[source]
Division operation (PySpark-compatible).
- Parameters:
other (Any)
- Return type:
- __radd__(other)[source]
Reverse addition operation (for 2 + agg_func).
- Parameters:
other (Any)
- Return type:
- __rsub__(other)[source]
Reverse subtraction operation (for 2 - agg_func).
- Parameters:
other (Any)
- Return type:
- __rmul__(other)[source]
Reverse multiplication operation (for 2 * agg_func).
- Parameters:
other (Any)
- Return type:
- __rtruediv__(other)[source]
Reverse division operation (for 2 / agg_func).
- Parameters:
other (Any)
- Return type:
- class sparkless.functions.functions.CaseWhen(column=None, condition=None, value=None)[source]
Bases: object
Represents a CASE WHEN expression.
This class handles complex conditional logic with multiple conditions and default values, similar to SQL CASE WHEN statements.
Initialize CaseWhen.
- Parameters:
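The CASE WHEN behaviour described above (several conditions checked in order, plus a default) can be modeled in a few lines of plain Python. ToyCaseWhen and its evaluate method are hypothetical names for this sketch; conditions are ordinary callables on a row dict rather than column expressions, and None stands in for null when no branch matches and no default is set.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Condition = Callable[[Row], bool]


class ToyCaseWhen:
    """Toy model of a chained CASE WHEN: the first matching branch wins."""

    def __init__(self) -> None:
        self._branches: List[Tuple[Condition, Any]] = []
        self._default: Any = None  # stands in for SQL NULL

    def when(self, condition: Condition, value: Any) -> "ToyCaseWhen":
        self._branches.append((condition, value))
        return self  # returning self allows chaining

    def otherwise(self, value: Any) -> "ToyCaseWhen":
        self._default = value
        return self

    def evaluate(self, row: Row) -> Any:
        # Branches are tested in the order they were added.
        for condition, value in self._branches:
            if condition(row):
                return value
        return self._default


grade = (
    ToyCaseWhen()
    .when(lambda r: r["score"] >= 90, "A")
    .when(lambda r: r["score"] >= 80, "B")
    .otherwise("C")
)
print([grade.evaluate({"score": s}) for s in (95, 85, 10)])
```

A score of 95 satisfies both conditions, but only the first branch is used, which is the ordering guarantee of SQL CASE WHEN.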
- cast(data_type)[source]
Cast the CASE WHEN expression to a different data type.
- Parameters:
data_type (Any) – The target data type (DataType instance or string type name).
- Return type:
- Returns:
ColumnOperation representing the cast operation.
Example
>>> F.when(F.col("value") == "A", F.lit(100)).otherwise(F.lit(200)).cast("long")
- __add__(other)[source]
Addition operation (PySpark-compatible).
- Parameters:
other (Any)
- Return type:
- __sub__(other)[source]
Subtraction operation (PySpark-compatible).
- Parameters:
other (Any)
- Return type:
- __mul__(other)[source]
Multiplication operation (PySpark-compatible).
- Parameters:
other (Any)
- Return type:
- __truediv__(other)[source]
Division operation (PySpark-compatible).
- Parameters:
other (Any)
- Return type:
- __radd__(other)[source]
Reverse addition operation (for 2 + case_when).
- Parameters:
other (Any)
- Return type:
- __rsub__(other)[source]
Reverse subtraction operation (for 2 - case_when).
- Parameters:
other (Any)
- Return type:
- __rmul__(other)[source]
Reverse multiplication operation (for 2 * case_when).
- Parameters:
other (Any)
- Return type:
- __rtruediv__(other)[source]
Reverse division operation (for 2 / case_when).
- Parameters:
other (Any)
- Return type:
- __rmod__(other)[source]
Reverse modulo operation (for 2 % case_when).
- Parameters:
other (Any)
- Return type:
- __or__(other)[source]
Bitwise OR operation (PySpark-compatible).
- Parameters:
other (Any)
- Return type:
- __and__(other)[source]
Bitwise AND operation (PySpark-compatible).
- Parameters:
other (Any)
- Return type:
- class sparkless.functions.functions.WindowFunction(function, window_spec)[source]
Bases: object
Represents a window function.
This class handles window functions like row_number(), rank(), etc. that operate over a window specification.
- Parameters:
function (Any) – The window function (e.g., row_number(), rank()).
window_spec (WindowSpec) – The window specification.
- __init__(function, window_spec)[source]
Initialize WindowFunction.
- alias(name)[source]
Create an alias for this window function.
- Parameters:
name (str) – The alias name.
- Return type:
- Returns:
Self for method chaining.
- cast(data_type)[source]
Cast the window function result to a different data type.
- Parameters:
data_type (Any) – The target data type (DataType instance or string type name).
- Return type:
- Returns:
ColumnOperation representing the cast operation.
Example
>>> F.row_number().over(window_spec).cast("long")
- __mul__(other)[source]
Multiply window function result by a value.
- Parameters:
other (Any) – The value to multiply by.
- Return type:
- Returns:
ColumnOperation representing the multiplication.
Example
>>> F.percent_rank().over(window) * 100
- __rmul__(other)[source]
Reverse multiply (e.g., 100 * window_func).
- Parameters:
other (Any) – The value to multiply.
- Return type:
- Returns:
ColumnOperation representing the multiplication.
Example
>>> 100 * F.percent_rank().over(window)
- __add__(other)[source]
Add a value to window function result.
- Parameters:
other (Any) – The value to add.
- Return type:
- Returns:
ColumnOperation representing the addition.
Example
>>> F.row_number().over(window) + 1
- __radd__(other)[source]
Reverse add (e.g., 1 + window_func).
- Parameters:
other (Any) – The value to add.
- Return type:
- Returns:
ColumnOperation representing the addition.
Example
>>> 1 + F.row_number().over(window)
- __sub__(other)[source]
Subtract a value from window function result.
- Parameters:
other (Any) – The value to subtract.
- Return type:
- Returns:
ColumnOperation representing the subtraction.
Example
>>> F.row_number().over(window) - 1
- __rsub__(other)[source]
Reverse subtract (e.g., 10 - window_func).
- Parameters:
other (Any) – The value to subtract from.
- Return type:
- Returns:
ColumnOperation representing the subtraction.
Example
>>> 10 - F.row_number().over(window)
- __truediv__(other)[source]
Divide window function result by a value.
- Parameters:
other (Any) – The value to divide by.
- Return type:
- Returns:
ColumnOperation representing the division.
Example
>>> F.row_number().over(window) / 10
- __rtruediv__(other)[source]
Reverse divide (e.g., 100 / window_func).
- Parameters:
other (Any) – The value to divide.
- Return type:
- Returns:
ColumnOperation representing the division.
Example
>>> 100 / F.row_number().over(window)
- __neg__()[source]
Negate window function result.
- Return type:
- Returns:
ColumnOperation representing the negation.
Example
>>> -F.row_number().over(window)
- __eq__(other)[source]
Equality comparison.
- Parameters:
other (Any) – The value to compare with.
- Return type:
- Returns:
ColumnOperation representing the equality comparison.
Example
>>> F.row_number().over(window) == 1
- __ne__(other)[source]
Inequality comparison.
- Parameters:
other (Any) – The value to compare with.
- Return type:
- Returns:
ColumnOperation representing the inequality comparison.
Example
>>> F.row_number().over(window) != 0
- __lt__(other)[source]
Less than comparison.
- Parameters:
other (Any) – The value to compare with.
- Return type:
- Returns:
ColumnOperation representing the less than comparison.
Example
>>> F.row_number().over(window) < 5
- __le__(other)[source]
Less than or equal comparison.
- Parameters:
other (Any) – The value to compare with.
- Return type:
- Returns:
ColumnOperation representing the less than or equal comparison.
Example
>>> F.row_number().over(window) <= 10
- __gt__(other)[source]
Greater than comparison.
- Parameters:
other (Any) – The value to compare with.
- Return type:
- Returns:
ColumnOperation representing the greater than comparison.
Example
>>> F.row_number().over(window) > 0
- __ge__(other)[source]
Greater than or equal comparison.
- Parameters:
other (Any) – The value to compare with.
- Return type:
- Returns:
ColumnOperation representing the greater than or equal comparison.
Example
>>> F.row_number().over(window) >= 1
- isnull()[source]
Check if window function result is null.
- Return type:
- Returns:
ColumnOperation representing the isnull check.
Example
>>> F.lag("value", 1).over(window).isnull()
- isnotnull()[source]
Check if window function result is not null.
- Return type:
- Returns:
ColumnOperation representing the isnotnull check.
Example
>>> F.lag("value", 1).over(window).isnotnull()
- class sparkless.functions.functions.Functions(*args, **kwargs)[source]
Bases: object
Main functions namespace (F) for Sparkless.
This class provides access to all functions in a PySpark-compatible way.
Instantiating Functions() directly emits a warning.
- static col(name)[source]
Create a column reference.
Note
In PySpark, col() can be called without an active SparkSession. The column expression is evaluated later when used with a DataFrame.
- static lit(value)[source]
Create a literal value.
Note
In PySpark, lit() can be called without an active SparkSession. The literal expression is evaluated later when used with a DataFrame.
- static cast(column, data_type)[source]
Cast column to different data type.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the cast function.
- Raises:
RuntimeError – If no active SparkSession is available
- static char_length(column)[source]
Get character length (alias for length) (PySpark 3.5+).
- Parameters:
- Return type:
- static character_length(column)[source]
Get character length (alias for length) (PySpark 3.5+).
- Parameters:
- Return type:
- static regexp_replace(column, pattern, replacement)[source]
Replace regex pattern.
- Parameters:
- Return type:
- static format_string(format_str, *columns)[source]
Format string using printf-style placeholders.
- Parameters:
- Return type:
- static translate(column, matching_string, replace_string)[source]
Translate characters in a string using a character mapping.
- Parameters:
- Return type:
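A character mapping of this kind can be illustrated in plain Python. The sketch below follows PySpark's documented translate behaviour as an assumption about Sparkless: each character of matching_string maps to the character at the same position in replace_string, and characters with no counterpart are removed. That the first occurrence wins for duplicate matching characters is also an assumption.

```python
from typing import Dict, Optional


def translate(text: str, matching: str, replace: str) -> str:
    """Plain-Python model of a per-character translation table."""
    table: Dict[int, Optional[str]] = {}
    for index, char in enumerate(matching):
        # None tells str.translate to delete the character.
        target = replace[index] if index < len(replace) else None
        # setdefault keeps the first mapping if a character repeats (assumed).
        table.setdefault(ord(char), target)
    return text.translate(table)


print(translate("translate", "rnlt", "123"))
```

Here "t" has no counterpart in the three-character replacement string, so every "t" is dropped while "r", "n", and "l" are replaced.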
- static btrim(column, trim_string=None)[source]
Trim characters from both ends of string.
- Parameters:
- Return type:
- static contains(column, substring)[source]
Check if string contains substring.
- Parameters:
- Return type:
- static left(column, length)[source]
Extract left N characters from string.
- Parameters:
- Return type:
- static right(column, length)[source]
Extract right N characters from string.
- Parameters:
- Return type:
- static startswith(column, substring)[source]
Check if string starts with substring.
- Parameters:
- Return type:
- static endswith(column, substring)[source]
Check if string ends with substring.
- Parameters:
- Return type:
- static rlike(column, pattern)[source]
Regular expression pattern matching.
- Parameters:
- Return type:
- static isin(column, *values)[source]
Check if column value is in list of values.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the isin function.
- static replace(column, old, new)[source]
Replace occurrences of substring in string.
- Parameters:
- Return type:
- static substr(column, start, length=None)[source]
Alias for substring - Extract substring from string.
- static split_part(column, delimiter, part)[source]
Extract part of string split by delimiter.
- Parameters:
- Return type:
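The indexing rules for the part argument are easy to get wrong, so here is a plain-Python sketch of them. It assumes Sparkless follows PySpark's documented split_part semantics: parts are 1-based, negative numbers count from the end, an out-of-range part gives an empty string, and 0 is an error. The ValueError used here is a stand-in; the actual exception type is not stated in this reference.

```python
def split_part(text: str, delimiter: str, part: int) -> str:
    """Plain-Python model of 1-based part extraction."""
    if part == 0:
        raise ValueError("part must not be 0 (parts are 1-based)")
    # An empty delimiter leaves the string unsplit.
    parts = text.split(delimiter) if delimiter else [text]
    index = part - 1 if part > 0 else len(parts) + part
    if 0 <= index < len(parts):
        return parts[index]
    return ""


print(split_part("11.12.13", ".", 3), split_part("11.12.13", ".", -1))
```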
- static position(substring, column)[source]
Find position of substring in string (1-indexed).
- Parameters:
- Return type:
- static octet_length(column)[source]
Get byte length (octet length) of string.
- Parameters:
- Return type:
- static ucase(column)[source]
Alias for upper - Convert string to uppercase.
- Parameters:
- Return type:
- static lcase(column)[source]
Alias for lower - Convert string to lowercase.
- Parameters:
- Return type:
- static elt(n, *columns)[source]
Return element at index from list of columns.
- Parameters:
- Return type:
- static mask(column, upperChar=None, lowerChar=None, digitChar=None, otherChar=None)[source]
Mask sensitive data in a string (PySpark 3.5+).
- static json_array_length(column, path=None)[source]
Get the length of a JSON array (PySpark 3.5+).
- Parameters:
- Return type:
- static json_object_keys(column, path=None)[source]
Get the keys of a JSON object (PySpark 3.5+).
- Parameters:
- Return type:
- static xpath_number(column, path)[source]
Extract number from XML using XPath (PySpark 3.5+).
- Parameters:
- Return type:
- static aes_encrypt(data, key, mode=None, padding=None)[source]
Encrypt data using AES encryption (PySpark 3.5+).
- static aes_decrypt(data, key, mode=None, padding=None)[source]
Decrypt data using AES decryption (PySpark 3.5+).
- static try_aes_decrypt(data, key, mode=None, padding=None)[source]
Null-safe AES decryption - returns NULL on error (PySpark 3.5+).
- static to_str(column)[source]
Convert column to string (all PySpark versions).
- Parameters:
- Return type:
- static regexp_extract_all(column, pattern, idx=0)[source]
Extract all matches of a regex pattern.
- Parameters:
- Return type:
- static array_join(column, delimiter, null_replacement=None)[source]
Join array elements with a delimiter.
- static concat_ws(sep, *cols)[source]
Concatenate multiple columns with separator.
- Parameters:
- Return type:
- static regexp_extract(column, pattern, idx=0)[source]
Extract specific group matched by regex.
- Parameters:
- Return type:
- static substring_index(column, delim, count)[source]
Returns substring before/after count occurrences of delimiter.
- Parameters:
- Return type:
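The "before/after" wording is easier to follow with a model. The sketch below is plain Python and assumes PySpark's documented semantics: a positive count keeps everything to the left of the count-th delimiter (counting from the left), a negative count keeps everything to the right of the count-th delimiter (counting from the right), and a count of 0 gives an empty string.

```python
def substring_index(text: str, delim: str, count: int) -> str:
    """Plain-Python model of substring_index."""
    if count == 0 or not delim:
        return ""
    parts = text.split(delim)
    if count > 0:
        # Keep the first `count` pieces (the whole string if there are fewer).
        return delim.join(parts[:count])
    # Negative count: keep the last `abs(count)` pieces.
    return delim.join(parts[count:])


print(substring_index("www.apache.org", ".", 2))
print(substring_index("www.apache.org", ".", -2))
```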
- static format_number(column, d)[source]
Format number with d decimal places and thousands separator.
- Parameters:
- Return type:
- static instr(column, substr)[source]
Locate position of first occurrence of substr.
- Parameters:
- Return type:
- static locate(substr, column, pos=1)[source]
Locate position of substr starting from pos.
- Parameters:
- Return type:
- static lpad(column, len, pad)[source]
Left-pad string to length len with pad string.
- Parameters:
- Return type:
- static rpad(column, len, pad)[source]
Right-pad string to length len with pad string.
- Parameters:
- Return type:
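Padding to a fixed length has one non-obvious case: an input longer than len. The plain-Python sketch below assumes PySpark's behaviour, where such an input is truncated to len characters and the pad string is repeated and cut as needed. It also assumes a non-empty pad string.

```python
def _fill(pad: str, width: int) -> str:
    """Repeat `pad` and cut it to exactly `width` characters."""
    return (pad * width)[:width] if pad else ""


def lpad(text: str, length: int, pad: str) -> str:
    """Left-pad to `length`; longer inputs are truncated (assumed behaviour)."""
    if length <= len(text):
        return text[: max(length, 0)]
    return _fill(pad, length - len(text)) + text


def rpad(text: str, length: int, pad: str) -> str:
    """Right-pad to `length`; longer inputs are truncated (assumed behaviour)."""
    if length <= len(text):
        return text[: max(length, 0)]
    return text + _fill(pad, length - len(text))


print(lpad("hi", 5, "??"), rpad("hi", 5, "ab"), lpad("hello", 3, "*"))
```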
- static levenshtein(left, right)[source]
Compute Levenshtein distance between two strings.
- Parameters:
- Return type:
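For reference, Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The following is a standard two-row dynamic-programming sketch in plain Python; it illustrates the metric itself and makes no claim about how Sparkless computes it.

```python
def levenshtein(left: str, right: str) -> int:
    """Edit distance using two rows of the dynamic-programming table."""
    # previous[j] = distance between the processed prefix of `left` and right[:j]
    previous = list(range(len(right) + 1))
    for i, left_char in enumerate(left, start=1):
        current = [i]  # turning left[:i] into "" takes i deletions
        for j, right_char in enumerate(right, start=1):
            substitution = 0 if left_char == right_char else 1
            current.append(
                min(
                    previous[j] + 1,  # delete from left
                    current[j - 1] + 1,  # insert into left
                    previous[j - 1] + substitution,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```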
- static xxhash64(*cols)[source]
Compute xxHash64 value (all PySpark versions).
- Parameters:
- Return type:
- static conv(column, from_base, to_base)[source]
Convert number between bases.
- Parameters:
- Return type:
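Base conversion can be sketched in plain Python as "parse in from_base, then emit digits in to_base". The model below is deliberately limited: it handles only non-negative inputs and bases 2 to 36, and it emits upper-case digits as PySpark does. Negative numbers and negative target bases are not modeled because their handling is not described here.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def conv(number: str, from_base: int, to_base: int) -> str:
    """Convert a non-negative number string between bases 2..36."""
    if not (2 <= from_base <= 36 and 2 <= to_base <= 36):
        raise ValueError("bases must be between 2 and 36 in this sketch")
    value = int(number, from_base)  # parsing is case-insensitive
    if value < 0:
        raise ValueError("negative inputs are not modeled in this sketch")
    if value == 0:
        return "0"
    digits = []
    while value:
        value, remainder = divmod(value, to_base)
        digits.append(DIGITS[remainder])
    return "".join(reversed(digits))


print(conv("100", 2, 10), conv("255", 10, 16), conv("ff", 16, 2))
```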
- static ceiling(column)[source]
Alias for ceil - Round up to nearest integer.
- Parameters:
- Return type:
- static log(base, column=None)[source]
Logarithm.
PySpark signature: log(base, column) or log(column) for natural log.
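The argument order is the surprising part of this signature: with two arguments the base comes first, and with one argument that argument is the value. A plain-Python sketch of the dispatch, operating on floats rather than columns (the parameter name value is a stand-in for column):

```python
import math
from typing import Optional


def log(base: float, value: Optional[float] = None) -> float:
    """Model of log(base, column) / log(column) on plain floats."""
    if value is None:
        # Single-argument form: the argument is the value, base is e.
        return math.log(base)
    return math.log(value, base)


print(log(math.e), log(2.0, 8.0), log(10.0, 1000.0))
```

Floating-point results may differ from the exact mathematical value in the last digits, so comparisons should use a tolerance.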
- static positive(column)[source]
Return positive value (identity function).
- Parameters:
- Return type:
- static rand(seed=None)[source]
Generate random column with uniform distribution over [0.0, 1.0).
- Parameters:
- Return type:
- static randn(seed=None)[source]
Generate random column with standard normal distribution.
- Parameters:
- Return type:
- static rint(column)[source]
Round to nearest integer using banker’s rounding.
- Parameters:
- Return type:
- static bround(column, scale=0)[source]
Round using HALF_EVEN rounding mode.
- Parameters:
- Return type:
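Banker's rounding (HALF_EVEN), used by rint and bround, sends a value exactly halfway between two candidates to the one whose last digit is even. The sketch below uses Python's decimal module to show the rule next to HALF_UP. Inputs are passed as strings so that the ties are exact; the helper names are illustrative only.

```python
from decimal import ROUND_HALF_EVEN, ROUND_HALF_UP, Decimal


def bround(value: str, scale: int = 0) -> Decimal:
    """Round to `scale` decimal places, ties going to the even digit."""
    quantum = Decimal(1).scaleb(-scale)  # scale=1 gives Decimal("0.1")
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_EVEN)


def round_half_up(value: str, scale: int = 0) -> Decimal:
    """Conventional rounding, shown only for comparison."""
    quantum = Decimal(1).scaleb(-scale)
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)


for text in ("0.5", "1.5", "2.5", "3.5"):
    print(text, bround(text), round_half_up(text))
```

Only exact ties differ between the two modes; a value such as 2.51 rounds to 3 under both.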
- static width_bucket(value, min_value, max_value, num_buckets)[source]
Compute histogram bucket number for value (PySpark 3.5+).
- static count_distinct(column)[source]
Alias for countDistinct - Count distinct values.
- Parameters:
- Return type:
- static percentile_approx(column, percentage, accuracy=10000)[source]
Approximate percentile.
- Parameters:
- Return type:
- static covar_samp(column1, column2)[source]
Sample covariance between two columns.
- Parameters:
- Return type:
- static approx_count_distinct(column, rsd=None)[source]
Approximate count of distinct elements.
- Parameters:
- Return type:
- static percentile(column, percentage)[source]
Exact percentile (PySpark 3.5+).
- Parameters:
- Return type:
- static approx_percentile(column, percentage, accuracy=10000)[source]
Approximate percentile (PySpark 3.5+).
- static max_by(column, ord)[source]
Value with max of ord column (PySpark 3.1+).
- Parameters:
- Return type:
- static min_by(column, ord)[source]
Value with min of ord column (PySpark 3.1+).
- Parameters:
- Return type:
- static count_if(column)[source]
Count where condition is true (PySpark 3.1+).
- Parameters:
- Return type:
- static any_value(column)[source]
Return any non-null value (PySpark 3.1+).
- Parameters:
- Return type:
- static current_timestamp()[source]
Current timestamp.
- Raises:
RuntimeError – If no active SparkSession is available
- Return type:
- static current_date()[source]
Current date.
- Raises:
RuntimeError – If no active SparkSession is available
- Return type:
- static version()[source]
Return Spark version string (PySpark 3.0+).
- Return type:
- Returns:
Literal with sparkless version
- static date_from_unix_date(days)[source]
Convert unix date (days since epoch) to date (PySpark 3.5+).
- Parameters:
- Return type:
- static to_timestamp_ltz(timestamp_str, format=None)[source]
Convert string to timestamp with local timezone (PySpark 3.5+).
- Parameters:
- Return type:
- static to_timestamp_ntz(timestamp_str, format=None)[source]
Convert string to timestamp with no timezone (PySpark 3.5+).
- Parameters:
- Return type:
- static when(condition, value=None)[source]
Start CASE WHEN expression.
- Raises:
RuntimeError – If no active SparkSession is available
- Parameters:
- Return type:
- static case_when(*conditions, else_value=None)[source]
Create CASE WHEN expression with multiple conditions.
- static expr(expression)[source]
Parse SQL expression into a column.
- Parameters:
expression (str) – SQL expression string (e.g., "id IS NOT NULL", "age > 18"). Must use SQL syntax, not Python expressions.
- Return type:
Union[ColumnOperation, Column, CaseWhen, Literal]
- Returns:
ColumnOperation for the expression.
- Raises:
RuntimeError – If no active SparkSession is available
ParseException – If SQL syntax is invalid
- static months_between(column1, column2)[source]
Calculate months between two dates.
- Parameters:
- Return type:
- static date_format(column, format)[source]
Format date/timestamp as string.
- Parameters:
- Return type:
- static date_trunc(format, timestamp)[source]
Truncate timestamp to specified unit.
- Parameters:
- Return type:
- static date_diff(end, start)[source]
Alias for datediff - Returns number of days between two dates.
- Parameters:
- Return type:
- static unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')[source]
Convert timestamp to Unix timestamp.
- Parameters:
- Return type:
- static next_day(date, dayOfWeek)[source]
First date later than date on specified day of week.
- Parameters:
- Return type:
- static timestamp_seconds(col)[source]
Convert seconds since epoch to timestamp (PySpark 3.1+).
- Parameters:
- Return type:
- static weekday(col)[source]
Day of week as integer (0=Monday, 6=Sunday) (PySpark 3.5+).
- Parameters:
- Return type:
- static extract(field, source)[source]
Extract field from date/timestamp (PySpark 3.5+).
- Parameters:
- Return type:
- static raise_error(msg)[source]
Raise an error with the specified message (PySpark 3.1+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the raise_error function
- static from_unixtime(column, format='yyyy-MM-dd HH:mm:ss')[source]
Convert unix timestamp to string.
- Parameters:
- Return type:
- static nvl(column, default_value)[source]
Return default if null. PySpark uses coalesce internally.
- Parameters:
- Return type:
- static nvl2(column, value_if_not_null, value_if_null)[source]
Return value based on null check. PySpark uses when/otherwise internally.
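The difference between nvl and nvl2 can be shown with a plain-Python model in which None stands in for null. The functions operate on single values rather than columns and are illustrative only.

```python
from typing import Any


def nvl(value: Any, default_value: Any) -> Any:
    """Return `default_value` only when `value` is null (like coalesce)."""
    return default_value if value is None else value


def nvl2(value: Any, value_if_not_null: Any, value_if_null: Any) -> Any:
    """Choose between two results depending on whether `value` is null."""
    return value_if_null if value is None else value_if_not_null


print(nvl(None, 0), nvl(0, 5), nvl2("x", "present", "missing"))
```

Note that the test is for null, not for falsiness: 0 and the empty string are kept by nvl.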
- static row_number()[source]
Row number window function.
- Raises:
RuntimeError – If no active SparkSession is available
- Return type:
- static rank()[source]
Rank window function.
- Raises:
RuntimeError – If no active SparkSession is available
- Return type:
- static dense_rank()[source]
Dense rank window function.
- Raises:
RuntimeError – If no active SparkSession is available
- Return type:
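The three ranking functions differ only in how they treat ties. A plain-Python sketch over an already ordered partition makes the difference visible; the function name ranking is hypothetical, and the input is assumed to be sorted by the window's ordering.

```python
from typing import Any, List, Sequence, Tuple


def ranking(sorted_values: Sequence[Any]) -> List[Tuple[int, int, int]]:
    """Return (row_number, rank, dense_rank) for each ordered value."""
    results: List[Tuple[int, int, int]] = []
    rank = 0
    dense_rank = 0
    previous: Any = None
    for position, value in enumerate(sorted_values, start=1):
        if position == 1 or value != previous:
            rank = position  # rank jumps to the row position, leaving gaps
            dense_rank += 1  # dense_rank increases by one, leaving no gaps
        previous = value
        results.append((position, rank, dense_rank))
    return results


for row in ranking([10, 20, 20, 30]):
    print(row)
```

With the tie on 20, row_number keeps counting (2, 3), rank repeats 2 and then skips to 4, and dense_rank repeats 2 and continues with 3.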
- static lag(column, offset=1, default=None)[source]
Lag window function.
- Parameters:
- Raises:
RuntimeError – If no active SparkSession is available
- Return type:
- static lead(column, offset=1, default=None)[source]
Lead window function.
- Parameters:
- Raises:
RuntimeError – If no active SparkSession is available
- Return type:
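lag and lead look backward and forward by offset rows within the ordered partition, returning default where no such row exists. A plain-Python model over a list (one partition, already ordered) is sketched below; it operates on values rather than columns.

```python
from typing import Any, List, Sequence


def lag(values: Sequence[Any], offset: int = 1, default: Any = None) -> List[Any]:
    """For each row, the value `offset` rows earlier, else `default`."""
    size = len(values)
    return [
        values[i - offset] if 0 <= i - offset < size else default
        for i in range(size)
    ]


def lead(values: Sequence[Any], offset: int = 1, default: Any = None) -> List[Any]:
    """For each row, the value `offset` rows later, else `default`."""
    return lag(values, -offset, default)


print(lag([1, 2, 3]), lead([1, 2, 3]), lag([1, 2, 3], 2, 0))
```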
- static nth_value(column, n)[source]
Nth value window function.
- Raises:
RuntimeError – If no active SparkSession is available
- Parameters:
- Return type:
- static ntile(n)[source]
NTILE window function.
- Raises:
RuntimeError – If no active SparkSession is available
- Parameters:
n (int)
- Return type:
- static cume_dist()[source]
Cumulative distribution window function.
- Raises:
RuntimeError – If no active SparkSession is available
- Return type:
- static percent_rank()[source]
Percent rank window function.
- Raises:
RuntimeError – If no active SparkSession is available
- Return type:
- static first_value(column)[source]
First value window function.
- Raises:
RuntimeError – If no active SparkSession is available
- Parameters:
- Return type:
- static last_value(column)[source]
Last value window function.
- Raises:
RuntimeError – If no active SparkSession is available
- Parameters:
- Return type:
- static array_repeat(col, count)[source]
Repeat value to create array (PySpark 3.0+).
- Parameters:
- Return type:
- static sort_array(col, asc=True)[source]
Sort array elements (PySpark 3.0+).
- Parameters:
- Return type:
- static cardinality(col)[source]
Return size of array or map (PySpark 3.5+).
- Parameters:
- Return type:
- static array_distinct(column)[source]
Remove duplicate elements from array.
- Parameters:
- Return type:
- static array_intersect(column1, column2)[source]
Intersection of two arrays.
- Parameters:
- Return type:
- static array_except(column1, column2)[source]
Elements in first array but not second.
- Parameters:
- Return type:
- static array_position(column, value)[source]
Position of element in array.
- Parameters:
- Return type:
- static array_remove(column, value)[source]
Remove all occurrences of element from array.
- Parameters:
- Return type:
- static aggregate(column, initial_value, merge, finish=None)[source]
Aggregate array elements to single value.
- static array_insert(column, pos, value)[source]
Insert element at position.
- Parameters:
- Return type:
- static arrays_overlap(column1, column2)[source]
Check if arrays have common elements.
- Parameters:
- Return type:
- static array_contains(column, value)[source]
Check if array contains value.
- Parameters:
- Return type:
- static explode(column)[source]
Returns a new row for each element in array or map.
- Parameters:
- Return type:
- static reverse(column)[source]
Reverse string or array elements. Defaults to string reverse.
- Parameters:
- Return type:
- static explode_outer(column)[source]
Explode array including null/empty arrays.
- Parameters:
- Return type:
- static posexplode_outer(column)[source]
Explode array with position including null/empty.
- Parameters:
- Return type:
- static map_entries(column)[source]
Get key-value pairs as array of structs.
- Parameters:
- Return type:
- static map_from_arrays(keys, values)[source]
Create map from key and value arrays.
- Parameters:
- Return type:
- static named_struct(*cols)[source]
Create a struct column with named fields.
- Parameters:
*cols (Any) – Alternating field names (strings) and column values.
- Return type:
- static getbit(column, pos)[source]
Get bit at position (alias for bit_get) (PySpark 3.5+).
- Parameters:
- Return type:
- static bitmap_bit_position(column)[source]
Get the bit position in a bitmap (PySpark 3.5+).
- Parameters:
- Return type:
- static bitmap_bucket_number(column)[source]
Get the bucket number in a bitmap (PySpark 3.5+).
- Parameters:
- Return type:
- static bitmap_construct_agg(column)[source]
Aggregate function - construct bitmap from values (PySpark 3.5+).
- Parameters:
- Return type:
- static bitmap_count(column)[source]
Count the number of set bits in a bitmap (PySpark 3.5+).
- Parameters:
- Return type:
- static bitmap_or_agg(column)[source]
Aggregate function - bitwise OR of bitmaps (PySpark 3.5+).
- Parameters:
- Return type:
- static convert_timezone(sourceTz, targetTz, sourceTs)[source]
Convert timestamp between timezones.
- Parameters:
- Return type:
- static current_timezone()[source]
Get current timezone.
- Raises:
RuntimeError – If no active SparkSession is available
- Return type:
- static from_utc_timestamp(ts, tz)[source]
Convert UTC timestamp to timezone.
- Parameters:
- Return type:
- static assert_true(condition)[source]
Assert condition is true.
- Parameters:
condition (Union[Column, ColumnOperation])
- Return type:
- static ifnull(col1, col2)[source]
Return col2 if col1 is null, otherwise col1 (PySpark 3.5+).
- Parameters:
- Return type:
- static nullif(col1, col2)[source]
Return null if col1 equals col2, otherwise col1 (PySpark 3.5+).
- Parameters:
- Return type:
- static try_subtract(left, right)[source]
Null-safe subtraction - returns NULL on error (PySpark 3.5+).
- static try_multiply(left, right)[source]
Null-safe multiplication - returns NULL on error (PySpark 3.5+).
- static try_sum(column)[source]
Null-safe sum aggregate - returns NULL on error (PySpark 3.5+).
- Parameters:
- Return type:
- static try_avg(column)[source]
Null-safe average aggregate - returns NULL on error (PySpark 3.5+).
- Parameters:
- Return type:
- static try_element_at(column, index)[source]
Null-safe element_at - returns NULL on error (PySpark 3.5+).
- static try_to_binary(column, format=None)[source]
Null-safe to_binary - returns NULL on error (PySpark 3.5+).
- Parameters:
- Return type:
- static try_to_number(column, format=None)[source]
Null-safe to_number - returns NULL on error (PySpark 3.5+).
- Parameters:
- Return type:
- static try_to_timestamp(column, format=None)[source]
Null-safe to_timestamp - returns NULL on error (PySpark 3.5+).
- Parameters:
- Return type:
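The try_* family shares one idea: an operation that would otherwise raise yields NULL instead. A rough plain-Python sketch of that pattern follows; `null_on_error` and the scalar `try_to_number` here are hypothetical stand-ins, not the library's code.

```python
from functools import wraps
from typing import Any, Callable, Optional


def null_on_error(func: Callable[..., Any]) -> Callable[..., Optional[Any]]:
    """Wrap func so that null inputs and conversion errors produce None."""
    @wraps(func)
    def wrapper(*args: Any) -> Optional[Any]:
        if any(arg is None for arg in args):
            return None  # null in, null out
        try:
            return func(*args)
        except (ArithmeticError, ValueError, TypeError):
            return None  # swallow the error and return null
    return wrapper


@null_on_error
def try_to_number(text: str) -> float:
    return float(text)
```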
- static to_xml(col)[source]
Convert struct to XML string.
- Parameters:
col (Union[Column, ColumnOperation])
- Return type:
- static xpath_boolean(xml, path)[source]
Extract boolean from XML using XPath.
- Parameters:
- Return type:
- static xpath_double(xml, path)[source]
Extract double from XML using XPath.
- Parameters:
- Return type:
- static xpath_string(xml, path)[source]
Extract string from XML using XPath.
- Parameters:
- Return type:
- static json_tuple(column, *fields)[source]
Extract multiple fields from JSON.
- Parameters:
- Return type:
- static schema_of_json(json_string)[source]
Infer schema from JSON string.
- Parameters:
json_string (str)
- Return type:
- static schema_of_csv(csv_string)[source]
Infer schema from CSV string.
- Parameters:
csv_string (str)
- Return type:
- static udf(f=None, returnType=None)[source]
Create a user-defined function (all PySpark versions).
- Parameters:
- Return type:
- Returns:
Wrapped function that can be used in DataFrame operations
Example
>>> from sparkless.sql import SparkSession, functions as F
>>> from sparkless.spark_types import IntegerType
>>> spark = SparkSession("test")
>>> square = F.udf(lambda x: x * x, IntegerType())
>>> df = spark.createDataFrame([{"value": 5}])
>>> df.select(square("value").alias("squared")).show()
>>> # Decorator pattern:
>>> @F.udf(IntegerType())
... def square(x):
...     return x * x
>>> df.select(square("value")).show()
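Both call styles above work because the first argument may be either the function or the return type. The sketch below shows one common way to support that with a `(f=None, returnType=None)` signature. It is an assumption about the general pattern, not the sparkless source, and it uses plain strings as hypothetical return types.

```python
import functools
from typing import Any, Callable, Optional, Union


def _wrap(func: Callable[..., Any], return_type: Optional[str]) -> Callable[..., Any]:
    @functools.wraps(func)
    def wrapper(*args: Any) -> Any:
        return func(*args)
    wrapper.return_type = return_type  # remember the declared type
    return wrapper


def udf(f: Union[Callable[..., Any], str, None] = None,
        returnType: Optional[str] = None) -> Callable[..., Any]:
    if f is None or isinstance(f, str):
        # Decorator form: @udf("int") or @udf(returnType="int").
        resolved = f if isinstance(f, str) else returnType
        return lambda func: _wrap(func, resolved)
    # Direct form: udf(lambda x: ..., "int").
    return _wrap(f, returnType)


square = udf(lambda x: x * x, "int")


@udf("int")
def cube(x):
    return x * x * x
```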
- static pandas_udf(f=None, returnType=None, functionType=None)[source]
Create a Pandas UDF (vectorized UDF) (all PySpark versions).
Pandas UDFs are user-defined functions that execute vectorized operations using Pandas Series/DataFrame, providing better performance than row-at-a-time UDFs.
- Parameters:
- Return type:
- Returns:
Wrapped function that can be used in DataFrame operations
Example
>>> from sparkless.sql import SparkSession, functions as F
>>> from sparkless.spark_types import IntegerType
>>> spark = SparkSession("test")
>>> @F.pandas_udf(IntegerType())
... def multiply_by_two(s):
...     return s * 2
>>> df = spark.createDataFrame([{"value": 5}])
>>> df.select(multiply_by_two("value").alias("doubled")).show()
- static window(timeColumn, windowDuration, slideDuration=None, startTime=None)[source]
Create time-based window for grouping operations (all PySpark versions).
- Parameters:
timeColumn (Union[Column, str]) – Timestamp column to window
windowDuration (str) – Duration string (e.g., “10 seconds”, “1 minute”, “2 hours”)
slideDuration (Optional[str]) – Slide duration for sliding windows (defaults to windowDuration)
startTime (Optional[str]) – Offset for window alignment (e.g., “0 seconds”)
- Return type:
- Returns:
Column representing window struct with start and end times
Example
>>> df.groupBy(F.window("timestamp", "10 minutes")).count()
>>> df.groupBy(F.window("timestamp", "10 minutes", "5 minutes")).agg(F.sum("value"))
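How a timestamp maps to tumbling or sliding windows can be sketched with integer seconds. This illustrates the general bucketing arithmetic under the assumption of half-open windows `[start, end)`. `assign_windows` is a hypothetical helper, not part of sparkless.

```python
from typing import List, Optional, Tuple


def assign_windows(ts: int, window: int, slide: Optional[int] = None,
                   start: int = 0) -> List[Tuple[int, int]]:
    """Return every [begin, end) window (in seconds) that contains ts."""
    slide = window if slide is None else slide  # tumbling when no slide is given
    # Latest window start at or before ts, aligned to the start offset.
    begin = ts - ((ts - start) % slide)
    windows = []
    while begin > ts - window:  # the window still covers ts
        windows.append((begin, begin + window))
        begin -= slide
    return sorted(windows)
```

With a slide shorter than the window, one timestamp lands in several overlapping windows, which is why sliding-window aggregations can count a row more than once.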
- static window_time(windowColumn)[source]
Extract window start time from window column (PySpark 3.4+).
- Parameters:
windowColumn (Union[Column, str]) – Window column to extract time from
- Return type:
- Returns:
Column operation representing window start timestamp
Example
>>> df.groupBy(F.window("timestamp", "1 hour")).agg(
...     F.window_time(F.col("window")).alias("window_start")
... )
- static ilike(column, pattern)[source]
Case-insensitive LIKE pattern matching.
- Parameters:
- Return type:
- static find_in_set(column, str_list)[source]
Find position of value in comma-separated string list.
- Parameters:
- Return type:
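A plain-Python sketch of the lookup, assuming the Spark SQL convention of a 1-based result with 0 meaning "not found" (and 0 when the searched value itself contains a comma). It operates on strings rather than columns and is not the sparkless implementation.

```python
def find_in_set(value: str, str_list: str) -> int:
    """Return the 1-based position of value in a comma-separated list, or 0."""
    if "," in value:
        return 0  # a value containing the delimiter can never match one item
    items = str_list.split(",")
    return items.index(value) + 1 if value in items else 0
```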
- static regexp_count(column, pattern)[source]
Count occurrences of regex pattern in string.
- Parameters:
- Return type:
- static regexp_like(column, pattern)[source]
Regex pattern matching (similar to rlike).
- Parameters:
- Return type:
- static regexp_substr(column, pattern, pos=1, occurrence=1)[source]
Extract substring matching regex pattern.
- static regexp_instr(column, pattern, pos=1, occurrence=1)[source]
Find position of regex pattern match.
- static regexp(column, pattern)[source]
Alias for rlike - regex pattern matching.
- Parameters:
- Return type:
- static printf(format_str, *columns)[source]
Formatted string (like sprintf).
- Parameters:
- Return type:
- static to_char(column, format=None)[source]
Convert number/date to character string.
- Parameters:
- Return type:
- static shiftRightUnsigned(column, num_bits)[source]
Deprecated alias for shiftrightunsigned (PySpark 3.0-3.1).
- static datepart(date_part, date)[source]
SQL Server style date part extraction.
- Parameters:
- Return type:
- static make_timestamp(year, month, day, hour=0, minute=0, second=0)[source]
Create timestamp from components.
- static make_timestamp_ltz(year, month, day, hour=0, minute=0, second=0, timezone=None)[source]
Create timestamp with local timezone.
- static make_timestamp_ntz(year, month, day, hour=0, minute=0, second=0)[source]
Create timestamp with no timezone.
- static make_interval(years=0, months=0, weeks=0, days=0, hours=0, mins=0, secs=0)[source]
Create interval from components.
- static to_unix_timestamp(column, format=None)[source]
Convert to unix timestamp.
- Parameters:
- Return type:
- static unix_millis(column)[source]
Convert timestamp to unix milliseconds.
- Parameters:
- Return type:
- static unix_micros(column)[source]
Convert timestamp to unix microseconds.
- Parameters:
- Return type:
- static timestamp_millis(column)[source]
Create timestamp from unix milliseconds.
- Parameters:
- Return type:
- static timestamp_micros(column)[source]
Create timestamp from unix microseconds.
- Parameters:
- Return type:
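The four epoch conversions above are inverses of each other in pairs. The following standard-library sketch shows the arithmetic on timezone-aware `datetime` values; it uses scalar helpers with the same names purely for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def unix_millis(ts: datetime) -> int:
    return (ts - EPOCH) // timedelta(milliseconds=1)


def unix_micros(ts: datetime) -> int:
    return (ts - EPOCH) // timedelta(microseconds=1)


def timestamp_millis(millis: int) -> datetime:
    return EPOCH + timedelta(milliseconds=millis)


def timestamp_micros(micros: int) -> datetime:
    return EPOCH + timedelta(microseconds=micros)
```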
- static inline_outer(col)[source]
Explode array of structs into rows (outer join style).
- Parameters:
- Return type:
- static str_to_map(column, pair_delim=',', key_value_delim=':')[source]
Convert string to map using delimiters.
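A minimal sketch of the parsing, using the default delimiters from the signature. It treats the delimiters as literal strings and maps a key with no value to None. Both choices are assumptions for illustration rather than documented sparkless behavior.

```python
from typing import Dict, Optional


def str_to_map(text: str, pair_delim: str = ",",
               key_value_delim: str = ":") -> Dict[str, Optional[str]]:
    result: Dict[str, Optional[str]] = {}
    for pair in text.split(pair_delim):
        key, sep, value = pair.partition(key_value_delim)
        result[key] = value if sep else None  # no delimiter means no value
    return result
```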
- static approxCountDistinct(*cols)[source]
Deprecated alias for approx_count_distinct (all PySpark versions).
- Parameters:
- Return type:
- static sumDistinct(column)[source]
Deprecated alias for sum_distinct (all PySpark versions).
- Parameters:
- Return type:
- static bitwiseNOT(column)[source]
Deprecated alias for bitwise_not (all PySpark versions).
- Parameters:
- Return type:
- static toDegrees(column)[source]
Deprecated alias for degrees (all PySpark versions).
- Parameters:
- Return type:
- __init__(*args, **kwargs)
Warn when Functions() is instantiated directly.
- static toRadians(column)[source]
Deprecated alias for radians (all PySpark versions).
- Parameters:
- Return type:
- static call_function(function_name, *columns)[source]
Dynamically invoke a function from the sparkless functions namespace.
- Parameters:
- Return type:
- Returns:
Whatever the resolved function returns (typically a ColumnOperation).
- Raises:
PySparkValueError – If the requested function is not registered.
PySparkTypeError – If the supplied arguments are incompatible with the resolved function signature.
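Dynamic dispatch by name usually reduces to a registry lookup followed by a call. The sketch below shows that shape with a hypothetical `REGISTRY` dict and built-in `ValueError` standing in for PySparkValueError; it is not the sparkless implementation.

```python
from typing import Any, Callable, Dict

# Hypothetical registry; the real namespace is far larger.
REGISTRY: Dict[str, Callable[..., Any]] = {"upper": str.upper, "abs": abs}


def call_function(function_name: str, *args: Any) -> Any:
    try:
        func = REGISTRY[function_name]
    except KeyError:
        raise ValueError(f"unknown function: {function_name!r}") from None
    # Incompatible arguments surface as TypeError from the call itself.
    return func(*args)
```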
- class sparkless.functions.functions.StringFunctions[source]
Bases: object
Collection of string manipulation functions.
- static upper(column)[source]
Convert string to uppercase.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the upper function.
- static lower(column)[source]
Convert string to lowercase.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the lower function.
- static length(column)[source]
Get the length of a string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the length function.
- static char_length(column)[source]
Alias for length() - Get the character length of a string (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the char_length function.
- static character_length(column)[source]
Alias for length() - Get the character length of a string (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the character_length function.
- static trim(column)[source]
Trim whitespace from string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the trim function.
- static ltrim(column)[source]
Trim whitespace from left side of string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the ltrim function.
- static rtrim(column)[source]
Trim whitespace from right side of string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the rtrim function.
- static btrim(column, trim_string=None)[source]
Trim characters from both ends of string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the btrim function.
- static contains(column, substring)[source]
Check if string contains substring.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the contains function.
- static left(column, length)[source]
Extract left N characters from string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the left function.
- static right(column, length)[source]
Extract right N characters from string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the right function.
- static bit_length(column)[source]
Get bit length of string.
- static startswith(column, substring)[source]
Check if string starts with substring.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the startswith function.
- static endswith(column, substring)[source]
Check if string ends with substring.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the endswith function.
- static like(column, pattern)[source]
SQL LIKE pattern matching.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the like function.
- static rlike(column, pattern)[source]
Regular expression pattern matching.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the rlike function.
- static replace(column, old, new)[source]
Replace occurrences of substring in string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the replace function.
- static substr(column, start, length=None)[source]
Alias for substring - Extract substring from string.
- static split_part(column, delimiter, part)[source]
Extract part of string split by delimiter.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the split_part function.
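A string-level sketch of split_part, assuming the Spark SQL conventions: parts are 1-indexed, negative indexes count from the end, an out-of-range index yields an empty string, and index 0 is an error. It is an illustration only and does not handle an empty delimiter.

```python
def split_part(text: str, delimiter: str, part: int) -> str:
    if part == 0:
        raise ValueError("part index is 1-based and must not be 0")
    pieces = text.split(delimiter)
    index = part - 1 if part > 0 else len(pieces) + part
    return pieces[index] if 0 <= index < len(pieces) else ""
```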
- static position(substring, column)[source]
Find position of substring in string (1-indexed).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the position function.
- static octet_length(column)[source]
Get byte length (octet length) of string.
- static char(column)[source]
Convert integer to character.
- static ucase(column)[source]
Alias for upper - Convert string to uppercase.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the ucase function.
- static lcase(column)[source]
Alias for lower - Convert string to lowercase.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the lcase function.
- static elt(n, *columns)[source]
Return element at index from list of columns.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the elt function.
- static regexp_replace(column, pattern, replacement)[source]
Replace regex pattern in string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the regexp_replace function.
- static split(column, delimiter, limit=None)[source]
Split string by delimiter.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the split function.
- static concat(*columns)[source]
Concatenate multiple strings.
- static format_string(format_str, *columns)[source]
Format string using printf-style format string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the format_string function.
- static translate(column, matching_string, replace_string)[source]
Translate characters in string using character mapping.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the translate function.
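Character mapping pairs each character of the matching string with the character at the same position in the replacement string. The sketch below assumes the Spark convention that unmatched trailing characters are deleted, and builds on Python's `str.translate`.

```python
def translate(text: str, matching: str, replace: str) -> str:
    table = {}
    for i, ch in enumerate(matching):
        if ord(ch) in table:
            continue  # the first mapping for a character wins
        # Characters without a counterpart map to None, which deletes them.
        table[ord(ch)] = replace[i] if i < len(replace) else None
    return text.translate(table)
```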
- static ascii(column)[source]
Get ASCII value of first character in string.
- static base64(column)[source]
Encode string to base64.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the base64 function.
- static unbase64(column)[source]
Decode base64 string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the unbase64 function.
- static regexp_extract_all(column, pattern, idx=0)[source]
Extract all matches of a regex pattern.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the regexp_extract_all function.
Example
>>> df.select(F.regexp_extract_all(F.col("text"), r"\d+", 0))
- static array_join(column, delimiter, null_replacement=None)[source]
Join array elements with a delimiter.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the array_join function.
Example
>>> df.select(F.array_join(F.col("tags"), ", "))
>>> df.select(F.array_join(F.col("tags"), "|", "N/A"))
- static reverse(column)[source]
Reverse a string column.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the reverse function.
Example
>>> df.select(F.reverse(F.col("name")))
- static repeat(column, n)[source]
Repeat a string N times.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the repeat function.
Example
>>> df.select(F.repeat(F.col("text"), 3))
- static initcap(column)[source]
Capitalize first letter of each word.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the initcap function.
Example
>>> df.select(F.initcap(F.col("name")))
- static soundex(column)[source]
Soundex encoding for phonetic matching.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the soundex function.
Example
>>> df.select(F.soundex(F.col("name")))
- static parse_url(url, part)[source]
Extract a part from a URL.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the parse_url function.
Example
>>> df.select(F.parse_url(F.col("url"), "HOST"))
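For intuition, the part names correspond roughly to fields of a parsed URL. The sketch below maps a few of them onto `urllib.parse.urlparse`. The set of supported part names in sparkless is not listed here, so the mapping is an assumption based on Spark SQL's parse_url.

```python
from typing import Optional
from urllib.parse import urlparse


def parse_url(url: str, part: str) -> Optional[str]:
    parsed = urlparse(url)
    parts = {
        "PROTOCOL": parsed.scheme,
        "HOST": parsed.hostname,
        "PATH": parsed.path,
        "QUERY": parsed.query or None,
        "REF": parsed.fragment or None,
    }
    return parts.get(part)  # unknown part names yield None in this sketch
```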
- static url_encode(url)[source]
URL-encode a string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the url_encode function.
Example
>>> df.select(F.url_encode(F.col("text")))
- static url_decode(url)[source]
URL-decode a string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the url_decode function.
Example
>>> df.select(F.url_decode(F.col("encoded")))
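The encode/decode pair should round-trip. A standard-library sketch follows, assuming form-style encoding (spaces become `+`) as in Spark SQL; whether sparkless matches this byte for byte is not confirmed here.

```python
from urllib.parse import quote_plus, unquote_plus


def url_encode(text: str) -> str:
    return quote_plus(text)


def url_decode(text: str) -> str:
    return unquote_plus(text)


encoded = url_encode("a b&c")
```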
- static concat_ws(sep, *cols)[source]
Concatenate multiple columns with a separator.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing concat_ws
Example
>>> df.select(F.concat_ws("-", F.col("first"), F.col("last")))
- static regexp_extract(column, pattern, idx=0)[source]
Extract a specific group matched by a regex pattern.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing regexp_extract
Example
>>> df.select(F.regexp_extract(F.col("email"), r"(.+)@(.+)", 1))
>>> df.select(F.regexp_extract(F.col("text"), r"(?<=prefix_)\w+", 0))
Note
Fixed in version 3.23.0 (Issue #228): Added fallback support for regex patterns with lookahead and lookbehind assertions using Python’s re module when Polars native support is unavailable.
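The role of idx can be shown with Python's `re` module, which the note above names as the fallback engine. This scalar sketch assumes the Spark convention of returning an empty string when nothing matches. Java and Python regex dialects differ in places, so treat it as an approximation.

```python
import re


def regexp_extract(text: str, pattern: str, idx: int = 0) -> str:
    match = re.search(pattern, text)
    if match is None:
        return ""  # no match at all
    # idx 0 is the whole match; idx >= 1 selects a capture group.
    return match.group(idx) or ""
```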
- static substring_index(column, delim, count)[source]
Returns substring before/after count occurrences of delimiter.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing substring_index
Example
>>> df.select(F.substring_index(F.col("path"), "/", 2))
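The sign of count decides which side of the string is kept. A string-level sketch, assuming Spark SQL semantics (positive counts keep everything left of the count-th delimiter, negative counts keep everything right of the count-th delimiter from the end):

```python
def substring_index(text: str, delim: str, count: int) -> str:
    if count == 0 or delim == "":
        return ""
    parts = text.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # keep the first `count` pieces
    return delim.join(parts[count:])       # keep the last `-count` pieces
```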
- static format_number(column, d)[source]
Format number with d decimal places and thousands separator.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing format_number
Example
>>> df.select(F.format_number(F.col("amount"), 2))
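The output shape resembles Python's own grouping format specifier. The one-liner below is a sketch of the expected formatting, not the library's implementation; rounding of exact ties may differ between engines.

```python
def format_number(value: float, d: int) -> str:
    # "," inserts thousands separators; ".{d}f" fixes the decimal places.
    return f"{value:,.{d}f}"
```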
- static instr(column, substr)[source]
Locate the position of the first occurrence of substr (1-indexed).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing instr
Example
>>> df.select(F.instr(F.col("text"), "spark"))
- static locate(substr, column, pos=1)[source]
Locate the position of substr starting from pos (1-indexed).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing locate
Example
>>> df.select(F.locate("spark", F.col("text"), 1))
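Note that instr and locate take their arguments in opposite order. Both are 1-indexed and, under the usual Spark convention assumed here, return 0 when the substring is absent. A scalar sketch built on `str.find`:

```python
def instr(text: str, substr: str) -> int:
    return text.find(substr) + 1  # find() returns -1 when absent, giving 0


def locate(substr: str, text: str, pos: int = 1) -> int:
    if pos < 1:
        return 0
    return text.find(substr, pos - 1) + 1  # pos is 1-based
```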
- static lpad(column, len, pad)[source]
Left-pad string column to length len with pad string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing lpad
Example
>>> df.select(F.lpad(F.col("id"), 5, "0"))
- static rpad(column, len, pad)[source]
Right-pad string column to length len with pad string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing rpad
Example
>>> df.select(F.rpad(F.col("id"), 5, "0"))
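Padding also truncates: in Spark SQL, a string longer than len is cut down to len. The sketch below assumes that behavior and a non-negative length. It works on plain strings and is not the sparkless code.

```python
def lpad(text: str, length: int, pad: str) -> str:
    if len(text) >= length or pad == "":
        return text[:length]  # truncate, or nothing to pad with
    fill = (pad * length)[: length - len(text)]
    return fill + text


def rpad(text: str, length: int, pad: str) -> str:
    if len(text) >= length or pad == "":
        return text[:length]
    fill = (pad * length)[: length - len(text)]
    return text + fill
```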
- static levenshtein(left, right)[source]
Compute Levenshtein distance between two strings.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing levenshtein
Example
>>> df.select(F.levenshtein(F.col("word1"), F.col("word2")))
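For reference, the distance is the minimum number of single-character insertions, deletions, and substitutions. A compact dynamic-programming sketch of the standard algorithm (not the sparkless implementation) is:

```python
def levenshtein(left: str, right: str) -> int:
    # previous[j] holds the distance between the processed prefix of `left`
    # and the first j characters of `right`.
    previous = list(range(len(right) + 1))
    for i, lch in enumerate(left, start=1):
        current = [i]
        for j, rch in enumerate(right, start=1):
            cost = 0 if lch == rch else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```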
- static overlay(src, replace, pos, len=-1)[source]
Replace part of a string with another string starting at a position (PySpark 3.0+).
- Parameters:
- Return type:
- Returns:
ColumnOperation for overlay operation
Example
>>> df.select(F.overlay(F.col("text"), F.lit("NEW"), F.lit(5), F.lit(3)))
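The default len=-1 means "replace as many characters as the replacement is long". A string-level sketch under that assumption, with a 1-based pos of at least 1:

```python
def overlay(src: str, replace: str, pos: int, length: int = -1) -> str:
    if length < 0:
        length = len(replace)  # default: overwrite len(replace) characters
    start = pos - 1            # convert the 1-based position to an index
    return src[:start] + replace + src[start + length:]
```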
- static bin(column)[source]
Convert to binary string representation.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing bin
- static hex(column)[source]
Convert to hexadecimal string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing hex
- static unhex(column)[source]
Convert hex string to binary.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing unhex
- static hash(*cols)[source]
Compute hash value of given columns.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing hash
- static xxhash64(*cols)[source]
Compute xxHash64 value of given columns (all PySpark versions).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing xxhash64
- static encode(column, charset)[source]
Encode string to binary using charset.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing encode
- static decode(column, charset)[source]
Decode binary to string using charset.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing decode
- static conv(column, from_base, to_base)[source]
Convert number from one base to another.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing conv
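Base conversion goes through an integer: parse in the source base, then emit digits in the target base. The sketch below handles only non-negative inputs and bases up to 36; how sparkless treats negative numbers is not covered.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def conv(number: str, from_base: int, to_base: int) -> str:
    value = int(number, from_base)
    if value < 0:
        raise ValueError("this sketch only handles non-negative numbers")
    if value == 0:
        return "0"
    digits = []
    while value:
        value, remainder = divmod(value, to_base)
        digits.append(DIGITS[remainder])
    return "".join(reversed(digits))
```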
- static md5(column)[source]
Calculate MD5 hash of string (PySpark 3.0+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing md5 function (returns 32-char hex string)
Example
>>> df.select(F.md5(F.col("text")))
- static sha1(column)[source]
Calculate SHA-1 hash of string (PySpark 3.0+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing sha1 function (returns 40-char hex string)
Example
>>> df.select(F.sha1(F.col("text")))
- static sha2(column, numBits)[source]
Calculate SHA-2 family hash (PySpark 3.0+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing sha2 function (returns hex string)
Example
>>> df.select(F.sha2(F.col("text"), 256))
- static crc32(column)[source]
Calculate CRC32 checksum (PySpark 3.0+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing crc32 function (returns signed 32-bit int)
Example
>>> df.select(F.crc32(F.col("text")))
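The digest sizes quoted above can be checked against Python's standard library. The helper names below are hypothetical, and UTF-8 encoding of the input is an assumption. Note that `zlib.crc32` returns an unsigned value, so a signed result would need an extra conversion.

```python
import hashlib
import zlib


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def sha1_hex(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def sha2_hex(text: str, num_bits: int) -> str:
    algorithms = {224: hashlib.sha224, 256: hashlib.sha256,
                  384: hashlib.sha384, 512: hashlib.sha512}
    return algorithms[num_bits](text.encode("utf-8")).hexdigest()


def crc32_value(text: str) -> int:
    return zlib.crc32(text.encode("utf-8"))  # unsigned, 0 .. 2**32 - 1
```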
- static to_str(column)[source]
Convert column to string representation (all PySpark versions).
- Parameters:
- Return type:
- Returns:
Column operation for string conversion
Example
>>> df.select(F.to_str(F.col("value")))
- static ilike(column, pattern)[source]
Case-insensitive LIKE pattern matching.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the ilike function.
- static find_in_set(column, str_list)[source]
Find position of value in comma-separated string list.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the find_in_set function.
- static regexp_count(column, pattern)[source]
Count occurrences of regex pattern in string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the regexp_count function.
- static regexp_like(column, pattern)[source]
Regex pattern matching (similar to rlike).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the regexp_like function.
- static regexp_substr(column, pattern, pos=1, occurrence=1)[source]
Extract substring matching regex pattern.
- static regexp_instr(column, pattern, pos=1, occurrence=1)[source]
Find position of regex pattern match.
- static regexp(column, pattern)[source]
Alias for rlike - regex pattern matching.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the regexp function.
- static printf(format_str, *columns)[source]
Formatted string (like sprintf).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the printf function.
- static to_char(column, format=None)[source]
Convert number/date to character string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the to_char function.
- static to_varchar(column, length=None)[source]
Convert to varchar type.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the to_varchar function.
- static typeof(column)[source]
Get type of value as string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the typeof function.
- static stack(n, *cols)[source]
Stack multiple columns into rows.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the stack function.
- static sha(column)[source]
Alias for sha1 - Calculate SHA-1 hash of string (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing sha function (returns 40-char hex string).
Example
>>> df.select(F.sha(F.col("text")))
- static mask(column, upperChar=None, lowerChar=None, digitChar=None, otherChar=None)[source]
Mask sensitive data in a string (PySpark 3.5+).
- Parameters:
upperChar (Optional[str]) – Character to use for uppercase letters (default: ‘X’).
lowerChar (Optional[str]) – Character to use for lowercase letters (default: ‘x’).
digitChar (Optional[str]) – Character to use for digits (default: ‘n’).
otherChar (Optional[str]) – Character to use for other characters (default: ‘-’).
- Return type:
- Returns:
ColumnOperation representing the mask function.
Example
>>> df.select(F.mask(F.col("email"), upperChar='U', lowerChar='l', digitChar='d'))
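The per-character classification can be sketched in plain Python, using the defaults listed in the parameter descriptions above. The snake_case parameter names are local to this sketch, and the classification via `str.isupper`, `str.islower`, and `str.isdigit` is an assumption.

```python
def mask(text: str, upper_char: str = "X", lower_char: str = "x",
         digit_char: str = "n", other_char: str = "-") -> str:
    masked = []
    for ch in text:
        if ch.isupper():
            masked.append(upper_char)
        elif ch.islower():
            masked.append(lower_char)
        elif ch.isdigit():
            masked.append(digit_char)
        else:
            masked.append(other_char)
    return "".join(masked)
```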
- static json_array_length(column, path=None)[source]
Get the length of a JSON array (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the json_array_length function.
Example
>>> df.select(F.json_array_length(F.col("json_col"), "$.array"))
- static json_object_keys(column, path=None)[source]
Get the keys of a JSON object (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the json_object_keys function.
Example
>>> df.select(F.json_object_keys(F.col("json_col"), "$.object"))
- static xpath_number(column, path)[source]
Extract number from XML using XPath (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the xpath_number function.
Example
>>> df.select(F.xpath_number(F.col("xml_col"), "/root/value"))
- class sparkless.functions.functions.MathFunctions[source]
Bases: object
Collection of mathematical functions.
- static abs(column)[source]
Get absolute value.
- static positive(column)[source]
Return positive value (identity function).
- static negative(column)[source]
Return negative value.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the negative function.
- static round(column, scale=0)[source]
Round to specified number of decimal places.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the round function.
- static ceil(column)[source]
Round up to nearest integer.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the ceil function.
- static ceiling(column)[source]
Alias for ceil - Round up to nearest integer.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the ceiling function.
- static floor(column)[source]
Round down to nearest integer.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the floor function.
- static sqrt(column)[source]
Get square root.
- static exp(column)[source]
Get exponential (e^x).
- static log(base, column=None)[source]
Get logarithm.
PySpark signature: log(base, column) or log(column) for natural log.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the log function.
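The two call forms differ only in whether a base is supplied. A scalar sketch of that dispatch, mirroring the `(base, column=None)` signature with `math.log`:

```python
import math
from typing import Optional


def log(base: float, value: Optional[float] = None) -> float:
    if value is None:
        # Single-argument form: the lone argument is the value, base is e.
        return math.log(base)
    # Two-argument form: logarithm of `value` in the given base.
    return math.log(value, base)
```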
- static log10(column)[source]
Get base-10 logarithm (PySpark 3.0+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the log10 function.
Example
>>> df.select(F.log10(F.col("value")))
- static log2(column)[source]
Get base-2 logarithm (PySpark 3.0+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the log2 function.
Example
>>> df.select(F.log2(F.col("value")))
- static log1p(column)[source]
Get natural logarithm of (1 + x) (PySpark 3.0+).
Computes ln(1 + x) accurately for small values of x.
- Parameters:
column (Union[Column, str]) – The column to compute log1p of.
- Return type:
- Returns:
ColumnOperation representing the log1p function.
Example
>>> df.select(F.log1p(F.col("value")))
- static expm1(column)[source]
Get exp(x) - 1 (PySpark 3.0+).
Computes e^x - 1 accurately for small values of x.
- Parameters:
column (Union[Column, str]) – The column to compute expm1 of.
- Return type:
- Returns:
ColumnOperation representing the expm1 function.
Example
>>> df.select(F.expm1(F.col("value")))
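Why dedicated functions matter for small x can be demonstrated with Python's `math` module: forming `1 + x` first discards low-order digits of x, while `log1p` and `expm1` avoid that intermediate step.

```python
import math

x = 1e-10

naive_log = math.log(1 + x)      # 1 + x is rounded before the log is taken
accurate_log = math.log1p(x)     # computed without forming 1 + x

naive_exp = math.exp(x) - 1      # subtraction cancels most significant digits
accurate_exp = math.expm1(x)     # computed without the cancellation

# For tiny x, both true results are extremely close to x itself.
log_errors = (abs(naive_log - x), abs(accurate_log - x))
exp_errors = (abs(naive_exp - x), abs(accurate_exp - x))
```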
- static sin(column)[source]
Get sine.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the sin function.
- static cos(column)[source]
Get cosine.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the cos function.
- static tan(column)[source]
Get tangent.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the tan function.
- static sign(column)[source]
Get sign of number (-1, 0, or 1).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the sign function.
- static greatest(*columns)[source]
Get the greatest value among columns.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the greatest function.
- static least(*columns)[source]
Get the least value among columns.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the least function.
- static acosh(col)[source]
Compute inverse hyperbolic cosine (arc hyperbolic cosine).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the acosh function.
Note
Input must be >= 1. Returns NaN for invalid inputs.
- static asinh(col)[source]
Compute inverse hyperbolic sine (arc hyperbolic sine).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the asinh function.
- static atanh(col)[source]
Compute inverse hyperbolic tangent (arc hyperbolic tangent).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the atanh function.
Note
Input must be in range (-1, 1). Returns NaN for invalid inputs.
- static acos(col)[source]
Compute inverse cosine (arc cosine).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the acos function.
- static asin(col)[source]
Compute inverse sine (arc sine).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the asin function.
- static atan(col)[source]
Compute inverse tangent (arc tangent).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the atan function.
- static atan2(y, x)[source]
Compute 2-argument arctangent (PySpark 3.0+).
Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the atan2 function.
Example
>>> df.select(F.atan2(F.col("y"), F.col("x")))
>>> df.select(F.atan2(F.lit(1.0), F.lit(1.0)))  # Returns π/4
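The rectangular-to-polar conversion mentioned above looks like this with Python's `math` module. The argument order (y first) matches the signature, and `to_polar` is a hypothetical helper.

```python
import math
from typing import Tuple


def to_polar(x: float, y: float) -> Tuple[float, float]:
    r = math.hypot(x, y)      # distance from the origin
    theta = math.atan2(y, x)  # angle in radians, in the range [-pi, pi]
    return r, theta
```

Unlike `atan(y / x)`, the two-argument form distinguishes quadrants and handles x = 0.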
- static cosh(col)[source]
Compute hyperbolic cosine.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the cosh function.
- static sinh(col)[source]
Compute hyperbolic sine.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the sinh function.
- static tanh(col)[source]
Compute hyperbolic tangent.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the tanh function.
- static degrees(col)[source]
Convert radians to degrees.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the degrees function.
- static radians(col)[source]
Convert degrees to radians.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the radians function.
- static cbrt(col)[source]
Compute cube root.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the cbrt function.
- static factorial(col)[source]
Compute factorial.
- static rand(seed=None)[source]
Generate a random column with i.i.d. samples from U[0.0, 1.0].
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the rand function.
- static randn(seed=None)[source]
Generate a random column with i.i.d. samples from standard normal distribution.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the randn function.
- static rint(col)[source]
Round to nearest integer using banker’s rounding (half to even).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the rint function.
- static bround(col, scale=0)[source]
Round using HALF_EVEN rounding mode (banker’s rounding).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the bround function.
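Half-to-even rounding differs from the more familiar half-up rule only on exact ties. The sketch below uses Python's `decimal` module to show the difference. It takes decimal strings to avoid binary floating-point noise and is not the sparkless implementation.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def bround(value: str, scale: int = 0) -> Decimal:
    quantum = Decimal(1).scaleb(-scale)  # scale=2 gives a quantum of 0.01
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_EVEN)


def round_half_up(value: str, scale: int = 0) -> Decimal:
    quantum = Decimal(1).scaleb(-scale)
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)
```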
- static hypot(col1, col2)[source]
Compute sqrt(col1^2 + col2^2) (hypotenuse).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the hypot function.
- static signum(col)[source]
Compute the signum function (sign: -1, 0, or 1).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the signum function.
- static cot(col)[source]
Compute cotangent (PySpark 3.3+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the cot function.
- static csc(col)[source]
Compute cosecant (PySpark 3.3+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the csc function.
- static sec(col)[source]
Compute secant (PySpark 3.3+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the sec function.
- static e()[source]
Return Euler’s number e (PySpark 3.5+).
- Return type:
- Returns:
ColumnOperation representing Euler’s number constant.
- static pi()[source]
Return the value of pi (PySpark 3.5+).
- Return type:
- Returns:
ColumnOperation representing pi constant.
- static ln(col)[source]
Compute natural logarithm (alias for log) (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the ln function.
- static toDegrees(column)[source]
Deprecated alias for degrees (all PySpark versions).
Use degrees instead.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the degrees conversion.
- static toRadians(column)[source]
Deprecated alias for radians (all PySpark versions).
Use radians instead.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the radians conversion.
- static negate(column)[source]
Negate value (alias for negative).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the negate function.
- static getbit(column, bit)[source]
Get bit at specified position (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the getbit function.
Example
>>> df.select(F.getbit(F.col("value"), 3))
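For intuition, the bit lookup can be written with a shift and a mask. This sketch assumes positions are counted from the least significant bit starting at 0, as in PySpark; getbit_value is a hypothetical name.

```python
def getbit_value(value: int, bit: int) -> int:
    """Return the bit (0 or 1) at position `bit`, counting from the right at 0."""
    return (value >> bit) & 1


# 8 is 0b1000, so only bit 3 is set.
print([getbit_value(8, position) for position in range(4)])
```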
- class sparkless.functions.functions.AggregateFunctions[source]
Bases: object
Collection of aggregate functions.
- static count(column=None)[source]
Count non-null values.
- Parameters:
column (Union[Column, str, None]) – The column to count (None for count(*)).
- Return type:
- Returns:
ColumnOperation representing the count function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static sum(column)[source]
Sum values.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the sum function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static avg(column)[source]
Average values.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the avg function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static max(column)[source]
Maximum value.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the max function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static min(column)[source]
Minimum value.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the min function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static first(column, ignorenulls=False)[source]
First value.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the first function.
- Raises:
RuntimeError – If no active SparkSession is available
- static last(column)[source]
Last value.
- Parameters:
column (Union[Column, str]) – The column to get last value of.
- Return type:
- Returns:
AggregateFunction representing the last function.
- Raises:
RuntimeError – If no active SparkSession is available
- static collect_list(column)[source]
Collect values into a list.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the collect_list function.
- Raises:
RuntimeError – If no active SparkSession is available
- static collect_set(column)[source]
Collect unique values into a set.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the collect_set function.
- Raises:
RuntimeError – If no active SparkSession is available
- static stddev(column)[source]
Standard deviation.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the stddev function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static std(column)[source]
Alias for stddev - Standard deviation.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the std function.
- Raises:
RuntimeError – If no active SparkSession is available
- static product(column)[source]
Multiply all values in column.
- Parameters:
column (Union[Column, str]) – The column to multiply values of.
- Return type:
- Returns:
AggregateFunction representing the product function.
- Raises:
RuntimeError – If no active SparkSession is available
- static sum_distinct(column)[source]
Sum of distinct values.
- Parameters:
column (Union[Column, str]) – The column to sum distinct values of.
- Return type:
- Returns:
AggregateFunction representing the sum_distinct function.
- Raises:
RuntimeError – If no active SparkSession is available
- static variance(column)[source]
Variance.
- Parameters:
column (Union[Column, str]) – The column to get variance of.
- Return type:
- Returns:
ColumnOperation representing the variance function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static skewness(column)[source]
Skewness.
- Parameters:
column (Union[Column, str]) – The column to get skewness of.
- Return type:
- Returns:
AggregateFunction representing the skewness function.
- Raises:
RuntimeError – If no active SparkSession is available
- static kurtosis(column)[source]
Kurtosis.
- Parameters:
column (Union[Column, str]) – The column to get kurtosis of.
- Return type:
- Returns:
AggregateFunction representing the kurtosis function.
- Raises:
RuntimeError – If no active SparkSession is available
- static countDistinct(column)[source]
Count distinct values.
- Parameters:
column (Union[Column, str]) – The column to count distinct values of.
- Return type:
- Returns:
AggregateFunction representing the countDistinct function.
- Raises:
RuntimeError – If no active SparkSession is available
- static count_distinct(column)[source]
Alias for countDistinct - Count distinct values.
- Parameters:
column (Union[Column, str]) – The column to count distinct values of.
- Return type:
- Returns:
AggregateFunction representing the count_distinct function.
- Raises:
RuntimeError – If no active SparkSession is available
- static percentile_approx(column, percentage, accuracy=10000)[source]
Approximate percentile.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the percentile_approx function.
- Raises:
RuntimeError – If no active SparkSession is available
- static corr(column1, column2)[source]
Correlation between two columns.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the corr function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static covar_samp(column1, column2)[source]
Sample covariance between two columns.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the covar_samp function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static bool_and(column)[source]
Aggregate AND - returns true if all values are true (PySpark 3.1+).
- Parameters:
column (Union[Column, str]) – Column containing boolean values.
- Return type:
- Returns:
AggregateFunction representing the bool_and function.
- Raises:
RuntimeError – If no active SparkSession is available
- static bool_or(column)[source]
Aggregate OR - returns true if any value is true (PySpark 3.1+).
- Parameters:
column (Union[Column, str]) – Column containing boolean values.
- Return type:
- Returns:
AggregateFunction representing the bool_or function.
- Raises:
RuntimeError – If no active SparkSession is available
- static every(column)[source]
Alias for bool_and (PySpark 3.1+).
- Parameters:
column (Union[Column, str]) – Column containing boolean values.
- Return type:
- Returns:
AggregateFunction representing the every function.
- Raises:
RuntimeError – If no active SparkSession is available
- static some(column)[source]
Alias for bool_or (PySpark 3.1+).
- Parameters:
column (Union[Column, str]) – Column containing boolean values.
- Return type:
- Returns:
AggregateFunction representing the some function.
- Raises:
RuntimeError – If no active SparkSession is available
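The boolean aggregates above (bool_and, bool_or and their aliases every and some) reduce a column to a single truth value. A small sketch on Python lists, assuming null (None) entries are skipped and an all-null input yields None as in Spark SQL; whether Sparkless follows the same null rule is an assumption, and the helper names are hypothetical.

```python
def bool_and_values(values):
    """True if every non-null value is true; None if nothing is non-null."""
    present = [v for v in values if v is not None]
    return all(present) if present else None


def bool_or_values(values):
    """True if any non-null value is true; None if nothing is non-null."""
    present = [v for v in values if v is not None]
    return any(present) if present else None


flags = [True, None, False]
print(bool_and_values(flags), bool_or_values(flags))
```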
- static max_by(column, ord)[source]
Return value associated with the maximum of ord column (PySpark 3.1+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the max_by function.
- Raises:
RuntimeError – If no active SparkSession is available
- static min_by(column, ord)[source]
Return value associated with the minimum of ord column (PySpark 3.1+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the min_by function.
- Raises:
RuntimeError – If no active SparkSession is available
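max_by and min_by return the value from one column on the row where another column is largest or smallest. The sketch below shows the idea on a list of dicts; the sample rows, key names and helper functions are made up for illustration, and skipping rows with a null ordering value is an assumption.

```python
rows = [
    {"name": "Alice", "score": 82},
    {"name": "Bob", "score": 95},
    {"name": "Cara", "score": 67},
]


def max_by_rows(rows, value_key, ord_key):
    """Value of `value_key` on the row with the largest `ord_key`."""
    candidates = [r for r in rows if r[ord_key] is not None]
    return max(candidates, key=lambda r: r[ord_key])[value_key] if candidates else None


def min_by_rows(rows, value_key, ord_key):
    """Value of `value_key` on the row with the smallest `ord_key`."""
    candidates = [r for r in rows if r[ord_key] is not None]
    return min(candidates, key=lambda r: r[ord_key])[value_key] if candidates else None


print(max_by_rows(rows, "name", "score"), min_by_rows(rows, "name", "score"))
```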
- static count_if(column)[source]
Count rows where condition is true (PySpark 3.1+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the count_if function.
- Raises:
RuntimeError – If no active SparkSession is available
- static any_value(column)[source]
Return any non-null value (non-deterministic) (PySpark 3.1+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the any_value function.
- Raises:
RuntimeError – If no active SparkSession is available
- static mean(column)[source]
Aggregate function: returns the mean of the values (alias for avg).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the mean function.
- Raises:
RuntimeError – If no active SparkSession is available
- static approx_count_distinct(column, rsd=None)[source]
Returns the approximate count of distinct elements (approxCountDistinct is the deprecated alias of this function).
- Parameters:
column (Union[Column, str]) – Column to count distinct values.
rsd (Optional[float]) – Optional relative standard deviation (default: None, which uses PySpark’s default of 0.05). Controls the approximation accuracy. Lower values provide better accuracy but use more memory. Typical values range from 0.01 (1% error) to 0.1 (10% error).
- Return type:
- Returns:
ColumnOperation representing the approx_count_distinct function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static stddev_pop(column)[source]
Returns population standard deviation.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the stddev_pop function.
- Raises:
RuntimeError – If no active SparkSession is available
- static stddev_samp(column)[source]
Returns sample standard deviation.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the stddev_samp function.
- Raises:
RuntimeError – If no active SparkSession is available
- static var_pop(column)[source]
Returns population variance.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the var_pop function.
- Raises:
RuntimeError – If no active SparkSession is available
- static var_samp(column)[source]
Returns sample variance.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the var_samp function.
- Raises:
RuntimeError – If no active SparkSession is available
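The population and sample variants above differ only in the divisor: n for stddev_pop and var_pop, n - 1 for stddev_samp and var_samp. Python's statistics module makes the difference easy to see on a small made-up sample:

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

var_pop = statistics.pvariance(data)    # divides by n
var_samp = statistics.variance(data)    # divides by n - 1
stddev_pop = statistics.pstdev(data)
stddev_samp = statistics.stdev(data)

print(var_pop, var_samp, stddev_pop, stddev_samp)
```

The sample versions are always slightly larger, and the gap shrinks as the number of rows grows.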
- static covar_pop(column1, column2)[source]
Returns population covariance.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the covar_pop function.
- static median(column)[source]
Returns the median value (PySpark 3.4+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the median function.
- Raises:
RuntimeError – If no active SparkSession is available
- static mode(column)[source]
Returns the most frequent value (mode) (PySpark 3.4+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the mode function.
- Raises:
RuntimeError – If no active SparkSession is available
- static percentile(column, percentage)[source]
Returns the exact percentile value (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the percentile function.
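An exact percentile can be sketched on a Python list. This version assumes linear interpolation between the two nearest ranks and skips nulls, which is how Spark's exact percentile is usually described; the reference above does not confirm that Sparkless uses the same rule, and percentile_value is a hypothetical helper.

```python
import math


def percentile_value(values, percentage):
    """Exact percentile with linear interpolation; `percentage` is in [0, 1]."""
    data = sorted(v for v in values if v is not None)
    if not data:
        return None
    position = (len(data) - 1) * percentage
    lower, upper = math.floor(position), math.ceil(position)
    if lower == upper:
        return float(data[lower])
    # Interpolate between the two neighbouring sorted values.
    return data[lower] + (data[upper] - data[lower]) * (position - lower)


print(percentile_value([1, 2, 3, 4], 0.5))
```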
- static approxCountDistinct(*cols)[source]
Deprecated alias for approx_count_distinct (all PySpark versions).
Use approx_count_distinct instead.
- static sumDistinct(column)[source]
Deprecated alias for sum_distinct (PySpark 3.2+).
Use sum_distinct instead (or sum(distinct(col)) for earlier versions).
- Parameters:
- Return type:
- Returns:
AggregateFunction for distinct sum.
- static regr_avgx(y, x)[source]
Linear regression average of x.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the regr_avgx function.
- Raises:
RuntimeError – If no active SparkSession is available
- static regr_avgy(y, x)[source]
Linear regression average of y.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the regr_avgy function.
- Raises:
RuntimeError – If no active SparkSession is available
- static regr_count(y, x)[source]
Linear regression count.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the regr_count function.
- Raises:
RuntimeError – If no active SparkSession is available
- static regr_intercept(y, x)[source]
Linear regression intercept.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the regr_intercept function.
- Raises:
RuntimeError – If no active SparkSession is available
- static regr_r2(y, x)[source]
Linear regression R-squared.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the regr_r2 function.
- Raises:
RuntimeError – If no active SparkSession is available
- static regr_slope(y, x)[source]
Linear regression slope.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the regr_slope function.
- Raises:
RuntimeError – If no active SparkSession is available
- static regr_sxx(y, x)[source]
Linear regression sum of squares of x.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the regr_sxx function.
- Raises:
RuntimeError – If no active SparkSession is available
- static regr_sxy(y, x)[source]
Linear regression sum of products.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the regr_sxy function.
- Raises:
RuntimeError – If no active SparkSession is available
- static regr_syy(y, x)[source]
Linear regression sum of squares of y.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the regr_syy function.
- Raises:
RuntimeError – If no active SparkSession is available
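The regr_* aggregates above are all derived from the same few sums over (y, x) pairs. The sketch below computes them in plain Python, assuming pairs with a null on either side are dropped as in standard SQL; regr_stats is a hypothetical helper, not the Sparkless code path.

```python
def regr_stats(pairs):
    """Compute regression aggregates from (y, x) pairs, ignoring pairs with None."""
    clean = [(y, x) for y, x in pairs if y is not None and x is not None]
    n = len(clean)
    if n == 0:
        return None
    avg_x = sum(x for _, x in clean) / n
    avg_y = sum(y for y, _ in clean) / n
    sxx = sum((x - avg_x) ** 2 for _, x in clean)
    syy = sum((y - avg_y) ** 2 for y, _ in clean)
    sxy = sum((x - avg_x) * (y - avg_y) for y, x in clean)
    slope = sxy / sxx if sxx else None
    intercept = avg_y - slope * avg_x if slope is not None else None
    r2 = (sxy * sxy) / (sxx * syy) if sxx and syy else None
    return {
        "count": n, "avgx": avg_x, "avgy": avg_y,
        "sxx": sxx, "syy": syy, "sxy": sxy,
        "slope": slope, "intercept": intercept, "r2": r2,
    }


# Points on y = 2x + 1, plus one pair that is skipped because y is None.
stats = regr_stats([(3, 1), (5, 2), (7, 3), (None, 4)])
print(stats)
```

Note the argument order: the dependent variable y comes first, matching the (y, x) signatures listed above.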
- class sparkless.functions.functions.DateTimeFunctions[source]
Bases: object
Collection of datetime functions.
- static current_timestamp()[source]
Get current timestamp.
- Return type:
- Returns:
ColumnOperation representing the current_timestamp function.
- Raises:
RuntimeError – If no active SparkSession is available
- static current_date()[source]
Get current date.
- Return type:
- Returns:
ColumnOperation representing the current_date function.
- Raises:
RuntimeError – If no active SparkSession is available
- static now()[source]
Alias for current_timestamp - Get current timestamp.
- Return type:
- Returns:
ColumnOperation representing the now function.
- static curdate()[source]
Alias for current_date - Get current date.
- Return type:
- Returns:
ColumnOperation representing the curdate function.
- static days(column)[source]
Convert number to days interval.
- static hours(column)[source]
Convert number to hours interval.
- static months(column)[source]
Convert number to months interval.
- static years(column)[source]
Convert number to years interval.
- static localtimestamp()[source]
Get local timestamp (without timezone).
- Return type:
- Returns:
ColumnOperation representing the localtimestamp function.
- static datepart(date_part, date)[source]
SQL Server style date part extraction.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the datepart function.
- static make_timestamp(year, month, day, hour=0, minute=0, second=0)[source]
Create timestamp from components.
- static make_timestamp_ltz(year, month, day, hour=0, minute=0, second=0, timezone=None)[source]
Create timestamp with local timezone.
- static make_timestamp_ntz(year, month, day, hour=0, minute=0, second=0)[source]
Create timestamp with no timezone.
- static make_interval(years=0, months=0, weeks=0, days=0, hours=0, mins=0, secs=0)[source]
Create interval from components.
- Parameters:
years (Union[Column, str, int]) – Years component (default 0).
months (Union[Column, str, int]) – Months component (default 0).
weeks (Union[Column, str, int]) – Weeks component (default 0).
days (Union[Column, str, int]) – Days component (default 0).
hours (Union[Column, str, int]) – Hours component (default 0).
mins (Union[Column, str, int]) – Minutes component (default 0).
secs (Union[Column, str, int]) – Seconds component (default 0).
- Return type:
- Returns:
ColumnOperation representing the make_interval function.
- static make_dt_interval(days=0, hours=0, mins=0, secs=0)[source]
Create day-time interval.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the make_dt_interval function.
- static to_number(column, format=None)[source]
Convert string to number.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the to_number function.
- static to_binary(column, format=None)[source]
Convert to binary format.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the to_binary function.
- static to_unix_timestamp(column, format=None)[source]
Convert to unix timestamp.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the to_unix_timestamp function.
- static unix_date(column)[source]
Convert a date to the number of days since the Unix epoch (1970-01-01).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the unix_date function.
- static unix_seconds(column)[source]
Convert timestamp to unix seconds.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the unix_seconds function.
- static unix_millis(column)[source]
Convert timestamp to unix milliseconds.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the unix_millis function.
- static unix_micros(column)[source]
Convert timestamp to unix microseconds.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the unix_micros function.
- static timestamp_millis(column)[source]
Create timestamp from unix milliseconds.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the timestamp_millis function.
- static timestamp_micros(column)[source]
Create timestamp from unix microseconds.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the timestamp_micros function.
- static to_date(column, format=None)[source]
Convert string, timestamp, or date to date.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the to_date function.
- Raises:
TypeError – If input column type is not StringType, TimestampType, or DateType
- static to_timestamp(column, format=None)[source]
Convert to timestamp.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the to_timestamp function.
- Raises:
TypeError – If input column type is not one of the supported types.
- static hour(column)[source]
Extract hour from timestamp.
- static day(column)[source]
Extract day from date/timestamp.
- static dayofmonth(column)[source]
Extract day of month from date/timestamp (alias for day).
- static month(column)[source]
Extract month from date/timestamp.
- static year(column)[source]
Extract year from date/timestamp.
- static dayofweek(column)[source]
Extract day of week from date/timestamp.
- static dayofyear(column)[source]
Extract day of year from date/timestamp.
- static weekofyear(column)[source]
Extract week of year from date/timestamp.
- static quarter(column)[source]
Extract quarter from date/timestamp.
- static minute(column)[source]
Extract minute from timestamp.
- static second(column)[source]
Extract second from timestamp.
- static add_months(column, num_months)[source]
Add months to date/timestamp.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the add_months function.
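Month arithmetic needs a rule for days that do not exist in the target month. The sketch below clamps to the last valid day, which is the behaviour PySpark documents for add_months; treating Sparkless the same way is an assumption, and add_months_value is a hypothetical helper.

```python
import calendar
from datetime import date


def add_months_value(d, num_months):
    """Shift a date by whole months, clamping the day to the target month's length."""
    month_index = d.year * 12 + (d.month - 1) + num_months
    year, month_zero_based = divmod(month_index, 12)
    month = month_zero_based + 1
    last_valid_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_valid_day))


# 31 January plus one month lands on the last day of February.
print(add_months_value(date(2024, 1, 31), 1))
```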
- static months_between(column1, column2)[source]
Calculate months between two dates.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the months_between function.
- static date_add(column, days)[source]
Add days to date.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the date_add function.
- static date_sub(column, days)[source]
Subtract days from date.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the date_sub function.
- static date_format(column, format)[source]
Format date/timestamp as string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the date_format function.
- static from_unixtime(column, format='yyyy-MM-dd HH:mm:ss')[source]
Convert unix timestamp to string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the from_unixtime function.
- static timestampadd(unit, quantity, timestamp)[source]
Add time units to a timestamp.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the timestampadd function.
Example
>>> df.select(F.timestampadd("DAY", 7, F.col("created_at")))
>>> df.select(F.timestampadd("HOUR", F.col("offset"), "2024-01-01"))
- static timestampdiff(unit, start, end)[source]
Calculate difference between two timestamps.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the timestampdiff function.
Example
>>> df.select(F.timestampdiff("DAY", F.col("start_date"), F.col("end_date")))
>>> df.select(F.timestampdiff("HOUR", "2024-01-01", F.col("end_time")))
- static convert_timezone(sourceTz, targetTz, sourceTs)[source]
Convert timestamp from source to target timezone.
- Parameters:
- Return type:
- static current_timezone()[source]
Get current timezone.
- Raises:
RuntimeError – If no active SparkSession is available
- Return type:
- static from_utc_timestamp(ts, tz)[source]
Convert UTC timestamp to given timezone.
- Parameters:
- Return type:
- static to_utc_timestamp(ts, tz)[source]
Convert timestamp from given timezone to UTC.
- Parameters:
- Return type:
- static date_part(field, source)[source]
Extract a field from a date/timestamp.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the date_part function.
Example
>>> df.select(F.date_part("YEAR", F.col("date")))
- static dayname(date)[source]
Get the name of the day of the week.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the dayname function.
Example
>>> df.select(F.dayname(F.col("date")))
- static make_date(year, month, day)[source]
Construct a date from year, month, day integers (PySpark 3.0+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the make_date function
Example
>>> df.select(F.make_date(F.lit(2024), F.lit(3), F.lit(15)))
- static date_trunc(format, timestamp)[source]
Truncate timestamp to specified unit (year, month, day, hour, etc.).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the date_trunc function
Example
>>> df.select(F.date_trunc('month', F.col('timestamp')))
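Truncation zeroes every field below the chosen unit. A minimal sketch with datetime.replace, covering only a few units; the supported unit names here are an illustrative subset and date_trunc_value is a hypothetical helper.

```python
from datetime import datetime


def date_trunc_value(unit, ts):
    """Truncate a datetime to the start of the given unit (subset of units only)."""
    unit = unit.lower()
    if unit == "year":
        return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    if unit == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if unit == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit in this sketch: {unit}")


print(date_trunc_value("month", datetime(2024, 3, 15, 13, 45, 30)))
```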
- static datediff(end, start)[source]
Returns number of days between two dates.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the datediff function
Example
>>> df.select(F.datediff(F.col('end_date'), F.lit('2024-01-01')))
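The argument order of datediff is easy to get backwards: the end date comes first. A scalar sketch with Python dates, using a hypothetical helper name:

```python
from datetime import date


def datediff_value(end, start):
    """Number of days from `start` to `end`; negative when end precedes start."""
    return (end - start).days


print(datediff_value(date(2024, 1, 10), date(2024, 1, 1)))
```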
- static date_diff(end, start)[source]
Alias for datediff - Returns number of days between two dates.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the date_diff function
Example
>>> df.select(F.date_diff(F.col('end_date'), F.col('start_date')))
- static unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')[source]
Convert timestamp string to Unix timestamp (seconds since epoch).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the unix_timestamp function
Example
>>> df.select(F.unix_timestamp(F.col('timestamp'), 'yyyy-MM-dd'))
- static last_day(date)[source]
Returns the last day of the month for a given date.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the last_day function
Example
>>> df.select(F.last_day(F.col('date')))
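On plain Python dates the same result can be obtained from calendar.monthrange, which also handles leap years. last_day_value is a hypothetical helper for illustration.

```python
import calendar
from datetime import date


def last_day_value(d):
    """Return the last calendar day of the month containing `d`."""
    days_in_month = calendar.monthrange(d.year, d.month)[1]
    return d.replace(day=days_in_month)


print(last_day_value(date(2024, 2, 10)))
```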
- static next_day(date, dayOfWeek)[source]
Returns the first date which is later than the value of the date column that is on the specified day of the week.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the next_day function
Example
>>> df.select(F.next_day(F.col('date'), 'Monday'))
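Because the result must be strictly later than the input, a date that already falls on the requested weekday moves forward a full week. A sketch of that rule, assuming day names are matched on their first three letters; next_day_value and the lookup table are hypothetical.

```python
from datetime import date, timedelta

_DAY_ABBREVIATIONS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def next_day_value(d, day_of_week):
    """First date strictly after `d` that falls on `day_of_week`."""
    target = _DAY_ABBREVIATIONS.index(day_of_week[:3].lower())
    # The offset is always in 1..7, never 0, so the result is strictly later.
    offset = (target - d.weekday() - 1) % 7 + 1
    return d + timedelta(days=offset)


# 2024-01-01 is a Monday.
print(next_day_value(date(2024, 1, 1), "Monday"), next_day_value(date(2024, 1, 1), "Wed"))
```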
- static trunc(date, format)[source]
Truncate date to specified unit (year, month, etc.).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the trunc function
Example
>>> df.select(F.trunc(F.col('date'), 'year'))
- static timestamp_seconds(col)[source]
Convert seconds since epoch to timestamp (PySpark 3.1+).
- Parameters:
col (Union[Column, str, int]) – Column or integer representing seconds since epoch.
- Return type:
- Returns:
ColumnOperation representing the timestamp
Example
>>> df.select(F.timestamp_seconds(F.col("seconds")))
- static weekday(col)[source]
Get the day of week as an integer (0 = Monday, 6 = Sunday) (PySpark 3.5+).
- Parameters:
col (Union[Column, str]) – Column or column name containing date/timestamp values.
- Return type:
- Returns:
ColumnOperation representing the weekday function.
Note
Returns 0 for Monday through 6 for Sunday.
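This numbering matches Python's own date.weekday(). It differs from dayofweek, which in PySpark counts from 1 for Sunday to 7 for Saturday; that convention comes from PySpark and is not stated in this reference, so treat it as an assumption for Sparkless. The helper names below are hypothetical.

```python
from datetime import date


def weekday_value(d):
    """0 = Monday ... 6 = Sunday, the convention documented for weekday."""
    return d.weekday()


def dayofweek_value(d):
    """1 = Sunday ... 7 = Saturday, the PySpark dayofweek convention (assumed)."""
    return (d.weekday() + 1) % 7 + 1


monday, sunday = date(2024, 1, 1), date(2024, 1, 7)
print(weekday_value(monday), weekday_value(sunday))
print(dayofweek_value(monday), dayofweek_value(sunday))
```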
- static extract(field, source)[source]
Extract a field from a date/timestamp column (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the extract function.
Example
>>> df.select(F.extract("YEAR", F.col("date")))
>>> df.select(F.extract("MONTH", F.col("timestamp")))
- static date_from_unix_date(days)[source]
Convert unix date (days since epoch) to date (PySpark 3.5+).
- Parameters:
days (Union[Column, str, int]) – Column or integer representing days since epoch (1970-01-01).
- Return type:
- Returns:
ColumnOperation representing the date_from_unix_date function.
Example
>>> df.select(F.date_from_unix_date(F.col("days")))
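A unix date is a day count from 1970-01-01, so the conversion is a single timedelta addition. The sketch below shows both directions on Python dates; the helper names are hypothetical.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def date_from_unix_date_value(days):
    """Convert a day count since 1970-01-01 into a date."""
    return EPOCH + timedelta(days=days)


def days_since_epoch(d):
    """Inverse conversion: days elapsed from 1970-01-01 to `d`."""
    return (d - EPOCH).days


print(date_from_unix_date_value(0), date_from_unix_date_value(19723))
```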
- static to_timestamp_ltz(timestamp_str, format=None)[source]
Convert string to timestamp with local timezone (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the to_timestamp_ltz function.
Example
>>> df.select(F.to_timestamp_ltz(F.col("ts_str"), "yyyy-MM-dd HH:mm:ss"))
String Functions
String functions for Sparkless.
This module provides comprehensive string manipulation functions that match PySpark’s string function API. Includes case conversion, trimming, pattern matching, and string transformation operations for text processing in DataFrames.
- Key Features:
Complete PySpark string function API compatibility
Case conversion (upper, lower)
Length and trimming operations (length, trim, ltrim, rtrim)
Pattern matching and replacement (regexp_replace, split)
String manipulation (substring, concat)
Type-safe operations with proper return types
Support for both column references and string literals
Example
>>> from sparkless.sql import SparkSession, functions as F
>>> spark = SparkSession("test")
>>> data = [{"name": " Alice ", "email": "alice@example.com"}]
>>> df = spark.createDataFrame(data)
>>> df.select(
... F.upper(F.trim(F.col("name"))),
... F.regexp_replace(F.col("email"), "@.*", "@company.com")
... ).show()
DataFrame[1 rows, 2 columns]
upper(trim(name)) regexp_replace(email, @.*, @company.com, 1)
ALICE alice@company.com
- class sparkless.functions.string.StringFunctions[source]
Bases: object
Collection of string manipulation functions.
- static upper(column)[source]
Convert string to uppercase.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the upper function.
- static lower(column)[source]
Convert string to lowercase.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the lower function.
- static length(column)[source]
Get the length of a string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the length function.
- static char_length(column)[source]
Alias for length() - Get the character length of a string (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the char_length function.
- static character_length(column)[source]
Alias for length() - Get the character length of a string (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the character_length function.
- static trim(column)[source]
Trim whitespace from string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the trim function.
- static ltrim(column)[source]
Trim whitespace from left side of string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the ltrim function.
- static rtrim(column)[source]
Trim whitespace from right side of string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the rtrim function.
- static btrim(column, trim_string=None)[source]
Trim characters from both ends of string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the btrim function.
- static contains(column, substring)[source]
Check if string contains substring.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the contains function.
- static left(column, length)[source]
Extract left N characters from string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the left function.
- static right(column, length)[source]
Extract right N characters from string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the right function.
- static bit_length(column)[source]
Get bit length of string.
- static startswith(column, substring)[source]
Check if string starts with substring.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the startswith function.
- static endswith(column, substring)[source]
Check if string ends with substring.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the endswith function.
- static like(column, pattern)[source]
SQL LIKE pattern matching.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the like function.
- static rlike(column, pattern)[source]
Regular expression pattern matching.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the rlike function.
- static replace(column, old, new)[source]
Replace occurrences of substring in string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the replace function.
- static substr(column, start, length=None)[source]
Alias for substring - Extract substring from string.
- static split_part(column, delimiter, part)[source]
Extract part of string split by delimiter.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the split_part function.
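A plain-Python sketch of split_part follows. The 1-based index mirrors the SQL function; counting negative indexes from the end and returning an empty string when the index is out of range follow Spark SQL's documented behaviour and are assumptions for Sparkless. split_part_value is a hypothetical helper.

```python
def split_part_value(text, delimiter, part):
    """Return the 1-based `part` of `text` split on `delimiter`."""
    if part == 0:
        raise ValueError("part index is 1-based; 0 is not valid")
    pieces = text.split(delimiter)
    index = part - 1 if part > 0 else len(pieces) + part
    return pieces[index] if 0 <= index < len(pieces) else ""


print(split_part_value("a,b,c", ",", 2), split_part_value("a,b,c", ",", -1))
```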
- static position(substring, column)[source]
Find position of substring in string (1-indexed).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the position function.
- static octet_length(column)[source]
Get byte length (octet length) of string.
- static char(column)[source]
Convert integer to character.
- static ucase(column)[source]
Alias for upper - Convert string to uppercase.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the ucase function.
- static lcase(column)[source]
Alias for lower - Convert string to lowercase.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the lcase function.
- static elt(n, *columns)[source]
Return element at index from list of columns.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the elt function.
- static regexp_replace(column, pattern, replacement)[source]
Replace regex pattern in string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the regexp_replace function.
- static split(column, delimiter, limit=None)[source]
Split string by delimiter.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the split function.
- static concat(*columns)[source]
Concatenate multiple strings.
- static format_string(format_str, *columns)[source]
Format string using printf-style format string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the format_string function.
- static translate(column, matching_string, replace_string)[source]
Translate characters in string using character mapping.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the translate function.
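translate maps characters one-to-one by position. The sketch below assumes, as in PySpark, that characters in matching_string with no counterpart in replace_string are deleted; translate_value is a hypothetical helper built on str.translate.

```python
def translate_value(text, matching_string, replace_string):
    """Replace each character of `matching_string` with the one at the same position."""
    table = {}
    for position, char in enumerate(matching_string):
        if ord(char) in table:
            continue  # the first mapping for a repeated character wins
        # A missing counterpart maps to None, which str.translate treats as deletion.
        table[ord(char)] = replace_string[position] if position < len(replace_string) else None
    return text.translate(table)


print(translate_value("hello", "el", "ip"), translate_value("abc", "abc", "x"))
```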
- static ascii(column)[source]
Get ASCII value of first character in string.
- static base64(column)[source]
Encode string to base64.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the base64 function.
- static unbase64(column)[source]
Decode base64 string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the unbase64 function.
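The encoding itself is standard Base64, so Python's base64 module shows what the two functions compute. The sketch assumes UTF-8 text input and returns raw bytes from decoding, as PySpark's unbase64 does; the helper names are hypothetical.

```python
import base64


def base64_value(text):
    """Encode a UTF-8 string as Base64 text."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def unbase64_value(encoded):
    """Decode Base64 text back to raw bytes."""
    return base64.b64decode(encoded)


encoded = base64_value("Spark")
print(encoded, unbase64_value(encoded))
```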
- static regexp_extract_all(column, pattern, idx=0)[source]
Extract all matches of a regex pattern.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the regexp_extract_all function.
Example
>>> df.select(F.regexp_extract_all(F.col("text"), r"\d+", 0))
- static array_join(column, delimiter, null_replacement=None)[source]
Join array elements with a delimiter.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the array_join function.
Example
>>> df.select(F.array_join(F.col("tags"), ", "))
>>> df.select(F.array_join(F.col("tags"), "|", "N/A"))
- static reverse(column)[source]
Reverse a string column.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the reverse function.
Example
>>> df.select(F.reverse(F.col("name")))
- static repeat(column, n)[source]
Repeat a string N times.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the repeat function.
Example
>>> df.select(F.repeat(F.col("text"), 3))
- static initcap(column)[source]
Capitalize first letter of each word.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the initcap function.
Example
>>> df.select(F.initcap(F.col("name")))
- static soundex(column)[source]
Soundex encoding for phonetic matching.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the soundex function.
Example
>>> df.select(F.soundex(F.col("name")))
- static parse_url(url, part)[source]
Extract a part from a URL.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the parse_url function.
Example
>>> df.select(F.parse_url(F.col("url"), "HOST"))
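To see what the common `part` values select, here is a sketch built on `urllib.parse`. The part names (HOST, PATH, QUERY, PROTOCOL, REF) follow Spark SQL's `parse_url`; whether Sparkless supports all of them is an assumption, and the helper name is hypothetical.

```python
from urllib.parse import urlsplit


def parse_url_sketch(url: str, part: str):
    """Return one component of `url`, or None when it is absent."""
    pieces = urlsplit(url)
    lookup = {
        "HOST": pieces.hostname,
        "PATH": pieces.path,
        "QUERY": pieces.query,
        "PROTOCOL": pieces.scheme,
        "REF": pieces.fragment,
    }
    return lookup[part] or None


print(parse_url_sketch("https://spark.apache.org/docs/latest?lang=py#top", "HOST"))
```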
- static url_encode(url)[source]
URL-encode a string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the url_encode function.
Example
>>> df.select(F.url_encode(F.col("text")))
- static url_decode(url)[source]
URL-decode a string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the url_decode function.
Example
>>> df.select(F.url_decode(F.col("encoded")))
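The encode/decode pair above behaves like a round trip. A standard-library sketch follows, assuming form-style encoding in which a space becomes `+` (as Spark's `url_encode` does); the helper names are hypothetical.

```python
from urllib.parse import quote_plus, unquote_plus


def url_encode_sketch(text: str) -> str:
    """Percent-encode reserved characters; spaces become '+'."""
    return quote_plus(text)


def url_decode_sketch(encoded: str) -> str:
    """Reverse url_encode_sketch."""
    return unquote_plus(encoded)


encoded = url_encode_sketch("a b&c=d")
print(encoded, url_decode_sketch(encoded))
```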
- static concat_ws(sep, *cols)[source]
Concatenate multiple columns with a separator.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing concat_ws
Example
>>> df.select(F.concat_ws("-", F.col("first"), F.col("last")))
- static regexp_extract(column, pattern, idx=0)[source]
Extract a specific group matched by a regex pattern.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing regexp_extract
Example
>>> df.select(F.regexp_extract(F.col("email"), r"(.+)@(.+)", 1))
>>> df.select(F.regexp_extract(F.col("text"), r"(?<=prefix_)\w+", 0))
Note
Fixed in version 3.23.0 (Issue #228): Added fallback support for regex patterns with lookahead and lookbehind assertions using Python’s re module when Polars native support is unavailable.
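The two example patterns can be tried directly with Python's `re` module, which the note names as the fallback engine. The sketch assumes PySpark's convention of returning an empty string when nothing matches; `regexp_extract_sketch` is a hypothetical helper.

```python
import re


def regexp_extract_sketch(text: str, pattern: str, idx: int = 0) -> str:
    """Return group `idx` of the first match, or '' when there is no match."""
    match = re.search(pattern, text)
    if match is None:
        return ""
    return match.group(idx) or ""


print(regexp_extract_sketch("alice@example.com", r"(.+)@(.+)", 1))
print(regexp_extract_sketch("prefix_value", r"(?<=prefix_)\w+", 0))
```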
- static substring_index(column, delim, count)[source]
Return the substring before count occurrences of the delimiter when count is positive, or after them (counting from the right) when count is negative.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing substring_index
Example
>>> df.select(F.substring_index(F.col("path"), "/", 2))
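The effect of the sign of `count` is easiest to see on one string. This plain-Python sketch follows PySpark's documented semantics and uses a hypothetical helper name.

```python
def substring_index_sketch(text: str, delim: str, count: int) -> str:
    """Positive count keeps text left of the count-th delimiter; negative counts from the right."""
    if count == 0 or not delim:
        return ""
    parts = text.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    return delim.join(parts[count:])


print(substring_index_sketch("a/b/c/d", "/", 2))
print(substring_index_sketch("a/b/c/d", "/", -2))
```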
- static format_number(column, d)[source]
Format number with d decimal places and thousands separator.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing format_number
Example
>>> df.select(F.format_number(F.col("amount"), 2))
- static instr(column, substr)[source]
Locate the position of the first occurrence of substr (1-indexed).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing instr
Example
>>> df.select(F.instr(F.col("text"), "spark"))
- static locate(substr, column, pos=1)[source]
Locate the position of substr starting from pos (1-indexed).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing locate
Example
>>> df.select(F.locate("spark", F.col("text"), 1))
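Both instr and locate report 1-indexed positions, and their argument orders differ. A minimal sketch, assuming the PySpark convention that 0 means "not found"; the helper names are hypothetical.

```python
def instr_sketch(text: str, substr: str) -> int:
    """1-indexed position of the first occurrence of substr, 0 if absent."""
    return text.find(substr) + 1


def locate_sketch(substr: str, text: str, pos: int = 1) -> int:
    """Like instr_sketch, but the search starts at 1-indexed position pos."""
    if pos < 1:
        return 0
    return text.find(substr, pos - 1) + 1


print(instr_sketch("apache spark", "spark"))
print(locate_sketch("a", "banana", 3))
```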
- static lpad(column, len, pad)[source]
Left-pad string column to length len with pad string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing lpad
Example
>>> df.select(F.lpad(F.col("id"), 5, "0"))
- static rpad(column, len, pad)[source]
Right-pad string column to length len with pad string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing rpad
Example
>>> df.select(F.rpad(F.col("id"), 5, "0"))
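The padding rules for lpad and rpad can be sketched as follows. One assumption taken from PySpark: a string longer than `len` is truncated to `len` characters. The helper names are hypothetical.

```python
def lpad_sketch(text: str, length: int, pad: str) -> str:
    """Left-pad text to `length`, repeating `pad` as needed."""
    if len(text) >= length or not pad:
        return text[:length]
    fill = (pad * length)[: length - len(text)]
    return fill + text


def rpad_sketch(text: str, length: int, pad: str) -> str:
    """Right-pad text to `length`, repeating `pad` as needed."""
    if len(text) >= length or not pad:
        return text[:length]
    fill = (pad * length)[: length - len(text)]
    return text + fill


print(lpad_sketch("42", 5, "0"), rpad_sketch("42", 5, "0"))
```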
- static levenshtein(left, right)[source]
Compute Levenshtein distance between two strings.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing levenshtein
Example
>>> df.select(F.levenshtein(F.col("word1"), F.col("word2")))
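For reference, Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The sketch below is the textbook dynamic-programming version, not necessarily how Sparkless computes it.

```python
def levenshtein_sketch(left: str, right: str) -> int:
    """Edit distance computed row by row."""
    previous = list(range(len(right) + 1))
    for i, left_char in enumerate(left, start=1):
        current = [i]
        for j, right_char in enumerate(right, start=1):
            cost = 0 if left_char == right_char else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein_sketch("kitten", "sitting"))
```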
- static overlay(src, replace, pos, len=-1)[source]
Replace part of a string with another string starting at a position (PySpark 3.0+).
- Parameters:
- Return type:
- Returns:
ColumnOperation for overlay operation
Example
>>> df.select(F.overlay(F.col("text"), F.lit("NEW"), F.lit(5), F.lit(3)))
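A per-string sketch of overlay, assuming PySpark's conventions: `pos` is 1-indexed, and a negative `len` means "replace as many characters as the replacement is long". `overlay_sketch` is a hypothetical helper.

```python
def overlay_sketch(src: str, replace: str, pos: int, length: int = -1) -> str:
    """Replace `length` characters of src starting at 1-indexed `pos` with `replace`."""
    if length < 0:
        length = len(replace)
    start = pos - 1
    return src[:start] + replace + src[start + length:]


print(overlay_sketch("Spark SQL", "CORE", 7))
print(overlay_sketch("Spark SQL", "ANSI ", 7, 0))
```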
- static bin(column)[source]
Convert to binary string representation.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing bin
- static hex(column)[source]
Convert to hexadecimal string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing hex
- static unhex(column)[source]
Convert hex string to binary.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing unhex
- static hash(*cols)[source]
Compute hash value of given columns.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing hash
- static xxhash64(*cols)[source]
Compute xxHash64 value of given columns (all PySpark versions).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing xxhash64
- static encode(column, charset)[source]
Encode string to binary using charset.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing encode
- static decode(column, charset)[source]
Decode binary to string using charset.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing decode
- static conv(column, from_base, to_base)[source]
Convert number from one base to another.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing conv
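Base conversion on string-encoded numbers can be sketched like this. The sketch handles non-negative values only and emits uppercase digits as Spark does; both choices are assumptions about Sparkless, and the names are hypothetical.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def conv_sketch(number: str, from_base: int, to_base: int) -> str:
    """Convert a non-negative number string between bases 2..36."""
    value = int(number, from_base)
    if value == 0:
        return "0"
    out = []
    while value:
        value, remainder = divmod(value, to_base)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


print(conv_sketch("100", 2, 10), conv_sketch("255", 10, 16))
```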
- static md5(column)[source]
Calculate MD5 hash of string (PySpark 3.0+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing md5 function (returns 32-char hex string)
Example
>>> df.select(F.md5(F.col("text")))
- static sha1(column)[source]
Calculate SHA-1 hash of string (PySpark 3.0+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing sha1 function (returns 40-char hex string)
Example
>>> df.select(F.sha1(F.col("text")))
- static sha2(column, numBits)[source]
Calculate SHA-2 family hash (PySpark 3.0+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing sha2 function (returns hex string)
Example
>>> df.select(F.sha2(F.col("text"), 256))
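The digest lengths quoted for md5, sha1, and sha2 above can be checked with `hashlib`. This sketch assumes UTF-8 input and the four SHA-2 widths PySpark accepts (224, 256, 384, 512); the helper names are hypothetical.

```python
import hashlib


def md5_sketch(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def sha1_sketch(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def sha2_sketch(text: str, num_bits: int) -> str:
    if num_bits not in (224, 256, 384, 512):
        raise ValueError(f"unsupported numBits: {num_bits}")
    return hashlib.new(f"sha{num_bits}", text.encode("utf-8")).hexdigest()


for digest in (md5_sketch("text"), sha1_sketch("text"), sha2_sketch("text", 256)):
    print(len(digest), digest)
```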
- static crc32(column)[source]
Calculate CRC32 checksum (PySpark 3.0+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing crc32 function (returns signed 32-bit int)
Example
>>> df.select(F.crc32(F.col("text")))
- static to_str(column)[source]
Convert column to string representation (all PySpark versions).
- Parameters:
- Return type:
- Returns:
Column operation for string conversion
Example
>>> df.select(F.to_str(F.col("value")))
- static ilike(column, pattern)[source]
Case-insensitive LIKE pattern matching.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the ilike function.
- static find_in_set(column, str_list)[source]
Find position of value in comma-separated string list.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the find_in_set function.
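A plain-Python view of find_in_set, assuming the Spark SQL convention of 1-indexed positions, with 0 for "not found" or for a search value that itself contains a comma. The helper name is hypothetical.

```python
def find_in_set_sketch(value: str, str_list: str) -> int:
    """1-indexed position of value in a comma-separated list, 0 if absent."""
    if "," in value:
        return 0
    items = str_list.split(",")
    return items.index(value) + 1 if value in items else 0


print(find_in_set_sketch("ab", "abc,b,ab,c,def"))
```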
- static regexp_count(column, pattern)[source]
Count occurrences of regex pattern in string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the regexp_count function.
- static regexp_like(column, pattern)[source]
Regex pattern matching (similar to rlike).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the regexp_like function.
- static regexp_substr(column, pattern, pos=1, occurrence=1)[source]
Extract substring matching regex pattern.
- static regexp_instr(column, pattern, pos=1, occurrence=1)[source]
Find position of regex pattern match.
- static regexp(column, pattern)[source]
Alias for rlike - regex pattern matching.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the regexp function.
- static printf(format_str, *columns)[source]
Formatted string (like sprintf).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the printf function.
- static to_char(column, format=None)[source]
Convert number/date to character string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the to_char function.
- static to_varchar(column, length=None)[source]
Convert to varchar type.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the to_varchar function.
- static typeof(column)[source]
Get type of value as string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the typeof function.
- static stack(n, *cols)[source]
Stack multiple columns into rows.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the stack function.
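How stack lays values out is easier to see on literals. The sketch assumes Spark SQL's behavior: the values are split into `n` rows of equal width, and missing trailing cells are null. `stack_sketch` is a hypothetical helper.

```python
def stack_sketch(n: int, *values):
    """Arrange values into n rows, padding the last row with None."""
    width = -(-len(values) // n)  # ceiling division
    padded = list(values) + [None] * (n * width - len(values))
    return [tuple(padded[i * width:(i + 1) * width]) for i in range(n)]


print(stack_sketch(2, 1, 2, 3, 4))
print(stack_sketch(2, 1, 2, 3))
```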
- static sha(column)[source]
Alias for sha1 - Calculate SHA-1 hash of string (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing sha function (returns 40-char hex string).
Example
>>> df.select(F.sha(F.col("text")))
- static mask(column, upperChar=None, lowerChar=None, digitChar=None, otherChar=None)[source]
Mask sensitive data in a string (PySpark 3.5+).
- Parameters:
upperChar (Optional[str]) – Character to use for uppercase letters (default: ‘X’).
lowerChar (Optional[str]) – Character to use for lowercase letters (default: ‘x’).
digitChar (Optional[str]) – Character to use for digits (default: ‘n’).
otherChar (Optional[str]) – Character to use for other characters (default: ‘-’).
- Return type:
- Returns:
ColumnOperation representing the mask function.
Example
>>> df.select(F.mask(F.col("email"), upperChar='U', lowerChar='l', digitChar='d'))
- static json_array_length(column, path=None)[source]
Get the length of a JSON array (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the json_array_length function.
Example
>>> df.select(F.json_array_length(F.col("json_col"), "$.array"))
- static json_object_keys(column, path=None)[source]
Get the keys of a JSON object (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the json_object_keys function.
Example
>>> df.select(F.json_object_keys(F.col("json_col"), "$.object"))
- static xpath_number(column, path)[source]
Extract number from XML using XPath (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the xpath_number function.
Example
>>> df.select(F.xpath_number(F.col("xml_col"), "/root/value"))
Math Functions
Mathematical functions for Sparkless.
This module provides comprehensive mathematical functions that match PySpark’s math function API. Includes arithmetic operations, rounding functions, trigonometric functions, and mathematical transformations for numerical processing in DataFrames.
- Key Features:
Complete PySpark math function API compatibility
Arithmetic operations (abs, round, ceil, floor)
Advanced math functions (sqrt, exp, log, pow)
Trigonometric functions (sin, cos, tan)
Type-safe operations with proper return types
Support for both column references and numeric literals
Proper handling of edge cases and null values
Example
>>> from sparkless.sql import SparkSession, functions as F
>>> spark = SparkSession("test")
>>> data = [{"value": 3.7, "angle": 1.57}]
>>> df = spark.createDataFrame(data)
>>> df.select(
... F.round(F.col("value"), 1),
... F.ceil(F.col("value")),
... F.sin(F.col("angle"))
... ).show()
DataFrame[1 rows, 3 columns]
round(value, 1) CEIL(value) SIN(angle)
3.7 4.0 0.9999996829318346
- class sparkless.functions.math.MathFunctions[source]
Bases: object
Collection of mathematical functions.
- static abs(column)[source]
Get absolute value.
- static positive(column)[source]
Return positive value (identity function).
- static negative(column)[source]
Return negative value.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the negative function.
- static round(column, scale=0)[source]
Round to specified number of decimal places.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the round function.
- static ceil(column)[source]
Round up to nearest integer.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the ceil function.
- static ceiling(column)[source]
Alias for ceil - Round up to nearest integer.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the ceiling function.
- static floor(column)[source]
Round down to nearest integer.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the floor function.
- static sqrt(column)[source]
Get square root.
- static exp(column)[source]
Get exponential (e^x).
- static log(base, column=None)[source]
Get logarithm.
PySpark signature: log(base, column) or log(column) for natural log.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the log function.
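The dual signature means the first positional argument is either the base or the value. A scalar sketch of that dispatch, with a hypothetical helper name, using `math.log`:

```python
import math


def log_sketch(arg1: float, arg2: float = None) -> float:
    """log_sketch(x) is the natural log; log_sketch(base, x) is log of x in that base."""
    if arg2 is None:
        return math.log(arg1)
    return math.log(arg2, arg1)


print(log_sketch(math.e))
print(log_sketch(10.0, 1000.0))
```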
- static log10(column)[source]
Get base-10 logarithm (PySpark 3.0+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the log10 function.
Example
>>> df.select(F.log10(F.col("value")))
- static log2(column)[source]
Get base-2 logarithm (PySpark 3.0+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the log2 function.
Example
>>> df.select(F.log2(F.col("value")))
- static log1p(column)[source]
Get natural logarithm of (1 + x) (PySpark 3.0+).
Computes ln(1 + x) accurately for small values of x.
- Parameters:
column (Union[Column, str]) – The column to compute log1p of.
- Return type:
- Returns:
ColumnOperation representing the log1p function.
Example
>>> df.select(F.log1p(F.col("value")))
- static expm1(column)[source]
Get exp(x) - 1 (PySpark 3.0+).
Computes e^x - 1 accurately for small values of x.
- Parameters:
column (Union[Column, str]) – The column to compute expm1 of.
- Return type:
- Returns:
ColumnOperation representing the expm1 function.
Example
>>> df.select(F.expm1(F.col("value")))
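The "accurately for small values" remark for log1p and expm1 can be demonstrated with the `math` module. For tiny x, forming 1 + x first throws away most of x's digits, so the naive formulas lose precision.

```python
import math

x = 1e-15

naive_log = math.log(1 + x)      # 1 + x is rounded before the log is taken
accurate_log = math.log1p(x)     # computed without forming 1 + x

naive_exp = math.exp(x) - 1      # exp(x) is rounded to a value near 1 first
accurate_exp = math.expm1(x)

print(naive_log, accurate_log)
print(naive_exp, accurate_exp)
```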
- static sin(column)[source]
Get sine.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the sin function.
- static cos(column)[source]
Get cosine.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the cos function.
- static tan(column)[source]
Get tangent.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the tan function.
- static sign(column)[source]
Get sign of number (-1, 0, or 1).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the sign function.
- static greatest(*columns)[source]
Get the greatest value among columns.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the greatest function.
- static least(*columns)[source]
Get the least value among columns.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the least function.
- static acosh(col)[source]
Compute inverse hyperbolic cosine (arc hyperbolic cosine).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the acosh function.
Note
Input must be >= 1. Returns NaN for invalid inputs.
- static asinh(col)[source]
Compute inverse hyperbolic sine (arc hyperbolic sine).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the asinh function.
- static atanh(col)[source]
Compute inverse hyperbolic tangent (arc hyperbolic tangent).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the atanh function.
Note
Input must be in range (-1, 1). Returns NaN for invalid inputs.
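The domain notes for acosh and atanh can be mirrored in scalar code. Python's `math` functions raise `ValueError` outside the domain, so this sketch adds an explicit guard to return NaN as the notes describe; the helper names are hypothetical.

```python
import math


def acosh_sketch(x: float) -> float:
    """Inverse hyperbolic cosine; NaN when x < 1."""
    return math.acosh(x) if x >= 1 else math.nan


def atanh_sketch(x: float) -> float:
    """Inverse hyperbolic tangent; NaN outside the open interval (-1, 1)."""
    return math.atanh(x) if -1 < x < 1 else math.nan


print(acosh_sketch(1.0), acosh_sketch(0.5))
print(atanh_sketch(0.0), atanh_sketch(1.0))
```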
- static acos(col)[source]
Compute inverse cosine (arc cosine).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the acos function.
- static asin(col)[source]
Compute inverse sine (arc sine).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the asin function.
- static atan(col)[source]
Compute inverse tangent (arc tangent).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the atan function.
- static atan2(y, x)[source]
Compute 2-argument arctangent (PySpark 3.0+).
Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the atan2 function.
Example
>>> df.select(F.atan2(F.col("y"), F.col("x")))
>>> df.select(F.atan2(F.lit(1.0), F.lit(1.0)))  # Returns π/4
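The rectangular-to-polar conversion mentioned above, as a scalar sketch with `math`. Note the argument order (y first), and that atan2 keeps the quadrant, which `atan(y / x)` cannot do. `to_polar` is a hypothetical helper.

```python
import math


def to_polar(x: float, y: float):
    """Return (r, theta) for the point (x, y); theta is in radians."""
    return math.hypot(x, y), math.atan2(y, x)


print(to_polar(1.0, 1.0))
print(to_polar(-1.0, 1.0))
print(math.atan(1.0 / -1.0))  # quadrant information is lost here
```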
- static cosh(col)[source]
Compute hyperbolic cosine.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the cosh function.
- static sinh(col)[source]
Compute hyperbolic sine.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the sinh function.
- static tanh(col)[source]
Compute hyperbolic tangent.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the tanh function.
- static degrees(col)[source]
Convert radians to degrees.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the degrees function.
- static radians(col)[source]
Convert degrees to radians.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the radians function.
- static cbrt(col)[source]
Compute cube root.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the cbrt function.
- static factorial(col)[source]
Compute factorial.
- static rand(seed=None)[source]
Generate a random column with i.i.d. samples from U[0.0, 1.0].
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the rand function.
- static randn(seed=None)[source]
Generate a random column with i.i.d. samples from standard normal distribution.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the randn function.
- static rint(col)[source]
Round to nearest integer using banker’s rounding (half to even).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the rint function.
- static bround(col, scale=0)[source]
Round using HALF_EVEN rounding mode (banker’s rounding).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the bround function.
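rint and bround both round ties to the nearest even digit, whereas PySpark documents round as HALF_UP. The difference only shows on exact ties, which the `decimal` module makes easy to see. The helper names are hypothetical, and values are passed as strings to avoid binary floating-point noise.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def _quantum(scale: int) -> Decimal:
    # Decimal("1"), Decimal("0.1"), Decimal("0.01"), ... for scale 0, 1, 2, ...
    return Decimal(1).scaleb(-scale)


def bround_sketch(value: str, scale: int = 0) -> Decimal:
    """Banker's rounding: ties go to the even neighbour."""
    return Decimal(value).quantize(_quantum(scale), rounding=ROUND_HALF_EVEN)


def round_half_up_sketch(value: str, scale: int = 0) -> Decimal:
    """Ties go away from zero."""
    return Decimal(value).quantize(_quantum(scale), rounding=ROUND_HALF_UP)


for tie in ("0.5", "1.5", "2.5", "3.5"):
    print(tie, bround_sketch(tie), round_half_up_sketch(tie))
```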
- static hypot(col1, col2)[source]
Compute sqrt(col1^2 + col2^2) (hypotenuse).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the hypot function.
- static signum(col)[source]
Compute the signum function (sign: -1, 0, or 1).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the signum function.
- static cot(col)[source]
Compute cotangent (PySpark 3.3+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the cot function.
- static csc(col)[source]
Compute cosecant (PySpark 3.3+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the csc function.
- static sec(col)[source]
Compute secant (PySpark 3.3+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the sec function.
- static e()[source]
Return Euler’s number e (PySpark 3.5+).
- Return type:
- Returns:
ColumnOperation representing Euler’s number constant.
- static pi()[source]
Return the value of pi (PySpark 3.5+).
- Return type:
- Returns:
ColumnOperation representing pi constant.
- static ln(col)[source]
Compute natural logarithm (alias for log) (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the ln function.
- static toDegrees(column)[source]
Deprecated alias for degrees (all PySpark versions).
Use degrees instead.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the degrees conversion.
- static toRadians(column)[source]
Deprecated alias for radians (all PySpark versions).
Use radians instead.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the radians conversion.
- static negate(column)[source]
Negate value (alias for negative).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the negate function.
- static getbit(column, bit)[source]
Get bit at specified position (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the getbit function.
Example
>>> df.select(F.getbit(F.col("value"), 3))
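On a single integer, reading a bit is a shift and a mask. The sketch assumes positions are counted from the least significant bit starting at 0, as in Spark SQL; `getbit_sketch` is a hypothetical helper.

```python
def getbit_sketch(value: int, bit: int) -> int:
    """Return the bit (0 or 1) at position `bit`, counting from the right."""
    return (value >> bit) & 1


# 11 is 0b1011
print([getbit_sketch(11, position) for position in range(4)])
```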
DateTime Functions
Datetime functions for Sparkless.
This module provides comprehensive datetime functions that match PySpark’s datetime function API. Includes date/time conversion, extraction, and manipulation operations for temporal data processing in DataFrames.
- Key Features:
Complete PySpark datetime function API compatibility
Current date/time functions (current_timestamp, current_date)
Date conversion (to_date, to_timestamp)
Date extraction (year, month, day, hour, minute, second)
Date manipulation (dayofweek, dayofyear, weekofyear, quarter)
Type-safe operations with proper return types
Support for various date formats and time zones
Proper handling of date parsing and validation
Example
>>> from sparkless.sql import SparkSession, functions as F
>>> spark = SparkSession("test")
>>> data = [{"timestamp": "2024-01-15 10:30:00", "date_str": "2024-01-15"}]
>>> df = spark.createDataFrame(data)
>>> df.select(
... F.year(F.col("timestamp")),
... F.month(F.col("timestamp")),
... F.to_date(F.col("date_str"))
... ).show()
DataFrame[1 rows, 3 columns]
year(timestamp) month(timestamp) to_date(date_str)
2024 1 2024-01-15
- class sparkless.functions.datetime.DateTimeFunctions[source]
Bases: object
Collection of datetime functions.
- static current_timestamp()[source]
Get current timestamp.
- Return type:
- Returns:
ColumnOperation representing the current_timestamp function.
- Raises:
RuntimeError – If no active SparkSession is available
- static current_date()[source]
Get current date.
- Return type:
- Returns:
ColumnOperation representing the current_date function.
- Raises:
RuntimeError – If no active SparkSession is available
- static now()[source]
Alias for current_timestamp - Get current timestamp.
- Return type:
- Returns:
ColumnOperation representing the now function.
- static curdate()[source]
Alias for current_date - Get current date.
- Return type:
- Returns:
ColumnOperation representing the curdate function.
- static days(column)[source]
Convert number to days interval.
- static hours(column)[source]
Convert number to hours interval.
- static months(column)[source]
Convert number to months interval.
- static years(column)[source]
Convert number to years interval.
- static localtimestamp()[source]
Get local timestamp (without timezone).
- Return type:
- Returns:
ColumnOperation representing the localtimestamp function.
- static datepart(date_part, date)[source]
SQL Server style date part extraction.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the datepart function.
- static make_timestamp(year, month, day, hour=0, minute=0, second=0)[source]
Create timestamp from components.
- static make_timestamp_ltz(year, month, day, hour=0, minute=0, second=0, timezone=None)[source]
Create timestamp with local timezone.
- static make_timestamp_ntz(year, month, day, hour=0, minute=0, second=0)[source]
Create timestamp with no timezone.
- static make_interval(years=0, months=0, weeks=0, days=0, hours=0, mins=0, secs=0)[source]
Create interval from components.
- Parameters:
years (Union[Column, str, int]) – Years component (default 0).
months (Union[Column, str, int]) – Months component (default 0).
weeks (Union[Column, str, int]) – Weeks component (default 0).
days (Union[Column, str, int]) – Days component (default 0).
hours (Union[Column, str, int]) – Hours component (default 0).
mins (Union[Column, str, int]) – Minutes component (default 0).
secs (Union[Column, str, int]) – Seconds component (default 0).
- Return type:
- Returns:
ColumnOperation representing the make_interval function.
- static make_dt_interval(days=0, hours=0, mins=0, secs=0)[source]
Create day-time interval.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the make_dt_interval function.
- static to_number(column, format=None)[source]
Convert string to number.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the to_number function.
- static to_binary(column, format=None)[source]
Convert to binary format.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the to_binary function.
- static to_unix_timestamp(column, format=None)[source]
Convert to unix timestamp.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the to_unix_timestamp function.
- static unix_date(column)[source]
Convert a date to its unix date (the number of days since 1970-01-01).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the unix_date function.
- static unix_seconds(column)[source]
Convert timestamp to unix seconds.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the unix_seconds function.
- static unix_millis(column)[source]
Convert timestamp to unix milliseconds.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the unix_millis function.
- static unix_micros(column)[source]
Convert timestamp to unix microseconds.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the unix_micros function.
- static timestamp_millis(column)[source]
Create timestamp from unix milliseconds.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the timestamp_millis function.
- static timestamp_micros(column)[source]
Create timestamp from unix microseconds.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the timestamp_micros function.
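The unix_* and timestamp_* functions above are inverse conversions at different resolutions. A scalar sketch with `datetime`, assuming UTC timestamps and integer arithmetic so that no precision is lost; the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def unix_micros_sketch(ts: datetime) -> int:
    delta = ts - EPOCH
    return (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds


def unix_millis_sketch(ts: datetime) -> int:
    return unix_micros_sketch(ts) // 1_000


def unix_seconds_sketch(ts: datetime) -> int:
    return unix_micros_sketch(ts) // 1_000_000


def timestamp_millis_sketch(millis: int) -> datetime:
    return EPOCH + timedelta(milliseconds=millis)


def timestamp_micros_sketch(micros: int) -> datetime:
    return EPOCH + timedelta(microseconds=micros)


ts = datetime(2024, 1, 15, 10, 30, 0, 123456, tzinfo=timezone.utc)
print(unix_seconds_sketch(ts), unix_millis_sketch(ts), unix_micros_sketch(ts))
```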
- static to_date(column, format=None)[source]
Convert string, timestamp, or date to date.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the to_date function.
- Raises:
TypeError – If input column type is not StringType, TimestampType, or DateType
- static to_timestamp(column, format=None)[source]
Convert to timestamp.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the to_timestamp function.
- Raises:
TypeError – If input column type is not one of the supported types.
- static hour(column)[source]
Extract hour from timestamp.
- static day(column)[source]
Extract day from date/timestamp.
- static dayofmonth(column)[source]
Extract day of month from date/timestamp (alias for day).
- static month(column)[source]
Extract month from date/timestamp.
- static year(column)[source]
Extract year from date/timestamp.
- static dayofweek(column)[source]
Extract day of week from date/timestamp.
- static dayofyear(column)[source]
Extract day of year from date/timestamp.
- static weekofyear(column)[source]
Extract week of year from date/timestamp.
- static quarter(column)[source]
Extract quarter from date/timestamp.
- static minute(column)[source]
Extract minute from timestamp.
- static second(column)[source]
Extract second from timestamp.
- static add_months(column, num_months)[source]
Add months to date/timestamp.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the add_months function.
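Month arithmetic has one subtle case: the target month may be shorter than the source day. The sketch below clamps to the last day of the target month, which is how current PySpark versions behave; whether Sparkless does the same is an assumption, and `add_months_sketch` is a hypothetical helper.

```python
import calendar
from datetime import date


def add_months_sketch(d: date, num_months: int) -> date:
    """Shift d by num_months, clamping the day to the target month's length."""
    total = d.year * 12 + (d.month - 1) + num_months
    year, month_index = divmod(total, 12)
    last_day = calendar.monthrange(year, month_index + 1)[1]
    return date(year, month_index + 1, min(d.day, last_day))


print(add_months_sketch(date(2024, 1, 31), 1))
```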
- static months_between(column1, column2)[source]
Calculate months between two dates.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the months_between function.
- static date_add(column, days)[source]
Add days to date.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the date_add function.
- static date_sub(column, days)[source]
Subtract days from date.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the date_sub function.
- static date_format(column, format)[source]
Format date/timestamp as string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the date_format function.
- static from_unixtime(column, format='yyyy-MM-dd HH:mm:ss')[source]
Convert unix timestamp to string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the from_unixtime function.
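The default pattern `yyyy-MM-dd HH:mm:ss` is a Java-style pattern; its `strftime` equivalent is `%Y-%m-%d %H:%M:%S`. A scalar sketch, assuming UTC (Spark itself uses the session time zone); the helper name is hypothetical.

```python
from datetime import datetime, timezone


def from_unixtime_sketch(seconds: int, py_format: str = "%Y-%m-%d %H:%M:%S") -> str:
    """Format seconds since the epoch as a UTC timestamp string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(py_format)


print(from_unixtime_sketch(0))
print(from_unixtime_sketch(1705314600))
```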
- static timestampadd(unit, quantity, timestamp)[source]
Add time units to a timestamp.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the timestampadd function.
Example
>>> df.select(F.timestampadd("DAY", 7, F.col("created_at")))
>>> df.select(F.timestampadd("HOUR", F.col("offset"), "2024-01-01"))
- static timestampdiff(unit, start, end)[source]
Calculate difference between two timestamps.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the timestampdiff function.
Example
>>> df.select(F.timestampdiff("DAY", F.col("start_date"), F.col("end_date")))
>>> df.select(F.timestampdiff("HOUR", "2024-01-01", F.col("end_time")))
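For fixed-length units, timestampadd and timestampdiff reduce to `timedelta` arithmetic. This sketch covers only SECOND through WEEK and truncates partial units toward zero; the unit table and helper names are illustrative assumptions, not Sparkless internals.

```python
from datetime import datetime, timedelta

UNIT_SECONDS = {"SECOND": 1, "MINUTE": 60, "HOUR": 3600, "DAY": 86400, "WEEK": 604800}


def timestampadd_sketch(unit: str, quantity: int, ts: datetime) -> datetime:
    """Add `quantity` units to ts."""
    return ts + timedelta(seconds=quantity * UNIT_SECONDS[unit.upper()])


def timestampdiff_sketch(unit: str, start: datetime, end: datetime) -> int:
    """Whole units elapsed from start to end."""
    seconds = (end - start).total_seconds()
    return int(seconds / UNIT_SECONDS[unit.upper()])


print(timestampadd_sketch("DAY", 7, datetime(2024, 1, 1)))
print(timestampdiff_sketch("HOUR", datetime(2024, 1, 1), datetime(2024, 1, 2, 12)))
```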
- static convert_timezone(sourceTz, targetTz, sourceTs)[source]
Convert timestamp from source to target timezone.
- Parameters:
- Return type:
- static current_timezone()[source]
Get current timezone.
- Raises:
RuntimeError – If no active SparkSession is available
- Return type:
- static from_utc_timestamp(ts, tz)[source]
Convert UTC timestamp to given timezone.
- Parameters:
- Return type:
- static to_utc_timestamp(ts, tz)[source]
Convert timestamp from given timezone to UTC.
- Parameters:
- Return type:
- static date_part(field, source)[source]
Extract a field from a date/timestamp.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the date_part function.
Example
>>> df.select(F.date_part("YEAR", F.col("date")))
- static dayname(date)[source]
Get the name of the day of the week.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the dayname function.
Example
>>> df.select(F.dayname(F.col("date")))
- static make_date(year, month, day)[source]
Construct a date from year, month, day integers (PySpark 3.0+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the make_date function
Example
>>> df.select(F.make_date(F.lit(2024), F.lit(3), F.lit(15)))
- static date_trunc(format, timestamp)[source]
Truncate timestamp to specified unit (year, month, day, hour, etc.).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the date_trunc function
Example
>>> df.select(F.date_trunc('month', F.col('timestamp')))
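Truncation zeroes every field below the chosen unit. A scalar sketch for a handful of units; the full list Sparkless accepts is not stated here, so the set below is an assumption, and `date_trunc_sketch` is a hypothetical helper.

```python
from datetime import datetime


def date_trunc_sketch(unit: str, ts: datetime) -> datetime:
    """Reset all fields smaller than `unit` to their minimum."""
    unit = unit.lower()
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "year":
        return midnight.replace(month=1, day=1)
    if unit == "month":
        return midnight.replace(day=1)
    if unit == "day":
        return midnight
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if unit == "minute":
        return ts.replace(second=0, microsecond=0)
    raise ValueError(f"unsupported unit in this sketch: {unit}")


print(date_trunc_sketch("month", datetime(2024, 3, 15, 10, 30, 45)))
```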
- static datediff(end, start)[source]
Returns number of days between two dates.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the datediff function
Example
>>> df.select(F.datediff(F.col('end_date'), F.lit('2024-01-01')))
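The argument order (end first, then start) determines the sign of the result. A scalar sketch with a hypothetical helper name:

```python
from datetime import date


def datediff_sketch(end: date, start: date) -> int:
    """Days from start to end; negative when end precedes start."""
    return (end - start).days


print(datediff_sketch(date(2024, 1, 15), date(2024, 1, 1)))
```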
- static date_diff(end, start)[source]
Alias for datediff - Returns number of days between two dates.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the date_diff function
Example
>>> df.select(F.date_diff(F.col('end_date'), F.col('start_date')))
- static unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')[source]
Convert timestamp string to Unix timestamp (seconds since epoch).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the unix_timestamp function
Example
>>> df.select(F.unix_timestamp(F.col('timestamp'), 'yyyy-MM-dd'))
- static last_day(date)[source]
Returns the last day of the month for a given date.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the last_day function
Example
>>> df.select(F.last_day(F.col('date')))
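For a single date, the same result can be derived with `calendar.monthrange`, which accounts for leap years. `last_day_sketch` is a hypothetical helper.

```python
import calendar
from datetime import date


def last_day_sketch(d: date) -> date:
    """Last calendar day of d's month."""
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])


print(last_day_sketch(date(2024, 2, 10)))
```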
- static next_day(date, dayOfWeek)[source]
Return the first date after the given date that falls on the specified day of the week.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the next_day function
Example
>>> df.select(F.next_day(F.col('date'), 'Monday'))
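"Later than" is strict: when the input already falls on the requested weekday, the result is one week on. A scalar sketch that accepts day names of three or more letters (an assumption about accepted spellings); the names are hypothetical.

```python
from datetime import date, timedelta

DAY_INDEX = {"mon": 0, "tue": 1, "wed": 2, "thu": 3, "fri": 4, "sat": 5, "sun": 6}


def next_day_sketch(d: date, day_of_week: str) -> date:
    """First date strictly after d that falls on day_of_week."""
    target = DAY_INDEX[day_of_week[:3].lower()]
    days_ahead = (target - d.weekday() - 1) % 7 + 1  # always in 1..7
    return d + timedelta(days=days_ahead)


# 2024-01-15 is a Monday
print(next_day_sketch(date(2024, 1, 15), "Monday"))
print(next_day_sketch(date(2024, 1, 15), "Wednesday"))
```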
- static trunc(date, format)[source]
Truncate date to specified unit (year, month, etc.).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the trunc function
Example
>>> df.select(F.trunc(F.col('date'), 'year'))
- static timestamp_seconds(col)[source]
Convert seconds since epoch to timestamp (PySpark 3.1+).
- Parameters:
col (Union[Column, str, int]) – Column or integer representing seconds since epoch
- Return type:
- Returns:
ColumnOperation representing the timestamp
Example
>>> df.select(F.timestamp_seconds(F.col("seconds")))
- static weekday(col)[source]
Get the day of week as an integer (0 = Monday, 6 = Sunday) (PySpark 3.5+).
- Parameters:
col (Union[Column, str]) – Column or column name containing date/timestamp values.
- Return type:
- Returns:
ColumnOperation representing the weekday function.
Note
Returns 0 for Monday through 6 for Sunday.
- static extract(field, source)[source]
Extract a field from a date/timestamp column (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the extract function.
Example
>>> df.select(F.extract("YEAR", F.col("date")))
>>> df.select(F.extract("MONTH", F.col("timestamp")))
- static date_from_unix_date(days)[source]
Convert unix date (days since epoch) to date (PySpark 3.5+).
- Parameters:
days (Union[Column, str, int]) – Column or integer representing days since epoch (1970-01-01).
- Return type:
- Returns:
ColumnOperation representing the date_from_unix_date function.
Example
>>> df.select(F.date_from_unix_date(F.col("days")))
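The two epoch-based conversions documented above (timestamp_seconds and date_from_unix_date) reduce to simple offsets from 1970-01-01. The sketch below shows the arithmetic with the standard library. The helper names are hypothetical, and rendering the timestamp in UTC is an assumption of this sketch rather than documented Sparkless behavior.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)


def date_from_days(days: int) -> date:
    """Days since the epoch -> calendar date (negative values go before 1970)."""
    return EPOCH_DATE + timedelta(days=days)


def timestamp_from_seconds(seconds: int) -> datetime:
    """Seconds since the epoch -> timezone-aware UTC timestamp."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(date_from_days(365))            # 1971-01-01
print(timestamp_from_seconds(86400))  # 1970-01-02 00:00:00+00:00
```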
- static to_timestamp_ltz(timestamp_str, format=None)[source]
Convert string to timestamp with local timezone (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the to_timestamp_ltz function.
Example
>>> df.select(F.to_timestamp_ltz(F.col("ts_str"), "yyyy-MM-dd HH:mm:ss"))
Array Functions
Array functions for Sparkless.
This module provides comprehensive array manipulation functions that match PySpark’s array function API. Includes array operations like distinct, intersect, union, except, and element operations for working with array columns in DataFrames.
- Key Features:
Complete PySpark array function API compatibility
Array set operations (distinct, intersect, union, except)
Element operations (position, remove)
Type-safe operations with proper return types
Support for both column references and array literals
Example
>>> from sparkless.sql import SparkSession, functions as F
>>> spark = SparkSession("test")
>>> data = [{"tags": ["a", "b", "c", "a"]}, {"tags": ["d", "e", "f"]}]
>>> df = spark.createDataFrame(data)
>>> df.select(F.array_distinct(F.col("tags"))).show()
DataFrame[2 rows, 1 columns]
array_distinct(tags)
['a', 'c', 'b']
['e', 'f', 'd']
- class sparkless.functions.array.ArrayFunctions[source]
Bases: object
Collection of array manipulation functions.
- static array_distinct(column)[source]
Remove duplicate elements from an array, preserving original element type.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the array_distinct function.
Example
>>> df.select(F.array_distinct(F.col("tags")))
- static array_intersect(column1, column2)[source]
Return the intersection of two arrays.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the array_intersect function.
Example
>>> df.select(F.array_intersect(F.col("tags1"), F.col("tags2")))
- static array_union(column1, column2)[source]
Return the union of two arrays (with duplicates removed).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the array_union function.
Example
>>> df.select(F.array_union(F.col("tags1"), F.col("tags2")))
- static array_except(column1, column2)[source]
Return elements in first array but not in second.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the array_except function.
Example
>>> df.select(F.array_except(F.col("tags1"), F.col("tags2")))
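The four set-style operations above can be pictured with ordinary Python lists. This is a sketch of the semantics only: it keeps elements in first-occurrence order, which is an assumption (the array_distinct output shown earlier in this section is ordered differently). The function names mirror the documented ones but are local stand-ins, not the Sparkless functions.

```python
def array_distinct(arr):
    # dict preserves insertion order, so duplicates drop out in place
    return list(dict.fromkeys(arr))


def array_union(a, b):
    return array_distinct(a + b)


def array_intersect(a, b):
    return [x for x in array_distinct(a) if x in b]


def array_except(a, b):
    return [x for x in array_distinct(a) if x not in b]


tags1 = ["a", "b", "c", "a"]
tags2 = ["b", "d"]
print(array_union(tags1, tags2))      # ['a', 'b', 'c', 'd']
print(array_intersect(tags1, tags2))  # ['b']
print(array_except(tags1, tags2))     # ['a', 'c']
```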
- static array_position(column, value)[source]
Return the (1-based) index of the first occurrence of value in the array.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the array_position function.
Example
>>> df.select(F.array_position(F.col("tags"), "target"))
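Because the index is 1-based, a result of 0 can signal "not found". PySpark's array_position does this, and the sketch below assumes Sparkless follows suit. The helper is a plain-Python stand-in, not the library function.

```python
def array_position(arr, value):
    """Return the 1-based index of the first match, or 0 if value is absent."""
    for position, item in enumerate(arr, start=1):
        if item == value:
            return position
    return 0


print(array_position(["x", "target", "target"], "target"))  # 2
print(array_position(["x", "y"], "target"))                 # 0
```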
- static array_remove(column, value)[source]
Remove all occurrences of a value from the array.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the array_remove function.
Example
>>> df.select(F.array_remove(F.col("tags"), "unwanted"))
- static transform(column, function)[source]
Apply a function to each element in the array.
This is a higher-order function that transforms each element of an array using the provided lambda function.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the transform function.
Example
>>> df.select(F.transform(F.col("numbers"), lambda x: x * 2))
- static filter(column, function)[source]
Filter array elements based on a predicate function.
This is a higher-order function that filters array elements using the provided lambda function.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the filter function.
Example
>>> df.select(F.filter(F.col("numbers"), lambda x: x > 10))
- static exists(column, function)[source]
Check if any element in the array satisfies the predicate.
This is a higher-order function that returns True if at least one element matches the condition.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the exists function.
Example
>>> df.select(F.exists(F.col("numbers"), lambda x: x > 100))
- static forall(column, function)[source]
Check if all elements in the array satisfy the predicate.
This is a higher-order function that returns True only if all elements match the condition.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the forall function.
Example
>>> df.select(F.forall(F.col("numbers"), lambda x: x > 0))
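For readers new to higher-order array functions, the four documented so far (transform, filter, exists, forall) correspond closely to familiar Python constructs. The sketch below applies the same lambdas to a single list. It illustrates the semantics only and does not involve Sparkless.

```python
numbers = [5, 12, 150]

transformed = [x * 2 for x in numbers]       # transform: map every element
filtered = [x for x in numbers if x > 10]    # filter: keep matching elements
has_large = any(x > 100 for x in numbers)    # exists: at least one match
all_positive = all(x > 0 for x in numbers)   # forall: every element matches

print(transformed, filtered, has_large, all_positive)
```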
- static aggregate(column, initial_value, merge, finish=None)[source]
Reduce array elements to a single value.
This is a higher-order function that aggregates array elements using an accumulator pattern.
- Parameters:
column (Union[Column, str]) – The array column to aggregate.
initial_value (Any) – Starting value for the accumulator.
merge (Callable[[Any, Any], Any]) – Lambda function (acc, x) -> acc that combines accumulator and element.
finish (Optional[Callable[[Any], Any]]) – Optional lambda to transform final accumulator value.
- Return type:
- Returns:
ColumnOperation representing the aggregate function.
Example
>>> df.select(F.aggregate(F.col("nums"), F.lit(0), lambda acc, x: acc + x))
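The accumulator pattern is a left fold followed by an optional finishing step. A minimal plain-Python sketch of that pattern, using functools.reduce on a single list, might look like this. The local aggregate function is a stand-in for illustration, not the Sparkless function.

```python
from functools import reduce


def aggregate(arr, initial_value, merge, finish=None):
    """Fold arr into one value with merge, then apply finish if given."""
    acc = reduce(merge, arr, initial_value)
    return finish(acc) if finish is not None else acc


nums = [1, 2, 3]
print(aggregate(nums, 0, lambda acc, x: acc + x))                       # 6
print(aggregate(nums, 0, lambda acc, x: acc + x, lambda acc: acc * 10))  # 60
```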
- static zip_with(left, right, function)[source]
Merge two arrays element-wise using a function.
This is a higher-order function that combines elements from two arrays using the provided lambda function.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the zip_with function.
Example
>>> df.select(F.zip_with(F.col("arr1"), F.col("arr2"), lambda x, y: x + y))
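One detail worth knowing is what happens when the arrays differ in length. In PySpark the shorter array is padded with nulls before the function is applied. The sketch below assumes Sparkless behaves the same way and models that rule with itertools.zip_longest. The local zip_with is a stand-in, not the library function.

```python
from itertools import zip_longest


def zip_with(left, right, function):
    """Combine two lists element-wise, padding the shorter one with None."""
    return [function(x, y) for x, y in zip_longest(left, right, fillvalue=None)]


print(zip_with([1, 2], [10, 20], lambda x, y: x + y))    # [11, 22]
print(zip_with([1, 2, 3], [10], lambda x, y: (x, y)))    # [(1, 10), (2, None), (3, None)]
```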
- static array_compact(column)[source]
Remove null values from an array.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the array_compact function.
Example
>>> df.select(F.array_compact(F.col("nums")))
- static slice(column, start, length)[source]
Extract array slice starting at position for given length.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the slice function.
Example
>>> df.select(F.slice(F.col("nums"), 2, 3))
- static element_at(column, index)[source]
Get element at index (1-based, negative for reverse indexing).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the element_at function.
Example
>>> df.select(F.element_at(F.col("nums"), 1))   # First element
>>> df.select(F.element_at(F.col("nums"), -1))  # Last element
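The 1-based and negative indexing rules can be sketched in plain Python as follows. Returning None for an out-of-range index and rejecting index 0 are assumptions based on PySpark's default (non-ANSI) behavior. The local element_at is an illustrative stand-in.

```python
def element_at(arr, index):
    """1-based lookup; negative indexes count back from the end of the list."""
    if index == 0:
        raise ValueError("index 0 is invalid: positions are 1-based")
    pos = index - 1 if index > 0 else len(arr) + index
    # Out-of-range positions yield None instead of raising (assumed behavior)
    return arr[pos] if 0 <= pos < len(arr) else None


nums = [10, 20, 30]
print(element_at(nums, 1))   # 10
print(element_at(nums, -1))  # 30
print(element_at(nums, 5))   # None
```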
- static array_append(column, element)[source]
Append element to end of array.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the array_append function.
Example
>>> df.select(F.array_append(F.col("nums"), 10))
- static array_prepend(column, element)[source]
Prepend element to start of array.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the array_prepend function.
Example
>>> df.select(F.array_prepend(F.col("nums"), 0))
- static array_insert(column, pos, value)[source]
Insert element at position in array.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the array_insert function.
Example
>>> df.select(F.array_insert(F.col("nums"), 2, 99))
- static array_size(column)[source]
Get array length.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the array_size function.
Example
>>> df.select(F.array_size(F.col("nums")))
- static array_sort(column)[source]
Sort array elements in ascending order.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the array_sort function.
Example
>>> df.select(F.array_sort(F.col("nums")))
- static array_contains(column, value)[source]
Check if array contains a specific value.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the array_contains function.
Example
>>> df.select(F.array_contains(F.col("tags"), "spark"))
- static array_max(column)[source]
Return maximum value from array.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the array_max function.
Example
>>> df.select(F.array_max(F.col("nums")))
- static array_min(column)[source]
Return minimum value from array.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the array_min function.
Example
>>> df.select(F.array_min(F.col("nums")))
- static explode(column)[source]
Returns a new row for each element in the given array or map.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the explode function.
Example
>>> df.select(F.explode(F.col("tags")))
- static size(column)[source]
Return the size (length) of an array or map.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the size function.
Example
>>> df.select(F.size(F.col("tags")))
- static flatten(column)[source]
Flatten array of arrays into a single array.
- Parameters:
column (Union[Column, str]) – The array column containing nested arrays.
- Return type:
- Returns:
ColumnOperation representing the flatten function.
Example
>>> df.select(F.flatten(F.col("nested_arrays")))
- static reverse(column)[source]
Reverse the elements of an array.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the reverse function.
Example
>>> df.select(F.reverse(F.col("nums")))
- static arrays_overlap(column1, column2)[source]
Check if two arrays have any common elements.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the arrays_overlap function.
Example
>>> df.select(F.arrays_overlap(F.col("arr1"), F.col("arr2")))
- static explode_outer(column)[source]
Returns a new row for each element, including rows with null/empty arrays.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the explode_outer function.
Example
>>> df.select(F.explode_outer(F.col("tags")))
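The difference between explode and explode_outer shows up only for rows whose array is null or empty. The sketch below models rows as dictionaries to make that visible. It is a plain-Python illustration with made-up sample rows, not the Sparkless implementation.

```python
def explode(rows, column):
    """Yield one row per array element; rows with null/empty arrays vanish."""
    for row in rows:
        for item in row.get(column) or []:
            yield {**row, column: item}


def explode_outer(rows, column):
    """Like explode, but keep null/empty rows as a single row with None."""
    for row in rows:
        items = row.get(column)
        if not items:
            yield {**row, column: None}
        else:
            for item in items:
                yield {**row, column: item}


rows = [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": []}, {"id": 3, "tags": None}]
print(len(list(explode(rows, "tags"))))        # 2
print(len(list(explode_outer(rows, "tags"))))  # 4
```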
- static posexplode(column)[source]
Returns a new row for each element with position in array.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the posexplode function.
- static posexplode_outer(column)[source]
Returns a new row for each element with position, including null/empty arrays.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the posexplode_outer function.
- static arrays_zip(*columns)[source]
Merge arrays into array of structs (alias for array_zip).
- static sequence(start, stop, step=1)[source]
Generate array of integers from start to stop by step.
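Unlike Python's range, PySpark's sequence includes the stop value when the step lands on it. The sketch below assumes Sparkless matches that and shows the inclusive behavior. The local sequence function is a stand-in for illustration only.

```python
def sequence(start, stop, step=1):
    """Return integers from start to stop inclusive, advancing by step."""
    if step == 0:
        raise ValueError("step must not be zero")
    # Push the range bound one past stop so that stop itself can be included
    bound = stop + (1 if step > 0 else -1)
    return list(range(start, bound, step))


print(sequence(1, 5))      # [1, 2, 3, 4, 5]
print(sequence(5, 1, -2))  # [5, 3, 1]
```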
- static shuffle(column)[source]
Randomly shuffle array elements.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the shuffle function.
- static array(*cols)[source]
Create array from multiple columns (PySpark 3.0+).
- Parameters:
*cols (Union[Column, str, List[Union[Column, str]]]) – Variable number of columns to combine into array. Supports multiple formats:
  - F.array("Name", "Type") - string column names
  - F.array(["Name", "Type"]) - list of string column names
  - F.array(F.col("Name"), F.col("Type")) - Column objects
  - F.array([F.col("Name"), F.col("Type")]) - list of Column objects
- Return type:
- Returns:
ColumnOperation representing the array function.
Example
>>> df.select(F.array(F.col("a"), F.col("b"), F.col("c")))
>>> df.select(F.array(["a", "b", "c"]))  # List format
>>> df.select(F.array())  # Returns empty array [] (Issue #367)
>>> df.select(F.array([]))  # Returns empty array [] (Issue #367)
- static array_repeat(col, count)[source]
Create array by repeating value N times (PySpark 3.0+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the array_repeat function.
Example
>>> df.select(F.array_repeat(F.col("value"), 3))
- static sort_array(col, asc=True)[source]
Sort array elements (PySpark 3.0+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the sort_array function.
Example
>>> df.select(F.sort_array(F.col("values"), asc=False))
- static array_agg(col)[source]
Aggregate function to collect values into an array (PySpark 3.5+).
- Parameters:
col (Union[Column, str]) – Column to aggregate into an array
- Return type:
- Returns:
AggregateFunction representing the array_agg function.
Example
>>> df.groupBy("dept").agg(F.array_agg("name"))
- static cardinality(col)[source]
Return the size of an array or map (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the cardinality function.
Example
>>> df.select(F.cardinality(F.col("array_col")))
- static inline(col)[source]
Explode array of structs into rows.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the inline function.
Map Functions
Map functions for Sparkless.
This module provides comprehensive map manipulation functions that match PySpark’s map function API. Includes operations for extracting keys, values, entries, and combining maps for working with map columns in DataFrames.
- Key Features:
Complete PySpark map function API compatibility
Key/value extraction (map_keys, map_values)
Entry operations (map_entries)
Map combination (map_concat, map_from_arrays)
Type-safe operations with proper return types
Support for both column references and map literals
Example
>>> from sparkless.sql import SparkSession, functions as F
>>> spark = SparkSession("test")
>>> data = [{"properties": {"key1": "val1", "key2": "val2"}}]
>>> df = spark.createDataFrame(data)
>>> df.select(F.map_keys(F.col("properties"))).show()
DataFrame[1 rows, 1 columns]
map_keys(properties)
['key1', 'key2']
- class sparkless.functions.map.MapFunctions[source]
Bases: object
Collection of map manipulation functions.
- static map_keys(column)[source]
Return an array of all keys in the map.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the map_keys function.
Example
>>> df.select(F.map_keys(F.col("properties")))
- static map_values(column)[source]
Return an array of all values in the map.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the map_values function.
Example
>>> df.select(F.map_values(F.col("properties")))
- static map_entries(column)[source]
Return an array of structs with key-value pairs.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the map_entries function.
Example
>>> df.select(F.map_entries(F.col("properties")))
- static map_concat(*columns)[source]
Concatenate multiple maps into a single map.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the map_concat function.
Example
>>> df.select(F.map_concat(F.col("map1"), F.col("map2"), F.col("map3")))
- static map_from_arrays(keys, values)[source]
Create a map from two arrays (keys and values).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the map_from_arrays function.
Example
>>> df.select(F.map_from_arrays(F.col("keys"), F.col("values")))
- static create_map(*cols)[source]
Create a map from key-value pairs.
- Parameters:
*cols (Union[Column, str, Any]) – Alternating key-value columns/literals. If no arguments are provided, returns an empty map {}.
- Return type:
- Returns:
ColumnOperation representing the create_map function.
Example
>>> df.select(F.create_map(F.col("k1"), F.col("v1"), F.col("k2"), F.col("v2")))
>>> df.select(F.create_map())  # Returns empty map {}
- static map_contains_key(column, key)[source]
Check if map contains a specific key.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the map_contains_key function.
Example
>>> df.select(F.map_contains_key(F.col("map"), "key"))
- static map_from_entries(column)[source]
Convert array of key-value structs to map.
- Parameters:
column (Union[Column, str]) – Array column containing structs with 'key' and 'value' fields.
- Return type:
- Returns:
ColumnOperation representing the map_from_entries function.
Example
>>> df.select(F.map_from_entries(F.col("entries")))
- static map_filter(column, function)[source]
Filter map entries based on key-value predicate.
This is a higher-order function that filters map entries using the provided lambda function.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the map_filter function.
Example
>>> df.select(F.map_filter(F.col("map"), lambda k, v: v > 10))
- static transform_keys(column, function)[source]
Transform map keys using a function.
This is a higher-order function that transforms map keys using the provided lambda function.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the transform_keys function.
Example
>>> df.select(F.transform_keys(F.col("map"), lambda k, v: F.upper(k)))
- static transform_values(column, function)[source]
Transform map values using a function.
This is a higher-order function that transforms map values using the provided lambda function.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the transform_values function.
Example
>>> df.select(F.transform_values(F.col("map"), lambda k, v: v * 2))
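The three map higher-order functions above all take a two-argument lambda (key, value). A plain-Python sketch using dict comprehensions may help show what each one returns. These local functions are illustrative stand-ins, and the sample map is made up.

```python
def map_filter(m, predicate):
    # Keep only the entries for which predicate(key, value) is true
    return {k: v for k, v in m.items() if predicate(k, v)}


def transform_keys(m, function):
    # The lambda's result becomes the new key; values are unchanged
    return {function(k, v): v for k, v in m.items()}


def transform_values(m, function):
    # The lambda's result becomes the new value; keys are unchanged
    return {k: function(k, v) for k, v in m.items()}


scores = {"a": 5, "b": 20}
print(map_filter(scores, lambda k, v: v > 10))          # {'b': 20}
print(transform_keys(scores, lambda k, v: k.upper()))   # {'A': 5, 'B': 20}
print(transform_values(scores, lambda k, v: v * 2))     # {'a': 10, 'b': 40}
```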
- static map_zip_with(col1, col2, function)[source]
Merge two maps into a single map using a function (PySpark 3.1+).
This is a higher-order function that combines two maps by applying the provided lambda function to matching keys.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the map_zip_with function.
Example
>>> df.select(F.map_zip_with(F.col("map1"), F.col("map2"), lambda k, v1, v2: v1 + v2))
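In PySpark, a key present in only one of the maps is still passed to the function, with null for the missing side. The sketch below assumes Sparkless does likewise and models the rule with dictionaries. Note that a lambda such as v1 + v2 would therefore need to handle None, as the example does.

```python
def map_zip_with(m1, m2, function):
    """Merge two dicts key by key; a missing side is passed as None."""
    keys = list(dict.fromkeys([*m1, *m2]))  # union of keys, first-seen order
    return {k: function(k, m1.get(k), m2.get(k)) for k in keys}


map1 = {"a": 1, "b": 2}
map2 = {"b": 10, "c": 20}
merged = map_zip_with(map1, map2, lambda k, v1, v2: (v1 or 0) + (v2 or 0))
print(merged)  # {'a': 1, 'b': 12, 'c': 20}
```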
Aggregate Functions
Aggregate functions for Sparkless.
This module provides comprehensive aggregate functions that match PySpark’s aggregate function API. Includes statistical operations, counting functions, and data summarization operations for grouped data processing in DataFrames.
- Key Features:
Complete PySpark aggregate function API compatibility
Basic aggregates (count, sum, avg, max, min)
Statistical functions (stddev, variance, skewness, kurtosis)
Collection aggregates (collect_list, collect_set, first, last)
Distinct counting (countDistinct)
Type-safe operations with proper return types
Support for both column references and expressions
Proper handling of null values and edge cases
Example
>>> from sparkless.sql import SparkSession, functions as F
>>> spark = SparkSession("test")
>>> data = [{"dept": "IT", "salary": 50000}, {"dept": "IT", "salary": 60000}]
>>> df = spark.createDataFrame(data)
>>> grouped = df.groupBy("dept")
>>> result = grouped.agg(
... F.count("*").alias("count"),
... F.avg("salary").alias("avg_salary"),
... F.max("salary").alias("max_salary")
... )
>>> result.show()
DataFrame[1 rows, 4 columns]
dept count avg_salary max_salary
IT 2 55000.0 60000
- class sparkless.functions.aggregate.AggregateFunctions[source]
Bases: object
Collection of aggregate functions.
- static count(column=None)[source]
Count non-null values.
- Parameters:
column (Union[Column, str, None]) – The column to count (None for count(*)).
- Return type:
- Returns:
ColumnOperation representing the count function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static sum(column)[source]
Sum values.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the sum function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static avg(column)[source]
Average values.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the avg function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static max(column)[source]
Maximum value.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the max function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static min(column)[source]
Minimum value.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the min function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static first(column, ignorenulls=False)[source]
First value.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the first function.
- Raises:
RuntimeError – If no active SparkSession is available
- static last(column)[source]
Last value.
- Parameters:
column (Union[Column, str]) – The column to get last value of.
- Return type:
- Returns:
AggregateFunction representing the last function.
- Raises:
RuntimeError – If no active SparkSession is available
- static collect_list(column)[source]
Collect values into a list.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the collect_list function.
- Raises:
RuntimeError – If no active SparkSession is available
- static collect_set(column)[source]
Collect unique values into a set.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the collect_set function.
- Raises:
RuntimeError – If no active SparkSession is available
- static stddev(column)[source]
Standard deviation.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the stddev function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static std(column)[source]
Alias for stddev - Standard deviation.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the std function.
- Raises:
RuntimeError – If no active SparkSession is available
- static product(column)[source]
Multiply all values in column.
- Parameters:
column (Union[Column, str]) – The column to multiply values of.
- Return type:
- Returns:
AggregateFunction representing the product function.
- Raises:
RuntimeError – If no active SparkSession is available
- static sum_distinct(column)[source]
Sum of distinct values.
- Parameters:
column (Union[Column, str]) – The column to sum distinct values of.
- Return type:
- Returns:
AggregateFunction representing the sum_distinct function.
- Raises:
RuntimeError – If no active SparkSession is available
- static variance(column)[source]
Variance.
- Parameters:
column (Union[Column, str]) – The column to get variance of.
- Return type:
- Returns:
ColumnOperation representing the variance function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static skewness(column)[source]
Skewness.
- Parameters:
column (Union[Column, str]) – The column to get skewness of.
- Return type:
- Returns:
AggregateFunction representing the skewness function.
- Raises:
RuntimeError – If no active SparkSession is available
- static kurtosis(column)[source]
Kurtosis.
- Parameters:
column (Union[Column, str]) – The column to get kurtosis of.
- Return type:
- Returns:
AggregateFunction representing the kurtosis function.
- Raises:
RuntimeError – If no active SparkSession is available
- static countDistinct(column)[source]
Count distinct values.
- Parameters:
column (Union[Column, str]) – The column to count distinct values of.
- Return type:
- Returns:
AggregateFunction representing the countDistinct function.
- Raises:
RuntimeError – If no active SparkSession is available
- static count_distinct(column)[source]
Alias for countDistinct - Count distinct values.
- Parameters:
column (Union[Column, str]) – The column to count distinct values of.
- Return type:
- Returns:
AggregateFunction representing the count_distinct function.
- Raises:
RuntimeError – If no active SparkSession is available
- static percentile_approx(column, percentage, accuracy=10000)[source]
Approximate percentile.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the percentile_approx function.
- Raises:
RuntimeError – If no active SparkSession is available
- static corr(column1, column2)[source]
Correlation between two columns.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the corr function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static covar_samp(column1, column2)[source]
Sample covariance between two columns.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the covar_samp function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static bool_and(column)[source]
Aggregate AND - returns true if all values are true (PySpark 3.1+).
- Parameters:
column (Union[Column, str]) – Column containing boolean values.
- Return type:
- Returns:
AggregateFunction representing the bool_and function.
- Raises:
RuntimeError – If no active SparkSession is available
- static bool_or(column)[source]
Aggregate OR - returns true if any value is true (PySpark 3.1+).
- Parameters:
column (Union[Column, str]) – Column containing boolean values.
- Return type:
- Returns:
AggregateFunction representing the bool_or function.
- Raises:
RuntimeError – If no active SparkSession is available
- static every(column)[source]
Alias for bool_and (PySpark 3.1+).
- Parameters:
column (Union[Column, str]) – Column containing boolean values.
- Return type:
- Returns:
AggregateFunction representing the every function.
- Raises:
RuntimeError – If no active SparkSession is available
- static some(column)[source]
Alias for bool_or (PySpark 3.1+).
- Parameters:
column (Union[Column, str]) – Column containing boolean values.
- Return type:
- Returns:
AggregateFunction representing the some function.
- Raises:
RuntimeError – If no active SparkSession is available
- static max_by(column, ord)[source]
Return value associated with the maximum of ord column (PySpark 3.1+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the max_by function.
- Raises:
RuntimeError – If no active SparkSession is available
- static min_by(column, ord)[source]
Return value associated with the minimum of ord column (PySpark 3.1+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the min_by function.
- Raises:
RuntimeError – If no active SparkSession is available
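max_by and min_by return a value from one column, chosen by the extreme of another. The plain-Python sketch below shows the idea on a list of dictionaries. The sample rows are made up, and the local functions are stand-ins rather than the Sparkless aggregates.

```python
rows = [
    {"name": "Alice", "salary": 50000},
    {"name": "Bob", "salary": 60000},
    {"name": "Cara", "salary": 55000},
]


def max_by(rows, column, ord):
    """Value of `column` from the row where `ord` is largest."""
    return max(rows, key=lambda r: r[ord])[column]


def min_by(rows, column, ord):
    """Value of `column` from the row where `ord` is smallest."""
    return min(rows, key=lambda r: r[ord])[column]


print(max_by(rows, "name", "salary"))  # Bob
print(min_by(rows, "name", "salary"))  # Alice
```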
- static count_if(column)[source]
Count rows where condition is true (PySpark 3.1+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the count_if function.
- Raises:
RuntimeError – If no active SparkSession is available
- static any_value(column)[source]
Return any non-null value (non-deterministic) (PySpark 3.1+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the any_value function.
- Raises:
RuntimeError – If no active SparkSession is available
- static mean(column)[source]
Aggregate function: returns the mean of the values (alias for avg).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the mean function.
- Raises:
RuntimeError – If no active SparkSession is available
- static approx_count_distinct(column, rsd=None)[source]
Returns the approximate count of distinct elements (replaces the deprecated alias approxCountDistinct).
- Parameters:
column (Union[Column, str]) – Column to count distinct values.
rsd (Optional[float]) – Optional relative standard deviation (default: None, which uses PySpark’s default of 0.05). Controls the approximation accuracy. Lower values provide better accuracy but use more memory. Typical values range from 0.01 (1% error) to 0.1 (10% error).
- Return type:
- Returns:
ColumnOperation representing the approx_count_distinct function (PySpark-compatible).
- Raises:
RuntimeError – If no active SparkSession is available
- static stddev_pop(column)[source]
Returns population standard deviation.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the stddev_pop function.
- Raises:
RuntimeError – If no active SparkSession is available
- static stddev_samp(column)[source]
Returns sample standard deviation.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the stddev_samp function.
- Raises:
RuntimeError – If no active SparkSession is available
- static var_pop(column)[source]
Returns population variance.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the var_pop function.
- Raises:
RuntimeError – If no active SparkSession is available
- static var_samp(column)[source]
Returns sample variance.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the var_samp function.
- Raises:
RuntimeError – If no active SparkSession is available
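The population and sample variants differ only in the divisor: n for the population, n - 1 for the sample. In PySpark, stddev and variance are aliases for the sample versions; whether Sparkless matches this is assumed here from its stated compatibility goal. The sketch below shows the difference using Python's statistics module on made-up values.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, squared deviations sum to 32

var_pop = statistics.pvariance(values)    # 32 / 8
var_samp = statistics.variance(values)    # 32 / 7
stddev_pop = statistics.pstdev(values)    # sqrt(32 / 8)
stddev_samp = statistics.stdev(values)    # sqrt(32 / 7)

print(var_pop, var_samp, stddev_pop, stddev_samp)
```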
- static covar_pop(column1, column2)[source]
Returns population covariance.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the covar_pop function.
- static median(column)[source]
Returns the median value (PySpark 3.4+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the median function.
- Raises:
RuntimeError – If no active SparkSession is available
- static mode(column)[source]
Returns the most frequent value (mode) (PySpark 3.4+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the mode function.
- Raises:
RuntimeError – If no active SparkSession is available
- static percentile(column, percentage)[source]
Returns the exact percentile value (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the percentile function.
- static approxCountDistinct(*cols)[source]
Deprecated alias for approx_count_distinct (all PySpark versions).
Use approx_count_distinct instead.
- static sumDistinct(column)[source]
Deprecated alias for sum_distinct (PySpark 3.2+).
Use sum_distinct instead (or sum(distinct(col)) for earlier versions).
- Parameters:
- Return type:
- Returns:
AggregateFunction for distinct sum.
- static regr_avgx(y, x)[source]
Linear regression average of x.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the regr_avgx function.
- Raises:
RuntimeError – If no active SparkSession is available
- static regr_avgy(y, x)[source]
Linear regression average of y.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the regr_avgy function.
- Raises:
RuntimeError – If no active SparkSession is available
- static regr_count(y, x)[source]
Linear regression count.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the regr_count function.
- Raises:
RuntimeError – If no active SparkSession is available
- static regr_intercept(y, x)[source]
Linear regression intercept.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the regr_intercept function.
- Raises:
RuntimeError – If no active SparkSession is available
- static regr_r2(y, x)[source]
Linear regression R-squared.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the regr_r2 function.
- Raises:
RuntimeError – If no active SparkSession is available
- static regr_slope(y, x)[source]
Linear regression slope.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the regr_slope function.
- Raises:
RuntimeError – If no active SparkSession is available
- static regr_sxx(y, x)[source]
Linear regression sum of squares of x.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the regr_sxx function.
- Raises:
RuntimeError – If no active SparkSession is available
- static regr_sxy(y, x)[source]
Linear regression sum of products.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the regr_sxy function.
- Raises:
RuntimeError – If no active SparkSession is available
- static regr_syy(y, x)[source]
Linear regression sum of squares of y.
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the regr_syy function.
- Raises:
RuntimeError – If no active SparkSession is available
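The regr_* functions are related through the usual least-squares formulas. The sketch below computes them together in plain Python from (y, x) pairs, skipping any pair that contains a null, as SQL regression aggregates conventionally do. The helper regr_stats is hypothetical, and it assumes x and y each take at least two distinct values.

```python
def regr_stats(pairs):
    """Compute regr_*-style statistics from (y, x) pairs (hypothetical helper)."""
    # Only pairs where both y and x are present take part in the fit
    pts = [(y, x) for y, x in pairs if y is not None and x is not None]
    n = len(pts)
    avgy = sum(y for y, _ in pts) / n
    avgx = sum(x for _, x in pts) / n
    sxx = sum((x - avgx) ** 2 for _, x in pts)
    syy = sum((y - avgy) ** 2 for y, _ in pts)
    sxy = sum((x - avgx) * (y - avgy) for y, x in pts)
    slope = sxy / sxx
    return {
        "count": n, "avgx": avgx, "avgy": avgy,
        "sxx": sxx, "syy": syy, "sxy": sxy,
        "slope": slope,
        "intercept": avgy - slope * avgx,
        "r2": sxy ** 2 / (sxx * syy),
    }


# y = 2x + 1 for x = 1..4, plus one pair that is ignored because y is None
stats = regr_stats([(3, 1), (5, 2), (7, 3), (9, 4), (None, 5)])
print(stats["count"], stats["slope"], stats["intercept"], stats["r2"])
```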
Conditional Functions
Conditional functions for Sparkless.
This module contains conditional functions including CASE WHEN expressions.
- sparkless.functions.conditional.validate_rule(column, rule)[source]
Convert validation rule to column expression.
- Parameters:
- Return type:
- Returns:
Column expression for the validation rule.
- Raises:
ValueError – If rule is not recognized.
- class sparkless.functions.conditional.CaseWhen(column=None, condition=None, value=None)[source]
Bases: object
Represents a CASE WHEN expression.
This class handles complex conditional logic with multiple conditions and default values, similar to SQL CASE WHEN statements.
Initialize CaseWhen.
- Parameters:
- cast(data_type)[source]
Cast the CASE WHEN expression to a different data type.
- Parameters:
data_type (Any) – The target data type (DataType instance or string type name).
- Return type:
- Returns:
ColumnOperation representing the cast operation.
Example
>>> F.when(F.col("value") == "A", F.lit(100)).otherwise(F.lit(200)).cast("long")
- __add__(other)[source]
Addition operation (PySpark-compatible).
- Parameters:
other (Any)
- Return type:
- __sub__(other)[source]
Subtraction operation (PySpark-compatible).
- Parameters:
other (Any)
- Return type:
- __mul__(other)[source]
Multiplication operation (PySpark-compatible).
- Parameters:
other (Any)
- Return type:
- __truediv__(other)[source]
Division operation (PySpark-compatible).
- Parameters:
other (Any)
- Return type:
- __radd__(other)[source]
Reverse addition operation (for 2 + case_when).
- Parameters:
other (Any)
- Return type:
- __rsub__(other)[source]
Reverse subtraction operation (for 2 - case_when).
- Parameters:
other (Any)
- Return type:
- __rmul__(other)[source]
Reverse multiplication operation (for 2 * case_when).
- Parameters:
other (Any)
- Return type:
- __rtruediv__(other)[source]
Reverse division operation (for 2 / case_when).
- Parameters:
other (Any)
- Return type:
- __rmod__(other)[source]
Reverse modulo operation (for 2 % case_when).
- Parameters:
other (Any)
- Return type:
- __or__(other)[source]
Bitwise OR operation (PySpark-compatible).
- Parameters:
other (Any)
- Return type:
- __and__(other)[source]
Bitwise AND operation (PySpark-compatible).
- Parameters:
other (Any)
- Return type:
- class sparkless.functions.conditional.ConditionalFunctions[source]
Bases: object
Collection of conditional functions.
- static coalesce(*columns)[source]
Return the first non-null value from a list of columns.
- static isnull(column)[source]
Check if a column is null.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the isnull function.
- static isnotnull(column)[source]
Check if a column is not null.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the isnotnull function.
- static isnan(column)[source]
Check if a column is NaN (Not a Number).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the isnan function.
- static assert_true(condition)[source]
Assert that a condition is true, raises error if false.
- Parameters:
condition (Union[Column, ColumnOperation, str]) – Boolean condition to assert.
- Return type:
- Returns:
ColumnOperation representing the assert_true function.
Example
>>> df.select(F.assert_true(F.col("value") > 0))
- static ifnull(col1, col2)[source]
Alias for coalesce(col1, col2) - Returns col2 if col1 is null (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the ifnull function.
- static nullif(col1, col2)[source]
Returns null if col1 equals col2, otherwise returns col1 (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the nullif function.
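The null-handling rules described for coalesce, ifnull, and nullif can be illustrated on plain Python values, with None standing in for SQL NULL. This is a sketch of the semantics only; the real functions build column expressions rather than returning values, and these helpers are hypothetical.

```python
from typing import Any, Optional


def coalesce(*values: Any) -> Optional[Any]:
    """Return the first value that is not None, or None if every value is None."""
    for value in values:
        if value is not None:
            return value
    return None


def ifnull(first: Any, second: Any) -> Optional[Any]:
    """Two-argument form of coalesce: fall back to `second` when `first` is None."""
    return coalesce(first, second)


def nullif(first: Any, second: Any) -> Optional[Any]:
    """Return None when the two values are equal, otherwise return `first`."""
    return None if first == second else first
```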
- static case_when(*conditions, else_value=None)[source]
Create CASE WHEN expression with multiple conditions.
- Parameters:
- Return type:
- Returns:
CaseWhen object representing the CASE WHEN expression.
Example
>>> F.case_when(
...     (F.col("age") > 18, "adult"),
...     (F.col("age") > 12, "teen"),
...     else_value="child"
... )
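A CASE WHEN expression is conventionally evaluated top to bottom, returning the value of the first branch whose condition holds. The sketch below mirrors the age example on plain Python rows, assuming that first-match rule. The helper and its predicate-based signature are hypothetical and differ from the column-based case_when API.

```python
from typing import Any, Callable, Dict, Optional, Sequence, Tuple

Row = Dict[str, Any]
Branch = Tuple[Callable[[Row], bool], Any]  # (predicate, value)


def case_when_value(row: Row, branches: Sequence[Branch], else_value: Optional[Any] = None) -> Any:
    """Return the value of the first branch whose predicate is true, else else_value."""
    for predicate, value in branches:
        if predicate(row):
            return value
    return else_value


# Same ordering as the example: the "adult" test runs before the "teen" test.
AGE_BRANCHES = [
    (lambda r: r["age"] > 18, "adult"),
    (lambda r: r["age"] > 12, "teen"),
]
```

Branch order matters in this sketch: swapping the two branches would label every row with age above 12 as "teen".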
- static try_subtract(left, right)[source]
Null-safe subtraction - returns NULL on error (PySpark 3.5+).
- static try_multiply(left, right)[source]
Null-safe multiplication - returns NULL on error (PySpark 3.5+).
- static try_sum(column)[source]
Null-safe sum aggregate - returns NULL on error (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the try_sum function.
- static try_avg(column)[source]
Null-safe average aggregate - returns NULL on error (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the try_avg function.
- static try_element_at(column, index)[source]
Null-safe element_at - returns NULL on error (PySpark 3.5+).
- static try_to_binary(column, format=None)[source]
Null-safe to_binary - returns NULL on error (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the try_to_binary function.
- static try_to_number(column, format=None)[source]
Null-safe to_number - returns NULL on error (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the try_to_number function.
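The try_* functions share one idea: an operation that would normally fail yields NULL instead. A minimal sketch of that pattern on plain Python values follows. The wrapper name and the set of exceptions it catches are assumptions made for illustration and say nothing about which errors Sparkless actually intercepts.

```python
from typing import Any, Callable, Optional


def try_call(fn: Callable[..., Any], *args: Any) -> Optional[Any]:
    """Run fn(*args); return None instead of raising on common runtime errors."""
    try:
        return fn(*args)
    except (ArithmeticError, ValueError, TypeError, LookupError):
        # Assumed error set: arithmetic failures, bad conversions, bad lookups.
        return None
```

With this wrapper, a failed numeric parse or an out-of-range lookup produces None, while a successful call passes its result through unchanged.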
Bitwise Functions
Bitwise functions for Sparkless (PySpark 3.2+).
This module provides bitwise operations on integer columns.
- class sparkless.functions.bitwise.BitwiseFunctions[source]
Bases: object
Collection of bitwise manipulation functions.
- static bit_count(column)[source]
Count the number of set bits (population count).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the bit_count function.
Example
>>> df.select(F.bit_count(F.col("value")))
- static bit_get(column, pos)[source]
Get bit value at position.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the bit_get function.
Example
>>> df.select(F.bit_get(F.col("value"), 0))
- static getbit(column, pos)[source]
Get bit value at position (alias for bit_get) (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the getbit function.
Example
>>> df.select(F.getbit(F.col("value"), 0))
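As an illustration of what bit_count and bit_get (or its alias getbit) compute, here is a pure-Python sketch on integers. It assumes bit positions are counted from the least significant bit starting at 0, and that negative values are viewed as 64-bit two's complement. Both points are assumptions, not statements from this reference, and the helpers are hypothetical.

```python
def bit_count(value: int, width: int = 64) -> int:
    """Population count; masking to `width` bits handles negatives as two's complement."""
    mask = (1 << width) - 1
    return bin(value & mask).count("1")


def bit_get(value: int, pos: int) -> int:
    """Return the bit (0 or 1) at position `pos`, where position 0 is the lowest bit."""
    return (value >> pos) & 1
```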
- static bitwise_not(column)[source]
Perform bitwise NOT operation.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the bitwise_not function.
Example
>>> df.select(F.bitwise_not(F.col("value")))
- static bit_and(column)[source]
Aggregate function - bitwise AND of all values (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the bit_and aggregate function.
Example
>>> df.groupBy("dept").agg(F.bit_and("flags"))
- static bit_or(column)[source]
Aggregate function - bitwise OR of all values (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the bit_or aggregate function.
Example
>>> df.groupBy("dept").agg(F.bit_or("flags"))
- static bit_xor(column)[source]
Aggregate function - bitwise XOR of all values (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
AggregateFunction representing the bit_xor aggregate function.
Example
>>> df.groupBy("dept").agg(F.bit_xor("flags"))
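The three bitwise aggregates fold a group's values with AND, OR, or XOR. A small sketch using functools.reduce shows the idea on a Python list. Skipping None values and returning None for an empty group are assumptions, and the helper is hypothetical.

```python
import operator
from functools import reduce
from typing import Callable, Iterable, Optional


def bit_agg(op: Callable[[int, int], int], values: Iterable[Optional[int]]) -> Optional[int]:
    """Fold the non-null values with a binary bitwise operator (hypothetical helper)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    return reduce(op, non_null)


flags = [0b0111, 0b0101, None, 0b1100]
and_result = bit_agg(operator.and_, flags)  # bits set in every value
or_result = bit_agg(operator.or_, flags)    # bits set in at least one value
xor_result = bit_agg(operator.xor, flags)   # bits set in an odd number of values
```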
- static bitwiseNOT(column)[source]
Deprecated alias for bitwise_not (all PySpark versions).
Use bitwise_not instead.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing bitwise NOT.
- static shiftLeft(column, num_bits)[source]
Deprecated alias for shiftleft (PySpark 3.0-3.1).
Use shiftleft instead.
- static shiftRight(column, num_bits)[source]
Deprecated alias for shiftright (PySpark 3.0-3.1).
Use shiftright instead.
- static shiftRightUnsigned(column, num_bits)[source]
Deprecated alias for shiftrightunsigned (PySpark 3.0-3.1).
Use shiftrightunsigned instead.
- static bitmap_bit_position(column)[source]
Get the bit position in a bitmap (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the bitmap_bit_position function.
Example
>>> df.select(F.bitmap_bit_position(F.col("bitmap")))
- static bitmap_bucket_number(column)[source]
Get the bucket number in a bitmap (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the bitmap_bucket_number function.
Example
>>> df.select(F.bitmap_bucket_number(F.col("bitmap")))
- static bitmap_construct_agg(column)[source]
Aggregate function - construct bitmap from values (PySpark 3.5+).
- Parameters:
column (Union[Column, str]) – Integer column to construct bitmap from.
- Return type:
- Returns:
AggregateFunction representing the bitmap_construct_agg function.
Example
>>> df.groupBy("dept").agg(F.bitmap_construct_agg("id"))
- static bitmap_count(column)[source]
Count the number of set bits in a bitmap (PySpark 3.5+).
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the bitmap_count function.
Example
>>> df.select(F.bitmap_count(F.col("bitmap")))
Window Functions
Window functions for Sparkless.
This module contains window function implementations including row_number, rank, etc.
- class sparkless.functions.window_execution.WindowFunction(function, window_spec)[source]
Bases: object
Represents a window function.
This class handles window functions like row_number(), rank(), etc. that operate over a window specification.
- Parameters:
function (Any)
window_spec (WindowSpec)
Initialize WindowFunction.
- Parameters:
function (Any) – The window function (e.g., row_number(), rank()).
window_spec (WindowSpec) – The window specification.
- __init__(function, window_spec)[source]
Initialize WindowFunction.
- Parameters:
function (Any) – The window function (e.g., row_number(), rank()).
window_spec (WindowSpec) – The window specification.
- alias(name)[source]
Create an alias for this window function.
- Parameters:
name (str) – The alias name.
- Return type:
- Returns:
Self for method chaining.
- cast(data_type)[source]
Cast the window function result to a different data type.
- Parameters:
data_type (Any) – The target data type (DataType instance or string type name).
- Return type:
- Returns:
ColumnOperation representing the cast operation.
Example
>>> F.row_number().over(window_spec).cast("long")
- __mul__(other)[source]
Multiply window function result by a value.
- Parameters:
other (Any) – The value to multiply by.
- Return type:
- Returns:
ColumnOperation representing the multiplication.
Example
>>> F.percent_rank().over(window) * 100
- __rmul__(other)[source]
Reverse multiply (e.g., 100 * window_func).
- Parameters:
other (Any) – The value to multiply.
- Return type:
- Returns:
ColumnOperation representing the multiplication.
Example
>>> 100 * F.percent_rank().over(window)
- __add__(other)[source]
Add a value to window function result.
- Parameters:
other (Any) – The value to add.
- Return type:
- Returns:
ColumnOperation representing the addition.
Example
>>> F.row_number().over(window) + 1
- __radd__(other)[source]
Reverse add (e.g., 1 + window_func).
- Parameters:
other (Any) – The value to add.
- Return type:
- Returns:
ColumnOperation representing the addition.
Example
>>> 1 + F.row_number().over(window)
- __sub__(other)[source]
Subtract a value from window function result.
- Parameters:
other (Any) – The value to subtract.
- Return type:
- Returns:
ColumnOperation representing the subtraction.
Example
>>> F.row_number().over(window) - 1
- __rsub__(other)[source]
Reverse subtract (e.g., 10 - window_func).
- Parameters:
other (Any) – The value to subtract from.
- Return type:
- Returns:
ColumnOperation representing the subtraction.
Example
>>> 10 - F.row_number().over(window)
- __truediv__(other)[source]
Divide window function result by a value.
- Parameters:
other (Any) – The value to divide by.
- Return type:
- Returns:
ColumnOperation representing the division.
Example
>>> F.row_number().over(window) / 10
- __rtruediv__(other)[source]
Reverse divide (e.g., 100 / window_func).
- Parameters:
other (Any) – The value to divide.
- Return type:
- Returns:
ColumnOperation representing the division.
Example
>>> 100 / F.row_number().over(window)
- __neg__()[source]
Negate window function result.
- Return type:
- Returns:
ColumnOperation representing the negation.
Example
>>> -F.row_number().over(window)
- __eq__(other)[source]
Equality comparison.
- Parameters:
other (Any) – The value to compare with.
- Return type:
- Returns:
ColumnOperation representing the equality comparison.
Example
>>> F.row_number().over(window) == 1
- __ne__(other)[source]
Inequality comparison.
- Parameters:
other (Any) – The value to compare with.
- Return type:
- Returns:
ColumnOperation representing the inequality comparison.
Example
>>> F.row_number().over(window) != 0
- __lt__(other)[source]
Less than comparison.
- Parameters:
other (Any) – The value to compare with.
- Return type:
- Returns:
ColumnOperation representing the less than comparison.
Example
>>> F.row_number().over(window) < 5
- __le__(other)[source]
Less than or equal comparison.
- Parameters:
other (Any) – The value to compare with.
- Return type:
- Returns:
ColumnOperation representing the less than or equal comparison.
Example
>>> F.row_number().over(window) <= 10
- __gt__(other)[source]
Greater than comparison.
- Parameters:
other (Any) – The value to compare with.
- Return type:
- Returns:
ColumnOperation representing the greater than comparison.
Example
>>> F.row_number().over(window) > 0
- __ge__(other)[source]
Greater than or equal comparison.
- Parameters:
other (Any) – The value to compare with.
- Return type:
- Returns:
ColumnOperation representing the greater than or equal comparison.
Example
>>> F.row_number().over(window) >= 1
- isnull()[source]
Check if window function result is null.
- Return type:
- Returns:
ColumnOperation representing the isnull check.
Example
>>> F.lag("value", 1).over(window).isnull()
- isnotnull()[source]
Check if window function result is not null.
- Return type:
- Returns:
ColumnOperation representing the isnotnull check.
Example
>>> F.lag("value", 1).over(window).isnotnull()
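To make the window functions used in the examples above concrete (row_number, rank, lag), the following pure-Python sketch computes them for a single partition that is already sorted by the ordering column. It follows the conventional SQL definitions. The helper names are hypothetical and the code does not use the Sparkless API.

```python
from typing import Any, List, Optional, Sequence


def row_number(ordered: Sequence[Any]) -> List[int]:
    """Sequential position within the partition, starting at 1."""
    return list(range(1, len(ordered) + 1))


def rank(ordered: Sequence[Any]) -> List[int]:
    """Ties share a rank, and the following rank skips ahead (1, 2, 2, 4)."""
    ranks: List[int] = []
    for i, value in enumerate(ordered):
        if i > 0 and value == ordered[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(i + 1)
    return ranks


def dense_rank(ordered: Sequence[Any]) -> List[int]:
    """Ties share a rank, and no ranks are skipped (1, 2, 2, 3)."""
    ranks: List[int] = []
    for i, value in enumerate(ordered):
        if i == 0:
            ranks.append(1)
        elif value == ordered[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(ranks[-1] + 1)
    return ranks


def lag(ordered: Sequence[Any], offset: int = 1, default: Optional[Any] = None) -> List[Any]:
    """Value from `offset` rows earlier; rows without a predecessor get `default`."""
    return [ordered[i - offset] if i >= offset else default for i in range(len(ordered))]
```

The None entries produced by lag for the leading rows are what checks such as isnull() and isnotnull() on a lag result would distinguish.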
XML Functions
XML functions for PySpark 3.2+ compatibility.
- class sparkless.functions.xml.XMLFunctions[source]
Bases: object
XML parsing and manipulation functions.
- static from_xml(col, schema)[source]
Parse XML string to struct based on schema.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the from_xml function.
Example
>>> df.select(F.from_xml(F.col("xml"), "name STRING, age INT"))
- static to_xml(col)[source]
Convert struct column to XML string.
- Parameters:
col (Union[Column, ColumnOperation, str]) – Struct column to convert.
- Return type:
- Returns:
ColumnOperation representing the to_xml function.
Example
>>> df.select(F.to_xml(F.struct(F.col("name"), F.col("age"))))
- static schema_of_xml(col)[source]
Infer schema from XML string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the schema_of_xml function.
Example
>>> df.select(F.schema_of_xml(F.col("xml")))
- static xpath(xml, path)[source]
Extract array of values from XML using XPath.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the xpath function.
Example
>>> df.select(F.xpath(F.col("xml"), "/root/item"))
- static xpath_boolean(xml, path)[source]
Evaluate XPath expression to boolean.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the xpath_boolean function.
Example
>>> df.select(F.xpath_boolean(F.col("xml"), "/root/active='true'"))
- static xpath_double(xml, path)[source]
Extract double value from XML using XPath.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the xpath_double function.
Example
>>> df.select(F.xpath_double(F.col("xml"), "/root/value"))
- static xpath_float(xml, path)[source]
Extract float value from XML using XPath.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the xpath_float function.
Example
>>> df.select(F.xpath_float(F.col("xml"), "/root/price"))
- static xpath_int(xml, path)[source]
Extract integer value from XML using XPath.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the xpath_int function.
Example
>>> df.select(F.xpath_int(F.col("xml"), "/root/age"))
- static xpath_long(xml, path)[source]
Extract long value from XML using XPath.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the xpath_long function.
Example
>>> df.select(F.xpath_long(F.col("xml"), "/root/value"))
- static xpath_short(xml, path)[source]
Extract short value from XML using XPath.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the xpath_short function.
Example
>>> df.select(F.xpath_short(F.col("xml"), "/root/count"))
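The xpath family extracts values from an XML string by path. As a rough illustration on a single string, the sketch below uses Python's standard xml.etree.ElementTree module. ElementTree supports only a limited XPath subset and resolves paths relative to the root element, so a path written as "/root/item" in the examples above becomes "./item" here. The helper names are hypothetical, and this is not how Sparkless is stated to implement these functions.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def xpath_texts(xml: str, path: str) -> List[Optional[str]]:
    """Return the text of every element matching `path`, relative to the root element."""
    root = ET.fromstring(xml)
    return [node.text for node in root.findall(path)]


def xpath_int(xml: str, path: str) -> Optional[int]:
    """Return the first match converted to int, or None when nothing matches."""
    text = ET.fromstring(xml).findtext(path)
    return int(text) if text is not None else None
```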
Crypto Functions
Cryptographic functions for Sparkless.
This module provides cryptographic functions that match PySpark’s crypto function API. Includes encryption and decryption operations for secure data processing in DataFrames.
- Key Features:
AES encryption and decryption
Null-safe cryptographic operations
Type-safe operations with proper return types
Support for both column references and string literals
Example
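>>> from sparkless.sql import SparkSession, functions as F
>>> spark = SparkSession("test")
>>> # PySpark's AES functions expect a 16-, 24-, or 32-byte key.
>>> data = [{"data": "sensitive information", "key": "0123456789abcdef"}]
>>> df = spark.createDataFrame(data)
>>> # Encrypt first so that the "encrypted" column exists before it is decrypted.
>>> encrypted_df = df.withColumn("encrypted", F.aes_encrypt(F.col("data"), F.col("key")))
>>> encrypted_df.select(
...     F.col("encrypted"),
...     F.aes_decrypt(F.col("encrypted"), F.col("key"))
... ).show()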
- class sparkless.functions.crypto.CryptoFunctions[source]
Bases: object
Collection of cryptographic functions.
- static aes_encrypt(data, key, mode=None, padding=None)[source]
Encrypt data using AES encryption.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the aes_encrypt function.
- static aes_decrypt(data, key, mode=None, padding=None)[source]
Decrypt data using AES decryption.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the aes_decrypt function.
JSON/CSV Functions
JSON and CSV functions for Sparkless.
This module provides JSON and CSV processing functions that match PySpark’s API. Includes parsing, generation, and schema inference for JSON and CSV data.
- class sparkless.functions.json_csv.JSONCSVFunctions[source]
Bases: object
Collection of JSON and CSV manipulation functions.
- static from_json(column, schema, options=None)[source]
Parse JSON string column into struct/array column.
- static to_json(column)[source]
Convert struct/array column to JSON string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing to_json
- static get_json_object(column, path)[source]
Extract JSON object at specified path.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing get_json_object
- static json_tuple(column, *fields)[source]
Extract multiple fields from JSON string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing json_tuple
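The path-based extraction performed by get_json_object and json_tuple can be sketched with the standard json module. The version below handles only simple "$.a.b" paths over JSON objects and returns parsed Python values, whereas PySpark returns strings. Both helpers are hypothetical stand-ins, and the None-on-failure behavior is an assumption.

```python
import json
from typing import Any, Optional, Tuple


def get_json_object(text: Optional[str], path: str) -> Optional[Any]:
    """Follow a '$.key.key' path; return None for bad JSON or a missing key."""
    try:
        node = json.loads(text)
    except (TypeError, ValueError):
        return None
    if not path.startswith("$"):
        return None
    for key in (part for part in path[1:].split(".") if part):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def json_tuple(text: Optional[str], *fields: str) -> Tuple[Optional[Any], ...]:
    """Extract several top-level fields at once, one result per requested field."""
    return tuple(get_json_object(text, f"$.{field}") for field in fields)
```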
- static schema_of_json(json_string)[source]
Infer schema from JSON string.
- Parameters:
json_string (str) – Sample JSON string.
- Return type:
- Returns:
ColumnOperation representing schema_of_json
- static to_csv(column)[source]
Convert struct column to CSV string.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing to_csv
Column Operations
Column implementation for Sparkless.
This module provides the Column class for DataFrame column operations, maintaining compatibility with PySpark’s Column interface.
- class sparkless.functions.core.column.ColumnOperatorMixin[source]
Bases: object
Mixin providing common operator methods for Column and ColumnOperation.
- eqNullSafe(other)[source]
Null-safe equality comparison (PySpark eqNullSafe).
This behaves like PySpark’s eqNullSafe:
- If both sides are null, the comparison is True.
- If exactly one side is null, the comparison is False.
- Otherwise, it behaves like standard equality, including any backend-specific type coercion rules.
- Parameters:
other (Any)
- Return type:
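The three eqNullSafe rules listed above translate directly into a few lines of Python, with None standing in for null. This is a value-level sketch with a hypothetical helper name; it does not model the backend-specific type coercion mentioned in the last rule.

```python
from typing import Any


def eq_null_safe(left: Any, right: Any) -> bool:
    """Null-safe equality: always returns True or False, never None."""
    if left is None and right is None:
        return True   # both sides null
    if left is None or right is None:
        return False  # exactly one side null
    return left == right  # ordinary equality otherwise
```

Unlike ordinary SQL equality, which yields NULL when either side is NULL, this comparison always produces a definite boolean.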
- __radd__(other)[source]
Reverse addition operation (for 2 + col).
- Parameters:
other (Any)
- Return type:
- __rsub__(other)[source]
Reverse subtraction operation (for 2 - col).
- Parameters:
other (Any)
- Return type:
- __rmul__(other)[source]
Reverse multiplication operation (for 2 * col).
- Parameters:
other (Any)
- Return type:
- __rtruediv__(other)[source]
Reverse division operation (for 2 / col).
- Parameters:
other (Any)
- Return type:
- __rmod__(other)[source]
Reverse modulo operation (for 2 % col).
- Parameters:
other (Any)
- Return type:
- __rpow__(other)[source]
Reverse power operation (for 2 ** col or 3.0 ** col).
- Parameters:
other (Any)
- Return type:
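The reverse operator methods above rely on Python's reflected-operand protocol: when the left operand (for example the int 2) does not know how to combine with the right operand, Python calls the right operand's `__r*__` method instead. The toy class below shows the mechanism by recording expression text. The class `Expr` and the helper `_text` are hypothetical and much simpler than the real Column classes.

```python
from typing import Any


class Expr:
    """Tiny stand-in for a column expression; it only records the expression text."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __add__(self, other: Any) -> "Expr":
        return Expr(f"({self.text} + {_text(other)})")

    def __radd__(self, other: Any) -> "Expr":
        # Called for `other + self` when `other` cannot handle an Expr.
        return Expr(f"({_text(other)} + {self.text})")

    def __rsub__(self, other: Any) -> "Expr":
        # Operand order matters: this is `other - self`, not `self - other`.
        return Expr(f"({_text(other)} - {self.text})")

    def __rpow__(self, other: Any) -> "Expr":
        return Expr(f"({_text(other)} ** {self.text})")


def _text(value: Any) -> str:
    """Render an operand: nested expressions by their text, literals by repr()."""
    return value.text if isinstance(value, Expr) else repr(value)
```

Keeping the operand order intact in the reflected methods is the important detail for non-commutative operations such as subtraction, division, modulo, and power.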
- isin(*values)[source]
Check if column value is in list of values.
- Parameters:
*values (Any) – Variable number of values to check against. Can be passed as individual arguments (e.g., col.isin(1, 2, 3)) or as a single list (e.g., col.isin([1, 2, 3])) for backward compatibility. Supports automatic type coercion for mixed types (e.g., checking integers in a string column will convert values to strings).
- Return type:
- Returns:
ColumnOperation representing the isin check.
Example
>>> df.filter(F.col("value").isin(1, 2, 3))
>>> df.filter(F.col("value").isin([1, 2, 3]))  # Also supported
>>> df.filter(F.col("str_col").isin(1, 2, 3))  # Auto-converts to strings
Note
Fixed in version 3.23.0 (Issue #226): Added support for *values arguments and automatic type coercion for mixed types to match PySpark behavior.
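The two calling conventions and the string coercion described for isin can be sketched on a single Python value as follows. The helper is hypothetical; it models only the integer-to-string case mentioned above and does not model null handling.

```python
from typing import Any


def isin_value(value: Any, *candidates: Any) -> bool:
    """Membership test accepting isin(1, 2, 3) or isin([1, 2, 3])."""
    # A single list/tuple/set argument is unpacked into the candidate values.
    if len(candidates) == 1 and isinstance(candidates[0], (list, tuple, set)):
        candidates = tuple(candidates[0])
    # For a string value, compare against the string form of each candidate.
    if isinstance(value, str):
        candidates = tuple(str(c) for c in candidates)
    return value in candidates
```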
- between(lower, upper)[source]
Check if column value is between lower and upper bounds.
- Parameters:
- Return type:
- contains(literal)[source]
Check if column contains the literal string.
- Parameters:
literal (str)
- Return type:
- startswith(literal)[source]
Check if column starts with the literal string.
- Parameters:
literal (str)
- Return type:
- endswith(literal)[source]
Check if column ends with the literal string.
- Parameters:
literal (str)
- Return type:
- substr(start, length)[source]
Extract substring from string column.
- Parameters:
- Return type:
- Returns:
ColumnOperation representing the substr operation.
Example
>>> df.select(F.col("name").substr(1, 2))
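In PySpark, substr positions are 1-based, so substr(1, 2) takes the first two characters. Assuming Sparkless follows the same convention, the mapping to Python's 0-based slicing looks like this. The helper is hypothetical and covers only start values of 1 or greater.

```python
from typing import Optional


def substr_value(value: Optional[str], start: int, length: int) -> Optional[str]:
    """1-based substring for start >= 1; a None input stays None."""
    if value is None:
        return None
    begin = start - 1  # convert the 1-based position to a 0-based index
    return value[begin:begin + length]
```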
- cast(data_type)[source]
Cast column to different data type.
- Parameters:
data_type (DataType)
- Return type:
- astype(data_type)[source]
Cast column to different data type (alias for cast).
This method is an alias for cast() and matches PySpark’s API.
- Parameters:
data_type (Union[DataType, str]) – The target data type (DataType object or string name like “date”, “string”, etc.).
- Return type:
- Returns:
ColumnOperation representing the cast operation.
Example
>>> df.select(F.col("name").astype("string"))
>>> df.select(F.substring("date", 1, 10).astype("date"))
- getItem(key)[source]
Get item from array by index or map by key.
- Parameters:
key (Any) – Index (int) for array access or key (any) for map access.
- Return type:
- Returns:
ColumnOperation representing the getItem operation. Returns None for out-of-bounds array access (matching PySpark behavior).
Example
>>> df.select(F.col("array_col").getItem(0))
>>> df.select(F.col("map_col").getItem("key"))
>>> df.select(F.col("array_col").getItem(999))  # Returns None if out of bounds
Note
Fixed in version 3.23.0 (Issue #227): Out-of-bounds array access now returns None instead of raising errors, matching PySpark behavior.
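The out-of-bounds rule in the note above can be pictured with a small value-level helper: a lookup that misses returns None rather than raising. The helper is hypothetical. Treating negative indexes and missing map keys as None is an assumption, since the note mentions only out-of-bounds array access.

```python
from typing import Any, Optional


def get_item(container: Any, key: Any) -> Optional[Any]:
    """Array index or map key lookup that returns None instead of raising."""
    if container is None:
        return None
    if isinstance(container, dict):
        return container.get(key)      # missing key -> None (assumption)
    if isinstance(key, int) and 0 <= key < len(container):
        return container[key]
    return None                        # out of bounds (or negative, by assumption)
```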
- withField(fieldName, col)[source]
Add or replace a field in a struct column.
- Parameters:
fieldName (str) – Name of the field to add or replace
col (Union[Column, ColumnOperation, Literal, Any]) – Column expression for the new field value. Can be a Column, ColumnOperation, Literal, or any value that will be converted to a Literal.
- Return type:
- Returns:
ColumnOperation representing the withField operation.
Example
>>> df.withColumn("my_struct", F.col("my_struct").withField("new_field", F.lit("value")))
>>> df.withColumn("my_struct", F.col("my_struct").withField("existing_field", F.col("other_col")))
Note
PySpark 3.1.0+ feature. Works only on struct columns. If field exists, it will be replaced. If it doesn’t exist, it will be added.
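The add-or-replace behavior in the note can be sketched by modeling a struct value as a Python dict. The helper is hypothetical; returning a new dict rather than mutating the input is a choice made here to mirror how column expressions produce new values.

```python
from typing import Any, Dict


def with_field(struct: Dict[str, Any], field_name: str, value: Any) -> Dict[str, Any]:
    """Return a copy of the struct with field_name added or replaced."""
    updated = dict(struct)       # leave the original struct untouched
    updated[field_name] = value  # adds a new field or overwrites an existing one
    return updated
```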
- class sparkless.functions.core.column.Column(name, column_type=None)[source]
Bases: ColumnOperatorMixin, IColumn
Mock column expression for DataFrame operations.
Provides a PySpark-compatible column expression that supports all comparison and logical operations. Used for creating complex DataFrame transformations and filtering conditions.
Initialize Column.
- Parameters:
- __getitem__(key)[source]
Support subscript notation for struct field access and map lookup.
- Parameters:
key (Any) – Field name (string) for struct field access, or Column for map lookup.
- Returns:
For struct: New Column with the struct field path (e.g., “StructVal.E1”). For map: ColumnOperation getItem for map[key_column] lookup.
Example
>>> F.col("StructVal")["E1"]  # Returns Column("StructVal.E1")
>>> F.col("map_col")[F.col("key_col")]  # Map lookup by column (Issue #440)
- getField(index_or_name)[source]
Access array element by index or struct field by name (PySpark getField).
- Parameters:
index_or_name (Union[int, str]) – int for array index (same as getItem), str for struct field.
- Return type:
Union[Column, ColumnOperation]
- Returns:
Column for struct field path, ColumnOperation for array/map access.
Example
>>> df.select(F.col("ArrayVal").getField(0))
>>> df.select(F.col("Person").getField("name"))
- when(condition, value)[source]
Start a CASE WHEN expression.
- Parameters:
condition (ColumnOperation)
value (Any)
- Return type:
- over(window_spec)[source]
Apply window function over window specification.
- Parameters:
window_spec (WindowSpec)
- Return type:
- count()[source]
Count non-null values in this column.
- Return type:
- Returns:
ColumnOperation representing the count operation.
- avg()[source]
Average values in this column.
- Return type:
- Returns:
ColumnOperation representing the avg function (PySpark-compatible).
- sum()[source]
Sum values in this column.
- Return type:
- Returns:
ColumnOperation representing the sum function (PySpark-compatible).
- max()[source]
Maximum value in this column.
- Return type:
- Returns:
ColumnOperation representing the max function (PySpark-compatible).
- min()[source]
Minimum value in this column.
- Return type:
- Returns:
ColumnOperation representing the min function (PySpark-compatible).
- stddev()[source]
Standard deviation of values in this column.
- Return type:
- Returns:
ColumnOperation representing the stddev function (PySpark-compatible).
- variance()[source]
Variance of values in this column.
- Return type:
- Returns:
ColumnOperation representing the variance function (PySpark-compatible).
- class sparkless.functions.core.column.ColumnOperation(column, operation, value=None, name=None)[source]
Bases: Column
Represents a column operation (comparison, arithmetic, etc.).
This class encapsulates column operations and their operands for evaluation during DataFrame operations. Inherits from Column to ensure isinstance() checks pass for PySpark compatibility.
Initialize ColumnOperation.
- Parameters:
- alias(*alias_names)[source]
Create an alias for this operation (PySpark: one or more names, e.g. posexplode).
- Parameters:
alias_names (str)
- Return type:
- getField(index_or_name)[source]
Access array element by index or struct field by name (PySpark getField).
- Parameters:
- Return type:
Literals
Literal values for Sparkless.
This module provides Literal class for representing literal values in column expressions and transformations.
- class sparkless.functions.core.literals.Literal(value, data_type=None, resolver=None)[source]
Bases: IColumn
Literal value for DataFrame operations.
Represents a literal value that can be used in column expressions and transformations, maintaining compatibility with PySpark’s lit function.
Initialize Literal.
- Parameters:
- __eq__(other)[source]
Equality comparison.
Note: Returns ColumnOperation instead of bool for PySpark compatibility.
- Parameters:
other (Any)
- Return type:
- __ne__(other)[source]
Inequality comparison.
Note: Returns ColumnOperation instead of bool for PySpark compatibility.
- Parameters:
other (Any)
- Return type:
- __ge__(other)[source]
Greater than or equal comparison.
- Parameters:
other (Any)
- Return type:
IColumn
- eqNullSafe(other)[source]
Null-safe equality comparison (PySpark eqNullSafe).
This behaves like PySpark’s eqNullSafe:
- If both sides are null, the comparison is True.
- If exactly one side is null, the comparison is False.
- Otherwise, it behaves like standard equality, including any backend-specific type coercion rules.
- Parameters:
other (Any)
- Return type:
- isin(*values)[source]
Check if literal value is in list of values.
- Parameters:
values (Any)
- Return type:
- between(lower, upper)[source]
Check if literal value is between lower and upper bounds.
- Parameters:
- Return type:
- astype(data_type)[source]
Cast literal to different data type (alias for cast).
This method is an alias for cast() and matches PySpark’s API.
- Parameters:
data_type (Union[DataType, str]) – The target data type (DataType object or string name).
- Return type:
- Returns:
ColumnOperation representing the cast operation.
Example
>>> F.lit(1).astype("string")
- when(condition, value)[source]
Start a CASE WHEN expression.
- Parameters:
condition (ColumnOperation)
value (Any)
- Return type:
UDF Functions
User-Defined Function (UDF) implementation for Sparkless.
This module provides the UserDefinedFunction class for wrapping Python functions to use in DataFrame transformations.
- class sparkless.functions.udf.UserDefinedFunction(func, returnType, name=None, evalType='SQL')[source]
Bases: object
User-defined function wrapper (all PySpark versions).
Wraps a Python function to be used in DataFrame transformations. Supports marking as nondeterministic and applying to columns.
Example
>>> def upper_case(s):
...     return s.upper()
>>> udf_func = UserDefinedFunction(upper_case, StringType())
>>> df.select(udf_func("name").alias("upper_name"))
Initialize UserDefinedFunction.
- Parameters:
- asNondeterministic()[source]
Mark UDF as nondeterministic.
Nondeterministic UDFs may return different results for the same input. This affects query optimization and caching.
- Return type:
- Returns:
Self with nondeterministic flag set
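The wrapper pattern described above (hold a Python function, carry a return type, and allow a nondeterministic flag to be set through a chainable method) can be sketched in a few lines. The class `SimpleUDF` is a hypothetical stand-in that applies the function to plain values; it is not the Sparkless UserDefinedFunction, which operates on columns.

```python
from typing import Any, Callable, Optional


class SimpleUDF:
    """Hypothetical value-level stand-in for a user-defined function wrapper."""

    def __init__(self, func: Callable[..., Any], return_type: str, name: Optional[str] = None) -> None:
        self.func = func
        self.return_type = return_type
        self.name = name or func.__name__
        self.deterministic = True  # assumed default: same input gives same output

    def asNondeterministic(self) -> "SimpleUDF":
        """Clear the deterministic flag and return self so calls can be chained."""
        self.deterministic = False
        return self

    def __call__(self, *values: Any) -> Any:
        # The real wrapper would build a column expression; here we just call the function.
        return self.func(*values)
```

Returning self from asNondeterministic is what makes a call such as `SimpleUDF(f, "string").asNondeterministic()` usable as a single expression.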
- class sparkless.functions.udf.UserDefinedTableFunction(func, returnType, name=None)[source]
Bases: object
User-defined table function wrapper (PySpark 3.5+).
Wraps a Python function that returns multiple rows (table-valued function). Similar to UserDefinedFunction but for functions that return tables.
Example
>>> def split_string(s):
...     return [(char,) for char in s]
>>> table_udf = UserDefinedTableFunction(split_string, StructType([...]))
>>> df.select(table_udf("name").alias("chars"))
Initialize UserDefinedTableFunction.
- Parameters: