Configuration
Sparkless configuration is managed via SparkSession constructor options and the session builder.
Basic Configuration
from sparkless.sql import SparkSession
spark = SparkSession(
validation_mode="relaxed", # strict | relaxed | minimal
enable_type_coercion=True,
)
Key settings:
validation_mode: controls strictness of schema/data checks
enable_type_coercion: attempts to coerce types during DataFrame creation
Case Sensitivity Configuration
Sparkless supports PySpark-compatible case sensitivity configuration via spark.sql.caseSensitive:
# Default: case-insensitive (matches PySpark default)
spark = SparkSession("MyApp")
assert spark.conf.is_case_sensitive() == False
# Enable case-sensitive mode
spark = SparkSession.builder \
.config("spark.sql.caseSensitive", "true") \
.getOrCreate()
assert spark.conf.is_case_sensitive() == True
Default Behavior (case-insensitive):
Column names can be referenced in any case (e.g., df.select("name") matches the "Name" column)
Output preserves original column names from schema
Matches PySpark's default behavior
Case-Sensitive Mode:
Column names must match exactly (case-sensitive)
Useful when working with data sources that have case-sensitive schemas
Can be enabled via spark.conf.set("spark.sql.caseSensitive", "true") or builder config
Ambiguity Detection:
When multiple columns differ only by case (e.g., ["Name", "name"]) and case-insensitive mode is enabled, resolution raises AnalysisException due to ambiguity
This matches PySpark behavior
Example:
from sparkless.sql import SparkSession
# Case-insensitive (default)
spark = SparkSession("TestApp")
df = spark.createDataFrame([{"Name": "Alice", "Age": 25}])
result = df.select("name").collect() # Works - resolves to "Name"
assert result[0]["Name"] == "Alice" # Output uses original case "Name"
# Case-sensitive
spark.conf.set("spark.sql.caseSensitive", "true")
result = df.select("Name").collect() # Must use exact case
# df.select("name") would raise column not found error
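The resolution rules above can be sketched in plain Python. This is only an illustration of the documented behavior, not Sparkless's actual implementation: resolve_column and AmbiguousColumnError are hypothetical names introduced here, with AmbiguousColumnError standing in for AnalysisException and KeyError standing in for the column-not-found error.

```python
from typing import Sequence


class AmbiguousColumnError(Exception):
    """Hypothetical stand-in for AnalysisException in this sketch."""


def resolve_column(name: str, columns: Sequence[str], case_sensitive: bool = False) -> str:
    """Return the schema column that `name` refers to (hypothetical helper)."""
    if case_sensitive:
        # Case-sensitive mode: the reference must match the schema exactly.
        if name in columns:
            return name
        raise KeyError(f"column not found: {name!r}")

    # Case-insensitive mode: compare lowercased names, keep the schema's spelling.
    matches = [c for c in columns if c.lower() == name.lower()]
    if not matches:
        raise KeyError(f"column not found: {name!r}")
    if len(matches) > 1:
        # Columns that differ only by case cannot be told apart in this mode.
        raise AmbiguousColumnError(f"reference {name!r} is ambiguous: {matches}")
    return matches[0]
```

With a schema of ["Name", "Age"], resolve_column("name", ...) returns "Name" in the default mode and raises in case-sensitive mode. With ["Name", "name"], the default mode raises the ambiguity error.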
Backend Configuration (v3.0.0+)
Default Backend (Polars)
# Polars is the default backend in v3.0.0+
spark = SparkSession("MyApp")
Explicit Backend Selection
# Use Polars explicitly
spark = SparkSession.builder \
.config("spark.sparkless.backend", "polars") \
.getOrCreate()
# Use DuckDB backend (legacy, requires duckdb package)
spark = SparkSession.builder \
.config("spark.sparkless.backend", "duckdb") \
.config("spark.sparkless.backend.maxMemory", "4GB") \
.config("spark.sparkless.backend.allowDiskSpillover", True) \
.getOrCreate()
# Use memory backend
spark = SparkSession.builder \
.config("spark.sparkless.backend", "memory") \
.getOrCreate()
Backend-Specific Options
Polars Backend (default):
No configuration needed - Polars handles memory and performance automatically
Thread-safe by design
Uses Parquet files for persistence
DuckDB Backend (legacy):
spark.sparkless.backend.maxMemory: Maximum memory (e.g., "1GB", "4GB")
spark.sparkless.backend.allowDiskSpillover: Allow disk spillover when memory is full
Note: The maxMemory and allowDiskSpillover options are ignored by the Polars backend.
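Because the memory options only apply to DuckDB, one way to keep builder calls tidy is to assemble the config pairs in a small helper first. The sketch below is an assumption-laden illustration: backend_config is a hypothetical function, not part of Sparkless. Only the config keys and the backend names ("polars", "duckdb", "memory") come from this page.

```python
from typing import Dict, Optional, Union

ConfigValue = Union[str, bool]

# Backend names as listed in this section.
KNOWN_BACKENDS = ("polars", "duckdb", "memory")


def backend_config(
    backend: str = "polars",
    max_memory: Optional[str] = None,
    allow_disk_spillover: Optional[bool] = None,
) -> Dict[str, ConfigValue]:
    """Build config key/value pairs for a backend (hypothetical helper)."""
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")

    config: Dict[str, ConfigValue] = {"spark.sparkless.backend": backend}

    # The memory options are documented as DuckDB options, so this sketch
    # emits them only for DuckDB and drops them for every other backend.
    if backend == "duckdb":
        if max_memory is not None:
            config["spark.sparkless.backend.maxMemory"] = max_memory
        if allow_disk_spillover is not None:
            config["spark.sparkless.backend.allowDiskSpillover"] = allow_disk_spillover
    return config
```

Each resulting pair could then be passed to the session builder through .config(key, value) before calling getOrCreate().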
Performance knobs
You can tune mock behaviour per pipeline using the following settings.
Lazy vs eager evaluation: SparkSession(..., enable_lazy_evaluation=True) (default) defers execution until an action (collect, show, count, etc.). Set to False for legacy eager behaviour.
Logical plan path: Set spark.conf.set("spark.sparkless.useLogicalPlan", "true") (or builder config) to use the serialized logical plan path when the backend supports it (e.g. Robin). This can change execution strategy and performance.
Backend selection: The Polars backend is the default and is optimized for in-memory pipelines. For very large or I/O-bound workloads, consider splitting data or using the appropriate backend (see Backend selection).
Profiling: Optional profiling utilities are documented in Profiling. Use them to identify hot paths before tuning.
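To make the lazy vs eager distinction concrete, here is a toy model in plain Python. It does not use or mirror Sparkless internals. LazyPipeline is a hypothetical class that queues transformations and runs them only when an action (collect) is called, or runs them immediately when lazy=False.

```python
from typing import Any, Callable, Iterable, List


class LazyPipeline:
    """Toy model of lazy vs eager evaluation (hypothetical, for illustration)."""

    def __init__(self, rows: Iterable[Any], lazy: bool = True) -> None:
        self._rows: List[Any] = list(rows)
        self._pending: List[Callable[[Any], Any]] = []
        self._lazy = lazy
        self.executed_steps = 0  # counts transformations that have actually run

    def _apply(self, fn: Callable[[Any], Any]) -> None:
        self._rows = [fn(row) for row in self._rows]
        self.executed_steps += 1

    def map(self, fn: Callable[[Any], Any]) -> "LazyPipeline":
        """Transformation: queued in lazy mode, run at once in eager mode."""
        if self._lazy:
            self._pending.append(fn)
        else:
            self._apply(fn)
        return self

    def collect(self) -> List[Any]:
        """Action: run any queued transformations, then return the rows."""
        for fn in self._pending:
            self._apply(fn)
        self._pending.clear()
        return list(self._rows)


lazy = LazyPipeline([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 2)
steps_before_action = lazy.executed_steps  # nothing has run yet
result = lazy.collect()                    # both steps run here
```

In the lazy case no work happens until collect() is called. With lazy=False, each map call does its work immediately, which corresponds to the legacy eager behaviour described above.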