Backend Selection
Sparkless ships with multiple storage backends. Polars is the default engine in v3.x, replacing the legacy DuckDB implementation used in previous releases. You can still switch between backends when you need specific behaviour or when benchmarks favour alternative engines.
Available Backends
- polars (default) - fast, thread-safe, and needs no installation beyond the Sparkless dependency.
- memory - compatibility shim that keeps all data in Python collections. Useful for extremely small unit tests or debugging expression translation.
- file - persists tables under sparkless_storage/ as JSON/CSV files. Handy for sharing fixtures across sessions.
- duckdb (optional) - legacy SQL-backed engine. Requires the sparkless.backend.duckdb modules to be installed (available in the 2.x releases) alongside the duckdb/duckdb-engine Python packages.
- robin (optional) - Rust/Polars engine via the robin-sparkless package (0.4.0+). Install with pip install sparkless[robin] or pip install "robin-sparkless>=0.4.0" (quote the requirement so the shell does not treat > as a redirect). If the package is not installed, selecting robin raises a ValueError with install instructions.
Call sparkless.backend.factory.BackendFactory.list_available_backends() to see
which backends are currently importable in your environment.
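A minimal sketch of that check, guarded so that it also runs where Sparkless is not installed. The import path is the one named above; the helper name available_backends is hypothetical, and the return value is assumed to be an iterable of backend names.

```python
def available_backends():
    """Return the backend names Sparkless reports, or [] if Sparkless is missing."""
    try:
        from sparkless.backend.factory import BackendFactory
    except ImportError:
        # Sparkless itself is not importable in this environment.
        return []
    # Assumption: the call returns an iterable of backend name strings.
    return list(BackendFactory.list_available_backends())


print(available_backends())
```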
Selecting a Backend
Sparkless resolves the backend in the following order:
1. Explicit constructor argument - SparkSession(backend_type="memory")
2. Environment variable - set SPARKLESS_BACKEND=memory before starting the process. This is convenient for CI toggles.
3. Builder configuration - SparkSession.builder.config("spark.sparkless.backend", "file")
4. Default - falls back to polars when no overrides are provided.
Examples:
import os
from sparkless import SparkSession
# Environment variable
os.environ["SPARKLESS_BACKEND"] = "memory"
spark = SparkSession()
# Builder configuration (per the order above, applies only when SPARKLESS_BACKEND is unset)
spark = (
    SparkSession.builder
    .config("spark.sparkless.backend", "file")
    .getOrCreate()
)
# Direct constructor
spark = SparkSession(backend_type="polars")
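The resolution order can be illustrated with a small standalone function. This is only a sketch of the documented precedence, not Sparkless's actual implementation: resolve_backend and its parameters are hypothetical names, and only the SPARKLESS_BACKEND variable and the spark.sparkless.backend key come from this page.

```python
import os

DEFAULT_BACKEND = "polars"


def resolve_backend(backend_type=None, builder_config=None, environ=None):
    """Hypothetical helper that mirrors the documented precedence."""
    environ = os.environ if environ is None else environ
    builder_config = builder_config or {}

    if backend_type:                      # 1. explicit constructor argument
        return backend_type
    env_value = environ.get("SPARKLESS_BACKEND")
    if env_value:                         # 2. environment variable
        return env_value
    configured = builder_config.get("spark.sparkless.backend")
    if configured:                        # 3. builder configuration
        return configured
    return DEFAULT_BACKEND                # 4. default


# The environment variable takes precedence over the builder setting.
print(resolve_backend(
    builder_config={"spark.sparkless.backend": "file"},
    environ={"SPARKLESS_BACKEND": "memory"},
))
```

Under this order, an environment variable left set in the process overrides a builder setting, so clear it when a test relies on builder configuration.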
Behavioural Notes
- Polars enforces strict schemas and eager type conversion. When migrating from DuckDB you may need the datetime helpers described in docs/known_issues.md.
- The memory backend is intended for developer diagnostics; it is not feature-complete and skips several optimisations.
- DuckDB support is best-effort. If the optional modules are missing, selecting duckdb raises a clear ValueError suggesting the required packages.
- Robin support is optional. If robin-sparkless is not installed, robin does not appear in list_available_backends() and selecting it raises a ValueError with install instructions.
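Because an unavailable optional backend surfaces as a ValueError, a caller can fall back to another engine. The sketch below shows one way to do that; first_working_backend is a hypothetical helper, and fake_create_session stands in for a real call such as SparkSession(backend_type=name) so that the example runs without Sparkless installed.

```python
def first_working_backend(create_session, preferred=("robin", "polars")):
    """Try each backend name in order; return (name, session) for the first that works.

    `create_session` is any callable that takes a backend name and either
    returns a session or raises ValueError.
    """
    errors = {}
    for name in preferred:
        try:
            return name, create_session(name)
        except ValueError as exc:  # e.g. optional backend not installed
            errors[name] = str(exc)
    raise ValueError(f"No usable backend among {list(preferred)}: {errors}")


def fake_create_session(name):
    # Stand-in for SparkSession(backend_type=name); pretends robin is missing.
    if name == "robin":
        raise ValueError("robin backend not installed")
    return f"session[{name}]"


print(first_working_backend(fake_create_session))
```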
Robin backend (optional)
When using backend_type="robin", Sparkless runs in pure Robin mode (no
fallback to Polars). With robin-sparkless 0.4.0+, the Robin materializer
supports: filter (including Column-to-Column comparisons), select (column
names and Column expressions), limit, orderBy, withColumn (Column
expressions), join, union, distinct, and drop. groupBy + agg
and some expressions (e.g. F.regexp_extract_all(...)) may not yet be
supported and will raise SparkUnsupportedOperationError. For a full list and
recommended Sparkless improvements, see robin_compatibility_recommendations.md.
When running tests in Robin mode with many parallel workers, you may see worker
crashes ("node down: Not properly terminated") or an INTERNALERROR. Use fewer
workers (e.g. -n 4) or run serially (-n 0); the test runner defaults to 4
workers for Robin. See robin_mode_worker_crash_investigation.md.
Running tests with a specific backend
To run the full test suite using the Robin backend (requires pip install sparkless[robin]):
SPARKLESS_TEST_BACKEND=robin bash tests/run_all_tests.sh
Or with pytest directly:
SPARKLESS_TEST_BACKEND=robin SPARKLESS_BACKEND=robin python -m pytest tests/ -n 10 --dist loadfile -v
# Without parallelism:
SPARKLESS_BACKEND=robin python -m pytest tests/ -v
Individual tests can request the Robin backend via the marker:
@pytest.mark.backend('robin').
Troubleshooting
- ValueError: Unsupported backend type - verify the spelling and check the available list via BackendFactory.list_available_backends().
- DuckDB import errors - install Sparkless 2.x (which includes the DuckDB modules) or vendor the legacy backend into your project. You also need the duckdb and duckdb-engine pip packages.
- Robin backend not available - install with pip install sparkless[robin] or pip install robin-sparkless; then robin will appear in list_available_backends().
- Robin: worker crashes or INTERNALERROR - when using pytest-xdist with Robin, use fewer workers (e.g. -n 4) or run serially (-n 0). See robin_mode_worker_crash_investigation.md.
- Permission issues with file backend - adjust the base path by passing SparkSession.builder.config("spark.sparkless.backend.basePath", "/tmp/mock") and ensure the process can read/write there.
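For the file backend permission case, it can help to confirm that a directory is writable before handing it to the builder. The following is a minimal sketch using only the standard library; writable_base_path is a hypothetical helper, and falling back to a temporary directory is an assumption about what a test setup might want, not Sparkless behaviour.

```python
import os
import tempfile


def writable_base_path(candidate):
    """Return `candidate` if a file can be created there, else a fresh temp directory."""
    try:
        os.makedirs(candidate, exist_ok=True)
        # Creating a real file is a more reliable probe than os.access().
        with tempfile.NamedTemporaryFile(dir=candidate):
            pass
        return candidate
    except OSError:
        return tempfile.mkdtemp(prefix="sparkless_")


base_path = writable_base_path(os.path.join(tempfile.gettempdir(), "mock"))
print(base_path)
# The result could then be passed to:
# SparkSession.builder.config("spark.sparkless.backend.basePath", base_path)
```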