Backend Selection

Sparkless ships with multiple storage backends. Polars is the default engine in v3.x, replacing the legacy DuckDB implementation used in previous releases. You can still switch between backends when you need specific behaviour or when benchmarks favour alternative engines.

Available Backends

  • polars (default) – fast, thread-safe, and needs no installation beyond Sparkless itself.

  • memory – compatibility shim that keeps all data in Python collections. Useful for extremely small unit tests or debugging expression translation.

  • file – persists tables under sparkless_storage/ as JSON/CSV files. Handy for sharing fixtures across sessions.

  • duckdb (optional) – legacy SQL-backed engine. Requires the sparkless.backend.duckdb modules (available in the 2.x releases) as well as the duckdb and duckdb-engine Python packages.

  • robin (optional) – Rust/Polars engine via the robin-sparkless package (0.4.0+). Install with pip install sparkless[robin] or pip install "robin-sparkless>=0.4.0" (quote the requirement so the shell does not treat >= as a redirect). If the package is not installed, selecting robin raises a ValueError with install instructions.

Call sparkless.backend.factory.BackendFactory.list_available_backends() to see which backends are currently importable in your environment.
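A minimal sketch of that check follows. The wrapper name available_backends is hypothetical; the only Sparkless call used is BackendFactory.list_available_backends() as named above, and the sketch assumes it returns an iterable of backend names. The import is guarded so the snippet also runs where Sparkless is not installed.

```python
def available_backends():
    """Return the importable backend names, or [] if Sparkless is absent.

    Hypothetical helper; it only wraps the documented
    BackendFactory.list_available_backends() call.
    """
    try:
        from sparkless.backend.factory import BackendFactory
    except ImportError:
        # Sparkless itself is not installed, so no backend can be selected.
        return []
    # Assumption: the call returns an iterable of backend name strings.
    return list(BackendFactory.list_available_backends())


backends = available_backends()
print("available backends:", backends)
if "robin" not in backends:
    print("robin is not importable here; see the install notes above")
```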

Selecting a Backend

Sparkless resolves the backend in the following order:

  1. Explicit constructor argument – SparkSession(backend_type="memory")

  2. Environment variable – set SPARKLESS_BACKEND=memory before starting the process. This is convenient for CI toggles.

  3. Builder configuration – SparkSession.builder.config("spark.sparkless.backend", "file")

  4. Default – falls back to polars when no overrides are provided.

Examples:

import os
from sparkless import SparkSession

# Environment variable
os.environ["SPARKLESS_BACKEND"] = "memory"
spark = SparkSession()

# Builder configuration
spark = (
    SparkSession.builder
    .config("spark.sparkless.backend", "file")
    .getOrCreate()
)

# Direct constructor
spark = SparkSession(backend_type="polars")
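The resolution order listed above can be pictured with a small stand-alone function. This is only an illustration of the documented precedence, not Sparkless's actual implementation: resolve_backend and DEFAULT_BACKEND are hypothetical names, and treating empty strings as "not set" is an assumption.

```python
import os

DEFAULT_BACKEND = "polars"  # documented default in v3.x


def resolve_backend(backend_type=None, builder_conf=None, environ=None):
    """Pick a backend name using the documented precedence (illustrative only)."""
    environ = os.environ if environ is None else environ
    builder_conf = builder_conf or {}
    candidates = (
        backend_type,                                # 1. explicit constructor argument
        environ.get("SPARKLESS_BACKEND"),            # 2. environment variable
        builder_conf.get("spark.sparkless.backend"), # 3. builder configuration
    )
    for candidate in candidates:
        if candidate:  # assumption: empty values count as unset
            return candidate
    return DEFAULT_BACKEND                           # 4. default


# The explicit argument wins over both other sources.
print(resolve_backend(
    "memory",
    builder_conf={"spark.sparkless.backend": "file"},
    environ={"SPARKLESS_BACKEND": "robin"},
))
```

Passing environ explicitly keeps the function easy to exercise in tests without touching the real process environment.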

Behavioural Notes

  • Polars enforces strict schemas and eager type conversion. When migrating from DuckDB you may need the datetime helpers described in docs/known_issues.md.

  • The memory backend is intended for developer diagnostics; it is not feature-complete and skips several optimisations.

  • DuckDB support is best-effort. If the optional modules are missing, selecting duckdb raises a clear ValueError suggesting the required packages.

  • Robin support is optional. If robin-sparkless is not installed, robin does not appear in list_available_backends() and selecting it raises a ValueError with install instructions.

Robin backend (optional)

When using backend_type="robin", Sparkless runs in pure Robin mode (no fallback to Polars). With robin-sparkless 0.4.0+, the Robin materializer supports: filter (including Column–Column comparisons), select (column names and Column expressions), limit, orderBy, withColumn (Column expressions), join, union, distinct, and drop. groupBy + agg and some expressions (e.g. F.regexp_extract_all(...)) may not yet be supported and will raise SparkUnsupportedOperationError. For a full list and recommended Sparkless improvements, see robin_compatibility_recommendations.md.

When running tests in Robin mode with many parallel workers, you may see worker crashes (“node down: Not properly terminated”) or an INTERNALERROR. Use fewer workers (e.g. -n 4) or run serially (-n 0); the test runner defaults to 4 workers for Robin. See robin_mode_worker_crash_investigation.md.

Running tests with a specific backend

To run the full test suite using the Robin backend (requires pip install sparkless[robin]):

SPARKLESS_TEST_BACKEND=robin bash tests/run_all_tests.sh

Or with pytest directly:

SPARKLESS_TEST_BACKEND=robin SPARKLESS_BACKEND=robin python -m pytest tests/ -n 10 --dist loadfile -v
# Without parallelism:
SPARKLESS_BACKEND=robin python -m pytest tests/ -v
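If you launch the suite from a script, the same commands can be assembled programmatically. The following is a sketch under stated assumptions: build_pytest_run and ROBIN_MAX_WORKERS are hypothetical names, and capping Robin at 4 workers simply applies the suggestion from the worker-crash note above; it is not something the test runner is known to enforce for custom invocations.

```python
import os
import sys

ROBIN_MAX_WORKERS = 4  # assumption: follows the "-n 4" suggestion for Robin


def build_pytest_run(backend, workers):
    """Return (argv, env) for running the test suite against `backend`."""
    env = dict(os.environ)
    env["SPARKLESS_TEST_BACKEND"] = backend
    env["SPARKLESS_BACKEND"] = backend
    if backend == "robin":
        # Keep the worker count low to reduce the risk of xdist worker crashes.
        workers = min(workers, ROBIN_MAX_WORKERS)
    argv = [sys.executable, "-m", "pytest", "tests/", "-v"]
    if workers > 0:
        # Parallel run via pytest-xdist; omitted entirely for a serial run.
        argv += ["-n", str(workers), "--dist", "loadfile"]
    return argv, env


argv, env = build_pytest_run("robin", workers=10)
print(" ".join(argv[1:]))
# To launch it: subprocess.run(argv, env=env, check=False)
```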

Individual tests can request the Robin backend via the marker: @pytest.mark.backend('robin').

Troubleshooting

  • ValueError: Unsupported backend type – verify the spelling and check the available list via BackendFactory.list_available_backends().

  • DuckDB import errors – install Sparkless 2.x (which includes the DuckDB modules) or vendor the legacy backend into your project. You also need the duckdb and duckdb-engine pip packages.

  • Robin backend not available – install with pip install sparkless[robin] or pip install "robin-sparkless>=0.4.0". Once the package is installed, robin appears in list_available_backends().

  • Robin: worker crashes or INTERNALERROR – when using pytest-xdist with Robin, use fewer workers (e.g. -n 4) or serial (-n 0). See robin_mode_worker_crash_investigation.md.

  • Permission issues with file backend – adjust the base path by passing SparkSession.builder.config("spark.sparkless.backend.basePath", "/tmp/mock") and ensure the process can read/write there.
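For the file backend case, it can help to confirm that a directory is writable before pointing Sparkless at it. This is a minimal sketch using only the standard library; ensure_writable_dir and the demo directory name are hypothetical, and only the spark.sparkless.backend.basePath key comes from the text above.

```python
import os
import tempfile


def ensure_writable_dir(path):
    """Create `path` if needed and report whether files can be written there."""
    try:
        os.makedirs(path, exist_ok=True)
        # Creating (and automatically deleting) a temp file proves write access.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError:
        return False
    return True


# Hypothetical location under the system temp directory.
base_path = os.path.join(tempfile.gettempdir(), "sparkless_storage_demo")
print(base_path, "writable:", ensure_writable_dir(base_path))
```

When the check passes, base_path can be supplied through SparkSession.builder.config("spark.sparkless.backend.basePath", base_path).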