# Backend Selection

Sparkless ships with multiple storage backends. Polars is the default engine in v3.x, replacing the legacy DuckDB implementation used in previous releases. You can still switch between backends when you need specific behaviour or when benchmarks favour alternative engines.

## Available Backends

- `polars` (default) – fast, thread-safe, zero-install beyond the Sparkless dependency.
- `memory` – compatibility shim that keeps all data in Python collections. Useful for extremely small unit tests or debugging expression translation.
- `file` – persists tables under `sparkless_storage/` as JSON/CSV files. Handy for sharing fixtures across sessions.
- `duckdb` (optional) – legacy SQL-backed engine. Requires the `sparkless.backend.duckdb` modules to be installed (available in the 2.x releases) alongside the `duckdb`/`duckdb-engine` Python packages.
- `robin` (optional) – Rust/Polars engine via the `robin-sparkless` package (0.4.0+). Install with `pip install sparkless[robin]` or `pip install "robin-sparkless>=0.4.0"` (the quotes stop the shell from treating `>=` as a redirect). If the package is not installed, selecting `robin` raises a `ValueError` with install instructions.

Call `sparkless.backend.factory.BackendFactory.list_available_backends()` to see which backends are currently importable in your environment.

## Selecting a Backend

Sparkless resolves the backend in the following order:

1. **Explicit constructor argument** – `SparkSession(backend_type="memory")`
2. **Environment variable** – set `SPARKLESS_BACKEND=memory` before starting the process. This is convenient for CI toggles.
3. **Builder configuration** – `SparkSession.builder.config("spark.sparkless.backend", "file")`
4. **Default** – falls back to `polars` when no overrides are provided.
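The resolution order above can be read as a first-match chain. The following is a minimal standalone sketch of that precedence, not Sparkless's actual implementation: `resolve_backend` and its parameters are hypothetical names introduced for illustration, and only the `SPARKLESS_BACKEND` variable, the `spark.sparkless.backend` key, and the `polars` default come from this page.

```python
import os
from typing import Mapping, Optional

DEFAULT_BACKEND = "polars"


def resolve_backend(
    explicit: Optional[str] = None,
    builder_conf: Optional[Mapping[str, str]] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper that mirrors the documented precedence."""
    # Allow a fake environment to be injected; otherwise read the real one.
    environ = os.environ if environ is None else environ
    builder_conf = builder_conf or {}

    # 1. An explicit constructor argument wins over everything else.
    if explicit:
        return explicit

    # 2. Next, the SPARKLESS_BACKEND environment variable.
    env_value = environ.get("SPARKLESS_BACKEND")
    if env_value:
        return env_value

    # 3. Then the builder configuration key.
    conf_value = builder_conf.get("spark.sparkless.backend")
    if conf_value:
        return conf_value

    # 4. Finally, the default engine.
    return DEFAULT_BACKEND
```

One consequence of this ordering, if the sketch matches the real behaviour, is that a `SPARKLESS_BACKEND` value left in the environment would override a backend requested through the builder.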
Examples:

```python
import os

from sparkless import SparkSession

# Environment variable
os.environ["SPARKLESS_BACKEND"] = "memory"
spark = SparkSession()

# Builder configuration
# Clear the variable set above first: per the resolution order,
# the environment variable takes precedence over builder configuration.
os.environ.pop("SPARKLESS_BACKEND", None)
spark = (
    SparkSession.builder
    .config("spark.sparkless.backend", "file")
    .getOrCreate()
)

# Direct constructor
spark = SparkSession(backend_type="polars")
```

## Behavioural Notes

- Polars enforces strict schemas and eager type conversion. When migrating from DuckDB you may need the datetime helpers described in `docs/known_issues.md`.
- The memory backend is intended for developer diagnostics; it is not feature-complete and skips several optimisations.
- DuckDB support is best-effort. If the optional modules are missing, selecting `duckdb` raises a clear `ValueError` suggesting the required packages.
- Robin support is optional. If `robin-sparkless` is not installed, `robin` does not appear in `list_available_backends()` and selecting it raises a `ValueError` with install instructions.

## Robin backend (optional)

When using `backend_type="robin"`, Sparkless runs in **pure Robin mode** (no fallback to Polars). With **robin-sparkless 0.4.0+**, the Robin materializer supports: **filter** (including Column–Column comparisons), **select** (column names and Column expressions), **limit**, **orderBy**, **withColumn** (Column expressions), **join**, **union**, **distinct**, and **drop**. **groupBy** + agg and some expressions (e.g. `F.regexp_extract_all(...)`) may not yet be supported and will raise `SparkUnsupportedOperationError`. For a full list and recommended Sparkless improvements, see [robin_compatibility_recommendations.md](robin_compatibility_recommendations.md).

When running tests in Robin mode with many parallel workers, you may see worker crashes ("node down: Not properly terminated") or an INTERNALERROR. Use fewer workers (e.g. `-n 4`) or run serially (`-n 0`); the test runner defaults to 4 workers for Robin.
See [robin_mode_worker_crash_investigation.md](robin_mode_worker_crash_investigation.md).

## Running tests with a specific backend

To run the full test suite using the Robin backend (requires `pip install sparkless[robin]`):

```bash
SPARKLESS_TEST_BACKEND=robin bash tests/run_all_tests.sh
```

Or with pytest directly:

```bash
SPARKLESS_TEST_BACKEND=robin SPARKLESS_BACKEND=robin python -m pytest tests/ -n 10 --dist loadfile -v

# Without parallelism:
SPARKLESS_BACKEND=robin python -m pytest tests/ -v
```

Individual tests can request the Robin backend via the marker: `@pytest.mark.backend('robin')`.

## Troubleshooting

- **`ValueError: Unsupported backend type`** – verify the spelling and check the available list via `BackendFactory.list_available_backends()`.
- **DuckDB import errors** – install Sparkless 2.x (which includes the DuckDB modules) or vendor the legacy backend into your project. You also need the `duckdb` and `duckdb-engine` pip packages.
- **Robin backend not available** – install with `pip install sparkless[robin]` or `pip install robin-sparkless`; then `robin` will appear in `list_available_backends()`.
- **Robin: worker crashes or INTERNALERROR** – when using pytest-xdist with Robin, use fewer workers (e.g. `-n 4`) or serial (`-n 0`). See [robin_mode_worker_crash_investigation.md](robin_mode_worker_crash_investigation.md).
- **Permission issues with `file` backend** – adjust the base path by passing `SparkSession.builder.config("spark.sparkless.backend.basePath", "/tmp/mock")` and ensure the process can read/write there.
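
Several of the errors above come from requesting a backend that is not importable. One way to avoid the `ValueError` is to check the name against the available list before creating a session. The helper below is a hypothetical sketch, not part of Sparkless: `pick_backend` is an illustrative name, the lower-casing is a convenience of this sketch rather than documented behaviour, and it assumes `list_available_backends()` returns an iterable of backend name strings.

```python
from typing import Iterable


def pick_backend(
    preferred: str,
    available: Iterable[str],
    default: str = "polars",
) -> str:
    """Hypothetical helper: return `preferred` if available, else `default`."""
    # Normalise both sides so casing or stray whitespace does not cause a miss.
    names = {name.strip().lower() for name in available}
    wanted = preferred.strip().lower()
    return wanted if wanted in names else default


# Example with a hard-coded list standing in for the factory's result.
backend = pick_backend("robin", ["polars", "memory", "file"])
print(backend)  # prints "polars" because "robin" is not in the list
```

In a real project the second argument would presumably be the result of `BackendFactory.list_available_backends()`, and the returned name could be passed as `SparkSession(backend_type=...)`.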