Pandas Fallback
Sparkless ships with a lightweight stub that mimics the minimal slice of the pandas API that appears in tests and documentation examples. The stub keeps the default installation lean and avoids the heavy NumPy dependency tree. When parity with native pandas is needed, you can opt in to the real implementation and compare the two backends using the benchmarking harness described below.
Installing the Native Backend
pip install ".[pandas]"  # quotes stop shells such as zsh from globbing the brackets
The optional extra brings in pandas and type stubs only; NumPy wheels are
pulled in transitively where available.
Switching Backends
Set the MOCK_SPARK_PANDAS_MODE environment variable before importing pandas:
| Value | Behaviour |
|---|---|
| stub | Always use the built-in shim (default) |
| native | Require real pandas; raise if it is missing |
| auto | Use native pandas when installed, otherwise fall back to the stub |
export MOCK_SPARK_PANDAS_MODE=native # or stub / auto
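The three modes can be read as a small decision rule. The sketch below is illustrative only and is not the shim's actual code: `resolve_backend`, its `native_available` flag, and the `ValueError` raised for an unrecognised value are all assumptions made for the example.

```python
import os

VALID_MODES = ("stub", "native", "auto")


def resolve_backend(native_available, environ=None):
    """Return "stub" or "native" for the configured mode (hypothetical helper)."""
    env = os.environ if environ is None else environ
    # An unset variable behaves like "stub", the documented default.
    mode = env.get("MOCK_SPARK_PANDAS_MODE", "stub").strip().lower()
    if mode not in VALID_MODES:
        # Assumption: how the real shim treats unknown values is not documented here.
        raise ValueError(f"unknown MOCK_SPARK_PANDAS_MODE: {mode!r}")
    if mode == "native" and not native_available:
        # "native" is strict: a missing pandas installation is an error.
        raise ImportError("MOCK_SPARK_PANDAS_MODE=native but pandas is not installed")
    if mode == "auto":
        # "auto" prefers native pandas and silently falls back to the stub.
        return "native" if native_available else "stub"
    return mode


print(resolve_backend(native_available=False, environ={"MOCK_SPARK_PANDAS_MODE": "auto"}))
```

Because the variable has to be set before pandas is imported, an in-process override via `os.environ` would need to run before the first `import pandas` in the interpreter.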
You can also query the active backend at runtime:
import pandas
print(pandas.get_backend()) # "native" or "stub"
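If a script might run against a module that does not expose `get_backend()` (for example a plain pandas installation imported without the shim, which is an assumption about your environment), a defensive lookup avoids an `AttributeError`. The helper name `active_backend` and the `"unknown"` fallback value are hypothetical.

```python
from types import SimpleNamespace


def active_backend(module):
    """Return the module's reported backend, or "unknown" if it has no hook."""
    getter = getattr(module, "get_backend", None)
    if callable(getter):
        return getter()
    return "unknown"


# Stand-ins for a shim-wrapped module and a module without the hook.
with_hook = SimpleNamespace(get_backend=lambda: "stub")
without_hook = SimpleNamespace()
print(active_backend(with_hook), active_backend(without_hook))
```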
Running the Benchmark Suite
The helper script compares core operations (DataFrame construction, to_dict,
concat, and basic indexing) between the stub and native backends.
python scripts/benchmark_pandas_fallback.py --rows 50000 --samples 7
Example output (M3 Pro laptop, macOS 14.6.1):
Backend | Create (mean ms) | to_dict (mean ms) | concat (mean ms) | iloc (mean ms)
------------------------------------------------------------------------
stub | 4.812 | 2.337 | 3.921 | 0.118
native | 7.604 | 1.281 | 2.017 | 0.064
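The "mean ms" columns suggest each operation is timed over several samples and averaged. A minimal sketch of that kind of measurement follows; it is not the helper script's code, and `mean_ms` and the list-of-dicts workload are stand-ins chosen so the example runs without pandas.

```python
import statistics
import time


def mean_ms(func, samples=7):
    """Call func `samples` times and return the mean wall-clock time in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        func()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings)


# Hypothetical workload: pull one column out of 50,000 row dictionaries.
rows = [{"id": i, "value": i * 2} for i in range(50_000)]
elapsed = mean_ms(lambda: [row["value"] for row in rows])
print(f"column extraction: {elapsed:.3f} ms")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock; averaging over a handful of samples smooths out one-off scheduling noise but does not remove it.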
Export raw metrics for further analysis:
python scripts/benchmark_pandas_fallback.py --output benchmark.json
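Once metrics are exported, per-operation ratios are often easier to read than raw timings. The layout of `benchmark.json` is not documented here, so the sketch below assumes a hypothetical shape (backend name mapping to operation name mapping to mean milliseconds) and uses the figures from the example table above as sample data.

```python
import json


def speedups(metrics):
    """Return stub_ms / native_ms per operation; values above 1 mean native was faster."""
    stub, native = metrics["stub"], metrics["native"]
    return {op: stub[op] / native[op] for op in stub if op in native}


# Assumed schema; check the real file before relying on these keys.
sample = json.loads("""
{
  "stub":   {"create": 4.812, "to_dict": 2.337, "concat": 3.921, "iloc": 0.118},
  "native": {"create": 7.604, "to_dict": 1.281, "concat": 2.017, "iloc": 0.064}
}
""")

for op, ratio in speedups(sample).items():
    print(f"{op}: {ratio:.2f}x")
```

With these sample figures the ratio for DataFrame construction comes out below 1 (the stub was faster), while the other three operations come out above 1.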
Interpreting Results
Stub strengths: predictable performance, zero external dependencies, ideal for unit tests and CI pipelines without binary wheels.
Native strengths: faster to_dict/concat operations on larger datasets, full pandas feature coverage for interactive analysis or downstream tooling.
Choose the mode that best matches your scenario:
For CI and fast feedback loops, keep the default stub enabled.
For parity investigations or integration environments where pandas is already present, enable the native backend with MOCK_SPARK_PANDAS_MODE=native.
The benchmarking harness provides a quick way to check for performance regressions when upgrading either the stub or the native dependency chain.