Pandas Fallback

Sparkless ships with a lightweight stub that mimics the small slice of the pandas API used in tests and documentation examples. The stub keeps the default installation lean and avoids the heavy NumPy dependency tree. When parity with native pandas is needed, you can opt in to the real implementation and compare behaviour using the benchmarking harness described below.

Installing the Native Backend

pip install ".[pandas]"   # quotes stop shells such as zsh from expanding the brackets

The optional extra brings in pandas and type stubs only; NumPy wheels are pulled in transitively where available.

Switching Backends

Set the MOCK_SPARK_PANDAS_MODE environment variable before importing pandas:

Value      | Behaviour
------------------------------------------------------------------------
stub       | Always use the built-in shim (default)
native     | Require real pandas; raise if it is missing
auto       | Use native pandas when installed, otherwise fall back to the stub

export MOCK_SPARK_PANDAS_MODE=native   # or stub / auto
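The three modes can be read as a small decision procedure. The sketch below is an illustration of the documented behaviour, not Sparkless's actual code: the function name `resolve_backend`, the `native_available` flag, the choice of `ImportError` for a missing pandas, and the `ValueError` for unrecognised values are all assumptions.

```python
import os
from typing import Mapping, Optional

# The three values documented for MOCK_SPARK_PANDAS_MODE.
VALID_MODES = ("stub", "native", "auto")


def resolve_backend(native_available: bool,
                    environ: Optional[Mapping[str, str]] = None) -> str:
    """Return "stub" or "native" for the configured mode (hypothetical helper)."""
    env = os.environ if environ is None else environ
    # The documentation lists "stub" as the default mode.
    mode = env.get("MOCK_SPARK_PANDAS_MODE", "stub")
    if mode not in VALID_MODES:
        # Assumption: how Sparkless treats unknown values is not documented here.
        raise ValueError(f"unknown mode {mode!r}; expected one of {VALID_MODES}")
    if mode == "stub":
        return "stub"
    if mode == "native":
        if not native_available:
            # Assumption: the exact exception type is not documented here.
            raise ImportError("native mode requested but pandas is not installed")
        return "native"
    # mode == "auto": prefer native pandas, otherwise use the stub.
    return "native" if native_available else "stub"


print(resolve_backend(native_available=False,
                      environ={"MOCK_SPARK_PANDAS_MODE": "auto"}))
```

To select a mode from Python rather than the shell, assign `os.environ["MOCK_SPARK_PANDAS_MODE"]` before the first `import pandas` in the process, since the variable has to be set before that import.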

You can also query the active backend at runtime:

import pandas
print(pandas.get_backend())           # "native" or "stub"
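Code that might also run against a pandas module that was not set up by Sparkless can guard the call. This is a minimal sketch: `describe_backend` is a hypothetical helper, and the assumption is that `get_backend` may be absent on such a module.

```python
from types import SimpleNamespace


def describe_backend(module) -> str:
    """Return module.get_backend() when present, else "unknown" (hypothetical helper)."""
    getter = getattr(module, "get_backend", None)
    if callable(getter):
        return getter()
    return "unknown"


# Stand-in objects so the sketch runs without pandas installed.
with_getter = SimpleNamespace(get_backend=lambda: "stub")
without_getter = SimpleNamespace()

print(describe_backend(with_getter))
print(describe_backend(without_getter))

# In real use, pass the imported module instead:
#   import pandas
#   describe_backend(pandas)
```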

Running the Benchmark Suite

The helper script compares core operations (DataFrame construction, to_dict, concat, and basic indexing) between the stub and native backends.

python scripts/benchmark_pandas_fallback.py --rows 50000 --samples 7
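For readers who want to time an operation by hand, a "mean ms over N samples" figure can be produced along the following lines. This is a generic sketch using the standard library, not the script's implementation; the `mean_ms` helper and the sample workload are illustrative assumptions.

```python
import statistics
import time


def mean_ms(operation, samples: int = 7) -> float:
    """Run operation() `samples` times and return the mean wall time in ms."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        operation()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings)


# Illustrative workload: copy a dict of columns with 50,000 rows each.
rows = 50_000
data = {"id": list(range(rows)), "value": [i * 0.5 for i in range(rows)]}

elapsed = mean_ms(lambda: {name: list(col) for name, col in data.items()})
print(f"column copy: {elapsed:.3f} ms (mean of 7 samples)")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.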

Example output (M3 Pro laptop, macOS 14.6.1):

Backend    | Create (mean ms)   | to_dict (mean ms) | concat (mean ms)  | iloc (mean ms)
------------------------------------------------------------------------
stub       | 4.812              | 2.337             | 3.921             | 0.118
native     | 7.604              | 1.281             | 2.017             | 0.064

Export raw metrics for further analysis:

python scripts/benchmark_pandas_fallback.py --output benchmark.json

Interpreting Results

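  • Stub strengths: predictable performance, zero external dependencies, ideal for unit tests and CI pipelines without binary wheels. In the sample run above, the stub was also faster at DataFrame construction (4.812 ms vs 7.604 ms).

  • Native strengths: faster to_dict, concat, and iloc operations on larger datasets (roughly 1.8x to 1.9x at 50,000 rows in the sample run above), plus full pandas feature coverage for interactive analysis or downstream tooling.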
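One way to read the example output is as stub-to-native ratios per operation. The sketch below hard-codes the figures from the sample table; the dictionary layout and the `stub_over_native` helper are illustrative assumptions, not the script's export format.

```python
# Mean timings in milliseconds, copied from the example output table.
results_ms = {
    "stub":   {"create": 4.812, "to_dict": 2.337, "concat": 3.921, "iloc": 0.118},
    "native": {"create": 7.604, "to_dict": 1.281, "concat": 2.017, "iloc": 0.064},
}


def stub_over_native(results):
    """Return stub time divided by native time for each operation."""
    return {op: results["stub"][op] / results["native"][op]
            for op in results["stub"]}


ratios = stub_over_native(results_ms)
for op, ratio in ratios.items():
    # A ratio above 1 means the stub took longer, so native was faster.
    faster = "native" if ratio > 1 else "stub"
    print(f"{op:8s} stub/native = {ratio:.2f} ({faster} faster)")
```

A ratio below 1 for `create` and above 1 for the other three operations matches the trade-off described in the bullets: the stub wins on construction in this run, native pandas wins elsewhere.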

Choose the mode that best matches your scenario:

  • For CI and fast feedback loops, keep the default stub enabled.

  • For parity investigations or integration environments where pandas is already present, enable the native backend with MOCK_SPARK_PANDAS_MODE=native.

The benchmarking harness provides a quick way to validate regressions when upgrading either the stub or native dependency chain.
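A regression check between two benchmark runs could look like the sketch below. The `find_regressions` helper, the flat `{operation: mean_ms}` layout, and the 10% tolerance are assumptions; the layout of the file written by `--output` is not documented here, so the exported data would need to be mapped into this shape first. The "after upgrade" numbers are invented purely to exercise the function.

```python
def find_regressions(baseline, current, tolerance=0.10):
    """Return operations whose mean time grew by more than `tolerance`.

    Both arguments map operation name -> mean milliseconds (assumed layout).
    """
    regressions = {}
    for op, base_ms in baseline.items():
        new_ms = current.get(op)
        if new_ms is None:
            continue  # operation missing from the new run; nothing to compare
        if new_ms > base_ms * (1 + tolerance):
            regressions[op] = (base_ms, new_ms)
    return regressions


# Baseline: stub figures from the example output table.
baseline = {"create": 4.812, "to_dict": 2.337, "concat": 3.921, "iloc": 0.118}
# Hypothetical later run, made up for illustration only.
after_upgrade = {"create": 4.900, "to_dict": 2.900, "concat": 3.800, "iloc": 0.120}

for op, (old, new) in find_regressions(baseline, after_upgrade).items():
    print(f"{op}: {old:.3f} ms -> {new:.3f} ms")
```

Because single runs are noisy, a tolerance band and several samples per run tend to give fewer false alarms than comparing raw numbers directly.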