# Pandas Fallback


Sparkless ships with a lightweight stub that mimics the minimal slice of the
pandas API that appears in tests and documentation examples. The stub keeps the
default installation lean and avoids the heavy NumPy dependency tree. When
parity with native pandas is needed, you can opt in to the real implementation
and compare behaviour using the benchmarking harness described below.

## Installing the Native Backend

```bash
pip install ".[pandas]"   # quotes stop shells such as zsh from globbing the brackets
```

The optional extra declares only `pandas` and type stubs as direct
dependencies; NumPy is pulled in transitively, as a prebuilt wheel where one is
available for your platform.

## Switching Backends

Set the `MOCK_SPARK_PANDAS_MODE` environment variable before importing `pandas`:

| Value    | Behaviour                                                         |
|----------|-------------------------------------------------------------------|
| `stub`   | Always use the built-in shim (default)                            |
| `native` | Require real pandas; raise an error if it is missing              |
| `auto`   | Use native pandas when installed, otherwise fall back to the stub |

```bash
export MOCK_SPARK_PANDAS_MODE=native   # or stub / auto
```
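The three modes amount to a small decision table. The sketch below restates that table in Python to make the fallback rules explicit. It is illustrative only: `resolve_backend` is a hypothetical name, not part of Sparkless, and `ImportError` is an assumed choice, since this page does not say which exception `native` mode raises.

```python
def resolve_backend(mode: str, native_available: bool) -> str:
    """Return "native" or "stub" for a given mode (illustrative sketch only)."""
    if mode == "stub":
        # The shim is used even when real pandas is installed.
        return "stub"
    if mode == "native":
        if not native_available:
            # Assumption: the real error type may differ.
            raise ImportError(
                "MOCK_SPARK_PANDAS_MODE=native but pandas is not installed"
            )
        return "native"
    if mode == "auto":
        # Prefer real pandas, otherwise fall back to the shim.
        return "native" if native_available else "stub"
    raise ValueError(f"unknown MOCK_SPARK_PANDAS_MODE value: {mode!r}")
```

For example, `resolve_backend("auto", native_available=False)` returns `"stub"`, while `resolve_backend("native", native_available=False)` raises.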

You can also query the active backend at runtime:

```python
import pandas
print(pandas.get_backend())           # "native" or "stub"
```
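Because the variable has to be set before `pandas` is imported, a test harness that chooses the mode from Python needs to do so early. The helper below is a minimal sketch of that guard. The name `set_pandas_mode` and its `modules` parameter are hypothetical and not part of Sparkless; only the environment variable name and its three values come from this page.

```python
import os
import sys

VALID_MODES = {"stub", "native", "auto"}


def set_pandas_mode(mode, modules=None):
    """Set MOCK_SPARK_PANDAS_MODE, refusing if pandas was already imported."""
    # `modules` defaults to sys.modules; it is a parameter only to ease testing.
    modules = sys.modules if modules is None else modules
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    if "pandas" in modules:
        # Setting the variable now would be too late to affect the import.
        raise RuntimeError("pandas is already imported; set the mode earlier")
    os.environ["MOCK_SPARK_PANDAS_MODE"] = mode


# Intended usage, for example at the top of a conftest.py:
#   set_pandas_mode("auto")
#   import pandas
```

Raising instead of silently overwriting the variable makes import-order mistakes visible at the point where they happen.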

## Running the Benchmark Suite

The helper script compares core operations (`DataFrame` construction, `to_dict`,
`concat`, and basic indexing) between the stub and native backends.

```bash
python scripts/benchmark_pandas_fallback.py --rows 50000 --samples 7
```

Example output (M3 Pro laptop, macOS 14.6.1):

```
Backend    | Create (mean ms)   | to_dict (mean ms) | concat (mean ms)  | iloc (mean ms)
------------------------------------------------------------------------
stub       | 4.812              | 2.337             | 3.921             | 0.118
native     | 7.604              | 1.281             | 2.017             | 0.064
```
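To read the table as relative speedups, divide the stub timing by the native timing for each operation. The snippet below does this for the sample figures above; the dictionary keys and the `speedups` helper are illustrative names, not part of the benchmark script.

```python
# Mean timings in milliseconds, copied from the sample output above.
stub = {"create": 4.812, "to_dict": 2.337, "concat": 3.921, "iloc": 0.118}
native = {"create": 7.604, "to_dict": 1.281, "concat": 2.017, "iloc": 0.064}


def speedups(stub_ms, native_ms):
    """Return stub/native ratios; values above 1 mean native was faster."""
    return {op: stub_ms[op] / native_ms[op] for op in stub_ms}


ratios = speedups(stub, native)
for op, ratio in ratios.items():
    print(f"{op:8s} {ratio:.2f}x")
```

In this sample, native pandas is roughly 1.8x to 1.9x faster for `to_dict`, `concat`, and `iloc`, while the stub is about 1.6x faster at `DataFrame` construction (a ratio of 0.63).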

Export raw metrics for further analysis:

```bash
python scripts/benchmark_pandas_fallback.py --output benchmark.json
```

## Interpreting Results

- **Stub strengths:** zero external dependencies, predictable performance, and
  faster `DataFrame` construction in the sample run above; ideal for unit tests
  and CI pipelines without binary wheels.
- **Native strengths:** faster `to_dict`, `concat`, and `iloc` operations on
  larger datasets (see the 50,000-row sample above), plus full pandas feature
  coverage for interactive analysis or downstream tooling.

Choose the mode that best matches your scenario:

- For CI and fast feedback loops, keep the default stub enabled.
- For parity investigations or integration environments where pandas is already
  present, enable the native backend with `MOCK_SPARK_PANDAS_MODE=native`.

The benchmarking harness provides a quick way to validate regressions when
upgrading either the stub or native dependency chain.
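One way to use the harness for regression checks is to compare mean timings from two runs and flag operations that slowed down beyond a tolerance. The sketch below assumes the timings have already been loaded into plain dictionaries mapping operation names to mean milliseconds. That shape is an assumption and does not describe the schema of the exported JSON file, and the `current` figures are made up for illustration.

```python
def find_regressions(baseline_ms, current_ms, tolerance=0.10):
    """Return operations whose mean time grew by more than `tolerance`.

    `tolerance` is a fraction, so 0.10 means "more than 10% slower".
    """
    regressions = {}
    for op, base in baseline_ms.items():
        current = current_ms.get(op)
        if current is None or base <= 0:
            # Skip operations missing from the new run or with unusable baselines.
            continue
        change = (current - base) / base
        if change > tolerance:
            regressions[op] = change
    return regressions


# Baseline taken from the native row of the sample output above.
baseline = {"to_dict": 1.281, "concat": 2.017}
# Hypothetical figures for a later run, invented for this example.
current = {"to_dict": 1.300, "concat": 2.500}

for op, change in find_regressions(baseline, current).items():
    print(f"{op}: {change:+.1%} slower than baseline")
```

With these inputs only `concat` is reported, since `to_dict` moved by less than the 10% tolerance. Timings on shared CI runners tend to be noisy, so a generous tolerance is usually safer than a tight one.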