# Sparkless changes to work better with robin-sparkless

This document lists **concrete Sparkless-side changes** that improve Robin backend compatibility. It draws on [robin_sparkless_ownership_analysis.md](robin_sparkless_ownership_analysis.md), [robin_improvement_plan.md](robin_improvement_plan.md), and the worker-crash investigation.

---

## Already done

| Change | Notes |
|--------|--------|
| Filter: use method API (`.gt()`, `.lt()`, etc.) | `_simple_filter_to_robin()` uses robin Column methods instead of Python operators. |
| Filter: AND/OR | `ColumnOperation("&", ...)` and `ColumnOperation("\|", ...)` supported recursively. |
| Join: on= expression | `_join_on_to_column_names()` extracts keys from `col == col`; join supported. |
| Join: semi/anti | `can_handle_join` accepts left_semi, left_anti; materialize passes through. |
| Robin storage `db_path` | Parquet teardown ERRORs fixed. |
| Fewer workers default in Robin mode | `run_all_tests.sh` uses 4 workers when `BACKEND=robin` to reduce worker crashes. |
| Skip `test_regexp_extract_all_*` in Robin | Test skipped; robin-sparkless doesn’t support select with that expression (issue #176). |
| Polars: regex in closure | `regexp_extract_all` compiles regex inside closure to avoid hang under xdist. |
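
The filter rows above (method API, recursive AND/OR, and `isin` expanded to an OR of equality checks) can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: `Col`, `Op`, `RecordingColumn`, and `translate_filter` are not the Sparkless or robin-sparkless classes, the `eq` method name is an assumption (the source only names `.gt()` and `.lt()`), and combining sub-filters with `&` / `|` on the stand-in column is also assumed.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Col:
    """Hypothetical stand-in for a Sparkless column reference."""
    name: str


@dataclass(frozen=True)
class Op:
    """Hypothetical stand-in for ColumnOperation(operator, left, right)."""
    operator: str
    left: Any
    right: Any


class RecordingColumn:
    """Stand-in for a backend Column; it records calls as text."""

    def __init__(self, text: str) -> None:
        self.text = text

    def _call(self, method: str, other: Any) -> "RecordingColumn":
        return RecordingColumn(f"{self.text}.{method}({other!r})")

    def gt(self, other: Any) -> "RecordingColumn":
        return self._call("gt", other)

    def lt(self, other: Any) -> "RecordingColumn":
        return self._call("lt", other)

    def eq(self, other: Any) -> "RecordingColumn":  # assumed method name
        return self._call("eq", other)

    def __and__(self, other: "RecordingColumn") -> "RecordingColumn":
        return RecordingColumn(f"({self.text} & {other.text})")

    def __or__(self, other: "RecordingColumn") -> "RecordingColumn":
        return RecordingColumn(f"({self.text} | {other.text})")


# Python operator symbol -> method name on the backend column.
_METHODS = {">": "gt", "<": "lt", "==": "eq"}


def translate_filter(expr: Op) -> RecordingColumn:
    """Translate a filter tree into method calls on the stand-in column."""
    if expr.operator in ("&", "|"):
        # AND/OR: translate both sides recursively, then combine.
        left = translate_filter(expr.left)
        right = translate_filter(expr.right)
        return left & right if expr.operator == "&" else left | right
    if not isinstance(expr.left, Col):
        raise ValueError("only 'col <op> literal' filters are handled here")
    column = RecordingColumn(f'col("{expr.left.name}")')
    if expr.operator == "isin":
        values = list(expr.right)
        if not values:
            raise ValueError("isin needs at least one value")
        # isin becomes an OR of equality checks.
        result = column.eq(values[0])
        for value in values[1:]:
            result = result | column.eq(value)
        return result
    method = _METHODS.get(expr.operator)
    if method is None:
        raise ValueError(f"unsupported operator: {expr.operator}")
    return getattr(column, method)(expr.right)
```

For example, `translate_filter(Op("&", Op(">", Col("age"), 18), Op("isin", Col("city"), ["NY", "LA"])))` walks both branches and yields a single combined column whose `text` shows the chained method calls.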

---

## Recommended Sparkless changes (by priority)

### High priority

1. **Add groupBy + agg to Robin materializer**  
   - **Status:** Not implemented, and not needed in the current design. `groupBy` never appears in a DataFrame’s `_operations_queue`: `GroupedData.agg()` materializes the parent DataFrame (via Robin when applicable) and then runs the aggregation in Python. A typical `df.filter(...).groupBy("a").count()` therefore already works with Robin: Robin materializes the filter, then Sparkless performs the group and count. Adding groupBy to the Robin materializer would matter only if a future code path put groupBy in the queue.  
   - **What (if needed):** Add `"groupBy"` to `SUPPORTED_OPERATIONS` and, in the materialize loop, translate a queued groupBy to robin_sparkless `df.group_by([...]).agg(...)`.

2. **Document Robin-supported operations and limits**  
   - **Status:** Done. [backend_selection.md](backend_selection.md) has a “Robin backend” section; [robin_compatibility_recommendations.md](robin_compatibility_recommendations.md) lists supported/unsupported ops and the worker-crash workaround.
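
The two-stage flow described in item 1 (the backend materializes the parent rows, then the aggregation runs in Python) might look roughly like the sketch below. It is only an illustration: `materialize_filter` and `group_count` are hypothetical names, rows are modeled as plain dicts, and the filter step is a pure-Python stand-in for whatever the Robin backend actually does.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Sequence

Row = Dict[str, Any]


def materialize_filter(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Stand-in for the backend step that materializes the parent DataFrame."""
    return [row for row in rows if predicate(row)]


def group_count(rows: List[Row], keys: Sequence[str]) -> List[Row]:
    """Group already-materialized rows by `keys` and count each group."""
    counts = Counter(tuple(row[key] for key in keys) for row in rows)
    # Counter keeps first-seen order, so groups come out in input order.
    return [dict(zip(keys, values), count=n) for values, n in counts.items()]


rows = [
    {"a": 1, "x": 5},
    {"a": 1, "x": 7},
    {"a": 2, "x": 9},
    {"a": 2, "x": 1},
]
# Stage 1: the filter is materialized; stage 2: group + count run in Python.
parent = materialize_filter(rows, lambda row: row["x"] > 3)
result = group_count(parent, ["a"])
```

Because the grouping only ever sees materialized rows, the backend never needs to understand groupBy in this flow.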

### Medium priority

3. **Join result: 0 rows**  
   - **Status:** Not changed. Conversion from `collect()` to `List[Row]` (and `_right` merge) is in place; if Robin still returns 0 rows for some joins, the cause is likely in robin-sparkless or in the join keys we pass. Debug with a minimal repro and report upstream if needed.  
   - **Files:** `sparkless/backend/robin/materializer.py` (collect → merged → Row).

4. **Extend withColumn expression translation**  
   - **Status:** Done. `_expression_to_robin()` now supports: **alias** (e.g. `col.alias("x")`), **commutative ops** with `Literal` on the left (e.g. `lit(2) + col("x")`), and the same comparison/arithmetic ops as before.  
   - **Files:** `sparkless/backend/robin/materializer.py`.
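
As an illustration of item 4, the sketch below shows one way a translator could handle an alias and a literal on the left of a commutative operator by swapping the operands. It is not the real `_expression_to_robin()`: the node classes, the `translate` function, the text output, and the choice of `+` and `*` as the commutative set are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: Any


@dataclass(frozen=True)
class BinOp:
    operator: str
    left: Any
    right: Any


@dataclass(frozen=True)
class Alias:
    child: Any
    name: str


# Assumption: only these operators are treated as safe to reorder.
COMMUTATIVE = frozenset({"+", "*"})


def translate(expr: Any) -> Optional[str]:
    """Render `expr` as backend-style text, or None if it is not translatable."""
    if isinstance(expr, Col):
        return f'col("{expr.name}")'
    if isinstance(expr, Alias):
        inner = translate(expr.child)
        return None if inner is None else f'{inner}.alias("{expr.name}")'
    if isinstance(expr, BinOp):
        left, right = expr.left, expr.right
        # Literal on the left: swap operands, but only when order is irrelevant.
        if isinstance(left, Lit) and isinstance(right, Col):
            if expr.operator not in COMMUTATIVE:
                return None
            left, right = right, left
        if isinstance(left, Col) and isinstance(right, Lit):
            return f"({translate(left)} {expr.operator} {right.value!r})"
    return None
```

In this sketch, `lit(2) + col("x")` is modeled as `BinOp("+", Lit(2), Col("x"))` and is rewritten with the column first, while `BinOp("-", Lit(2), Col("x"))` returns `None` because subtraction cannot be reordered.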

### Low priority / optional

5. **Robin worker-crash note in backend_selection**  
   - **Why:** Users who see “node down” or INTERNALERROR in Robin mode need a quick pointer.  
   - **What:** In backend_selection “Troubleshooting” or “Robin backend”, add one line: “If you see worker crashes or INTERNALERROR when running with many workers, use fewer workers (e.g. `-n 4`) or serial (`-n 0`); see [robin_mode_worker_crash_investigation.md](robin_mode_worker_crash_investigation.md).”

6. **Pure Robin mode (no Polars fallback)**  
   - **Status:** Done. When Robin is selected, Sparkless runs in pure Robin mode only. Unsupported operations raise `SparkUnsupportedOperationError`. Use the Polars backend for full operation coverage.
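
The pure Robin behavior in item 6 can be pictured as a check that runs before materialization. This is a hedged sketch only: the exception class is a local stand-in for the Sparkless one, the operation names are taken from the quick-reference table in this document, and the `(name, payload)` queue shape and the `check_queue` helper are assumptions.

```python
from typing import Any, Iterable, Tuple


class SparkUnsupportedOperationError(Exception):
    """Local stand-in for the Sparkless exception of the same name."""


# Operation names as listed in this document's quick-reference table.
SUPPORTED_OPERATIONS = frozenset(
    {
        "filter",
        "select",
        "limit",
        "orderBy",
        "withColumn",
        "join",
        "union",
        "distinct",
        "drop",
    }
)


def check_queue(queue: Iterable[Tuple[str, Any]]) -> None:
    """Raise if any queued operation is outside the supported set (no fallback)."""
    unsupported = [name for name, _payload in queue if name not in SUPPORTED_OPERATIONS]
    if unsupported:
        raise SparkUnsupportedOperationError(
            "Robin backend does not support: " + ", ".join(unsupported)
        )
```

With this shape, `check_queue([("filter", None), ("limit", 5)])` returns silently, while a queue containing `("groupBy", ["a"])` raises instead of falling back to another backend.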

---

## What to leave to robin-sparkless

- **select() / with_column() with Column expressions** – **Resolved in robin-sparkless 0.4.0** (issue #182). Sparkless requires 0.4.0+ and uses these APIs.
- **Filter Column–Column, filter(bool), lit(date/datetime), Window API** – **Resolved in 0.4.0** (issues #184–#187).
- **select() with some expressions** (e.g. `regexp_extract_all`) – robin-sparkless issue #176 if still needed.  
- **join(on="col")** – a single column name passed as a string is tracked in robin-sparkless issue #175; Sparkless passes a list instead.  
- **Worker crashes under pytest-xdist** – robin-sparkless issue #178 (fork-safety / stability).

No Sparkless code change is required for the remaining items; we work around or skip where needed and track upstream fixes.

---

## Quick reference: current Robin materializer support

| Operation | Supported | Notes |
|-----------|-----------|--------|
| filter | Yes | Simple `col <op> literal`; AND/OR; `isin` (translated to an OR of equality checks). Uses the method API (`.gt()`, `.lt()`, …). |
| select | Yes (0.4.0+) | Column names (strings) and Column expressions. |
| limit | Yes | Non-negative int. |
| orderBy | Yes | Column names + optional ascending. |
| withColumn | Yes (0.4.0+) | Expressions translatable by `_expression_to_robin`; robin-sparkless 0.4.0+ accepts Column in `with_column()`. |
| join | Yes | `on=` as a list of column names, or keys extracted from `col == col`; `how=` inner/left/right/outer/semi/anti. |
| union | Yes | |
| distinct | Yes | Deduplicate rows. |
| drop | Yes | Drop column(s) by name. |
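| groupBy + agg | **No** | Not in the materializer. `GroupedData.agg()` materializes the parent DataFrame (via Robin when applicable) and aggregates in Python; see high-priority item 1. |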
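
The join row mentions keys extracted from `col == col`. A minimal sketch of that kind of extraction follows; it is not the real `_join_on_to_column_names()`. The `Col`, `Eq`, and `And` classes and the `join_keys` function are hypothetical, and returning `None` when the two sides name different columns is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Eq:
    left: Any
    right: Any


@dataclass(frozen=True)
class And:
    left: Any
    right: Any


def join_keys(on: Any) -> Optional[List[str]]:
    """Return join key names for `on`, or None if they cannot be extracted."""
    if isinstance(on, str):
        # Always hand the backend a list, even for a single column name.
        return [on]
    if isinstance(on, (list, tuple)):
        return list(on) if all(isinstance(key, str) for key in on) else None
    if isinstance(on, Eq):
        left, right = on.left, on.right
        if isinstance(left, Col) and isinstance(right, Col) and left.name == right.name:
            return [left.name]
        return None
    if isinstance(on, And):
        # Several equality conditions: collect the keys from both sides.
        left_keys, right_keys = join_keys(on.left), join_keys(on.right)
        if left_keys is None or right_keys is None:
            return None
        return left_keys + right_keys
    return None
```

Returning `None` rather than raising lets a caller decide whether an unextractable `on=` expression means the join is unsupported.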