Sparkless changes to work better with robin-sparkless

This document lists concrete Sparkless-side changes that improve Robin backend compatibility. It draws on robin_sparkless_ownership_analysis.md, robin_improvement_plan.md, and the worker-crash investigation.

Already done

Change	Notes
Filter: use method API (`.gt()`, `.lt()`, etc.)	`_simple_filter_to_robin()` uses robin Column methods instead of Python operators.
Filter: AND/OR	`ColumnOperation("&", ...)` and `ColumnOperation("|", ...)` supported recursively.
Join: on= expression	`_join_on_to_column_names()` extracts keys from `col == col`; join supported.
Join: semi/anti	`can_handle_join` accepts left_semi, left_anti; materialize passes through.
Robin storage `db_path`	Parquet teardown ERRORs fixed.
Fewer workers default in Robin mode	`run_all_tests.sh` uses 4 workers when `BACKEND=robin` to reduce worker crashes.
Skip `test_regexp_extract_all_*` in Robin	Test skipped; robin-sparkless doesn’t support select with that expression (issue #176).
Polars: regex in closure	`regexp_extract_all` compiles regex inside closure to avoid hang under xdist.
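
The two filter rows above can be illustrated with a minimal sketch. `Op` and `FakeRobinColumn` are hypothetical stand-ins, not the Sparkless or robin-sparkless classes. The method names beyond `.gt()` and `.lt()` (`ge`, `le`, `eq`, `ne`) and the use of `&` / `|` to combine translated sub-expressions are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Op:
    """Hypothetical operation node: (left, operator, right)."""
    left: Any
    op: str
    right: Any


# Assumed mapping from Python-style operators to Column method names.
_METHODS = {">": "gt", "<": "lt", ">=": "ge", "<=": "le", "==": "eq", "!=": "ne"}


class FakeRobinColumn:
    """Stand-in column that records calls as text so the result can be inspected."""

    def __init__(self, text: str):
        self.text = text

    def __getattr__(self, name: str):
        # Only reached for attributes not found normally (i.e. not `text`).
        if name in _METHODS.values():
            return lambda other: FakeRobinColumn(f"{self.text}.{name}({other!r})")
        raise AttributeError(name)

    def __and__(self, other: "FakeRobinColumn") -> "FakeRobinColumn":
        return FakeRobinColumn(f"({self.text} & {other.text})")

    def __or__(self, other: "FakeRobinColumn") -> "FakeRobinColumn":
        return FakeRobinColumn(f"({self.text} | {other.text})")


def filter_to_robin(expr: Op) -> FakeRobinColumn:
    """Translate a filter tree, recursing through AND/OR nodes."""
    if expr.op in ("&", "|"):
        left = filter_to_robin(expr.left)
        right = filter_to_robin(expr.right)
        return left & right if expr.op == "&" else left | right
    # Leaf: column name on the left, literal on the right; use the method API.
    column = FakeRobinColumn(f"col({expr.left!r})")
    return getattr(column, _METHODS[expr.op])(expr.right)
```

For example, `filter_to_robin(Op(Op("age", ">", 18), "&", Op("score", "<", 50))).text` yields `(col('age').gt(18) & col('score').lt(50))`.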

Add groupBy + agg to Robin materializer
- Status: Not implemented. In the current design, groupBy never appears in a DataFrame’s _operations_queue; GroupedData.agg() materializes the parent DataFrame (via Robin when applicable) and then runs aggregation in Python. So typical df.filter(...).groupBy("a").count() already works with Robin (filter is materialized by Robin, then group+count run in Sparkless). Adding groupBy to the Robin materializer would only matter if a future path put groupBy in the queue.
- What (if needed): Add "groupBy" to SUPPORTED_OPERATIONS and in the materialize loop translate to robin_sparkless df.group_by([...]).agg(...) when the queue contains groupBy.
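
The two-phase flow described in the status note can be sketched in plain Python. This is only an illustration of the ordering (materialize first, aggregate afterwards); the function name, the predicate argument, and the row shape (a list of dicts) are hypothetical and do not mirror the Sparkless internals.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, List

Row = Dict[str, object]


def materialize_then_count(
    rows: List[Row],
    predicate: Callable[[Row], bool],
    key: str,
) -> Dict[Hashable, int]:
    # Phase 1: stands in for Robin materializing the queued filter.
    materialized = [row for row in rows if predicate(row)]
    # Phase 2: stands in for Sparkless running group + count in Python.
    # groupBy never entered the queue, so Robin never sees it.
    return dict(Counter(row[key] for row in materialized))
```

With rows `[{"a": "x", "v": 1}, {"a": "x", "v": 5}, {"a": "y", "v": 7}, {"a": "y", "v": 0}]`, a predicate keeping `v > 0`, and key `"a"`, the result is `{"x": 2, "y": 1}`.
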

Document Robin-supported operations and limits
- Status: Done. backend_selection.md has a “Robin backend” section; robin_compatibility_recommendations.md lists supported/unsupported ops and the worker-crash workaround.

Join result: 0 rows
- Status: Not changed. Conversion from collect() to List[Row] (and _right merge) is in place; if Robin still returns 0 rows for some joins, the cause is likely in robin-sparkless or in the join keys we pass. Debug with a minimal repro and report upstream if needed.
- Files: sparkless/backend/robin/materializer.py (collect → merged → Row).
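
When building a minimal repro, it can help to know how many rows an inner join should produce before blaming either side. The helper below is a hypothetical plain-Python reference count, not part of Sparkless or robin-sparkless. It assumes rows are dicts and follows the usual SQL rule that null keys never match.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, object]


def expected_inner_join_count(left: List[Row], right: List[Row], keys: Sequence[str]) -> int:
    """Return the number of rows an inner equi-join on `keys` should produce."""

    def key_of(row: Row) -> Optional[Tuple[object, ...]]:
        values = tuple(row[k] for k in keys)
        # A null in any key column means the row cannot match anything.
        return None if any(v is None for v in values) else values

    right_counts = Counter(k for k in map(key_of, right) if k is not None)
    # Each left row contributes one output row per matching right row.
    return sum(right_counts[k] for k in map(key_of, left) if k is not None)
```

If this reference count is also 0, the join keys being passed are the likely cause (for example, a string `"1"` on one side and an integer `1` on the other). If it is non-zero while Robin returns 0 rows, the repro is a reasonable candidate for an upstream report.
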

Extend withColumn expression translation
- Status: Done. _expression_to_robin() now supports: alias (e.g. col.alias("x")), commutative ops with Literal on the left (e.g. lit(2) + col("x")), and the same comparison/arithmetic ops as before.
- Files: sparkless/backend/robin/materializer.py.
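
A rough sketch of the two additions (alias, and a literal on the left of a commutative op) follows. The node classes and `to_robin_text` are hypothetical and render text rather than real robin-sparkless columns. Swapping the operands so the column comes first is an assumption about how the literal-on-left case could be handled, not a description of `_expression_to_robin()` itself.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Col:
    name: str


@dataclass
class Lit:
    value: object


@dataclass
class BinOp:
    left: "Expr"
    op: str
    right: "Expr"


@dataclass
class Alias:
    child: "Expr"
    name: str


Expr = Union[Col, Lit, BinOp, Alias]
COMMUTATIVE = {"+", "*"}  # operand order does not change the result


def to_robin_text(expr: Expr) -> str:
    """Render a hypothetical expression tree as robin-style text."""
    if isinstance(expr, Col):
        return f'col("{expr.name}")'
    if isinstance(expr, Lit):
        return f"lit({expr.value!r})"
    if isinstance(expr, Alias):
        return f'{to_robin_text(expr.child)}.alias("{expr.name}")'
    if isinstance(expr, BinOp):
        left, right = expr.left, expr.right
        if isinstance(left, Lit) and not isinstance(right, Lit):
            if expr.op not in COMMUTATIVE:
                raise ValueError(f"literal on the left of {expr.op!r} is not handled in this sketch")
            left, right = right, left  # put the column first
        return f"({to_robin_text(left)} {expr.op} {to_robin_text(right)})"
    raise TypeError(f"unsupported expression: {expr!r}")
```

For example, `to_robin_text(Alias(BinOp(Lit(2), "+", Col("x")), "y"))` returns `(col("x") + lit(2)).alias("y")`.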

Robin worker-crash note in backend_selection
- Why: Users who see “node down” or INTERNALERROR in Robin mode need a quick pointer.
- What: In backend_selection “Troubleshooting” or “Robin backend”, add one line: “If you see worker crashes or INTERNALERROR when running with many workers, use fewer workers (e.g. -n 4) or serial (-n 0); see robin_mode_worker_crash_investigation.md.”

Pure Robin mode (no Polars fallback)
- Status: Done. When Robin is selected, Sparkless runs in pure Robin mode only. Unsupported operations raise SparkUnsupportedOperationError. Use the Polars backend for full operation coverage.
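
The fail-fast behaviour can be pictured as a check over the queued operations. In this sketch the exception class is a local stand-in with the same name as the Sparkless one, the operation set is copied from the supported-operations table in this document, and the `(name, payload)` queue shape is an assumption.

```python
from typing import Any, Iterable, Tuple


class SparkUnsupportedOperationError(Exception):
    """Local stand-in for the Sparkless exception of the same name."""


# Operations marked "Yes" in the supported-operations table.
SUPPORTED_OPERATIONS = frozenset(
    {"filter", "select", "limit", "orderBy", "withColumn", "join", "union", "distinct", "drop"}
)


def ensure_supported(operations_queue: Iterable[Tuple[str, Any]]) -> None:
    """Raise instead of falling back when the queue holds an unsupported op."""
    unsupported = sorted(
        {name for name, _payload in operations_queue if name not in SUPPORTED_OPERATIONS}
    )
    if unsupported:
        raise SparkUnsupportedOperationError(
            "Robin backend cannot run: " + ", ".join(unsupported)
            + "; use the Polars backend for full operation coverage."
        )
```

Calling `ensure_supported([("filter", None), ("limit", 5)])` returns without error, while a queue containing a name outside the set raises immediately rather than silently switching backends.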

select() / with_column() with Column expressions – Resolved in robin-sparkless 0.4.0 (issue #182). Sparkless requires 0.4.0+ and uses these APIs.
Filter Column–Column, filter(bool), lit(date/datetime), Window API – Resolved in 0.4.0 (issues #184–#187).
select() with some expressions (e.g. regexp_extract_all) – robin-sparkless issue #176 if still needed.
join(on="col") – single-column string form; we pass a list instead; robin-sparkless issue #175.
Worker crashes under pytest-xdist – robin-sparkless issue #178 (fork-safety / stability).

No Sparkless code change is required for the remaining items; we work around or skip where needed and track upstream fixes.
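
As one example of such a workaround, the `join(on="col")` case can be handled by normalizing `on` to a list of column names before calling Robin. The sketch below is hypothetical: `ColRef` and `Eq` are stand-ins for Sparkless column expressions, and returning `None` when the two sides of `col == col` have different names is an assumption, not documented behaviour of `_join_on_to_column_names()`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColRef:
    name: str


@dataclass
class Eq:
    """Stand-in for a `left == right` column comparison."""
    left: ColRef
    right: ColRef


def join_on_to_names(on: object) -> Optional[List[str]]:
    """Return join keys as a list of names, or None if not translatable."""
    if isinstance(on, str):
        return [on]  # single-column string becomes a one-element list
    if isinstance(on, Eq):
        # Only same-named keys are handled in this sketch.
        return [on.left.name] if on.left.name == on.right.name else None
    if isinstance(on, (list, tuple)) and all(isinstance(item, str) for item in on):
        return list(on)
    return None
```

Here `join_on_to_names("id")` and `join_on_to_names(Eq(ColRef("id"), ColRef("id")))` both return `["id"]`.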

Operation	Supported	Notes
filter	Yes	Simple col op literal; AND/OR; isin (translated to OR of equality). Method API (.gt, .lt, …).
select	Yes (0.4.0+)	Column names (strings) and Column expressions.
limit	Yes	Non-negative int.
orderBy	Yes	Column names + optional ascending.
withColumn	Yes (0.4.0+)	Expressions translatable by `_expression_to_robin`; robin-sparkless 0.4.0+ accepts Column in `with_column()`.
join	Yes	on= list or extracted from col==col; how= inner/left/right/outer/semi/anti.
union	Yes
distinct	Yes	Deduplicate rows.
drop	Yes	Drop column(s) by name.
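groupBy + agg	No	Not translated by the Robin materializer. GroupedData.agg() materializes the parent DataFrame (via Robin when applicable) and aggregates in Python, so materializer support only matters if groupBy ever enters the operations queue.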
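
The `isin` note in the filter row (translated to an OR of equality) can be sketched as a small tree rewrite. The tuple representation `(operator, left, right)` and the function name are hypothetical, and raising on an empty value list is an assumption of this sketch rather than documented Sparkless behaviour.

```python
from functools import reduce
from typing import Any, Sequence, Tuple


def isin_to_or(column: str, values: Sequence[Any]) -> Tuple[Any, ...]:
    """Rewrite `column.isin(values)` as a left-nested OR of equality tests."""
    if not values:
        raise ValueError("isin with no values has no OR-of-equality form")
    terms = [("==", column, value) for value in values]
    # Fold the equality terms together: ((t0 | t1) | t2) | ...
    return reduce(lambda acc, term: ("|", acc, term), terms)
```

For example, `isin_to_or("a", [1, 2])` returns `("|", ("==", "a", 1), ("==", "a", 2))`, which a recursive AND/OR filter translation can then process like any other OR node.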