Sparkless changes to work better with robin-sparkless
This document lists concrete Sparkless-side changes that improve Robin backend compatibility. It draws on robin_sparkless_ownership_analysis.md, robin_improvement_plan.md, and the worker-crash investigation.
Already done
| Change | Notes |
|---|---|
| Filter: use method API ( | |
| Filter: AND/OR | |
| Join: on= expression | |
| Join: semi/anti | |
| Robin storage | Parquet teardown ERRORs fixed. |
| Fewer workers default in Robin mode | |
| Skip | Test skipped; robin-sparkless doesn’t support select with that expression (issue #176). |
| Polars: regex in closure | |
Recommended Sparkless changes (by priority)
High priority
Add groupBy + agg to Robin materializer
Status: Not implemented. In the current design, `groupBy` never appears in a DataFrame’s `_operations_queue`; `GroupedData.agg()` materializes the parent DataFrame (via Robin when applicable) and then runs the aggregation in Python. So a typical `df.filter(...).groupBy("a").count()` already works with Robin (the filter is materialized by Robin, then group+count run in Sparkless). Adding groupBy to the Robin materializer would only matter if a future path put groupBy in the queue.
What (if needed): Add `"groupBy"` to `SUPPORTED_OPERATIONS` and, in the materialize loop, translate it to robin_sparkless `df.group_by([...]).agg(...)` when the queue contains groupBy.
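The "What (if needed)" step could look roughly like the sketch below. It is only an illustration: `FakeRobinFrame` and `FakeGrouped` are stand-ins for the robin_sparkless objects, and the queue payload shape (key names plus aggregation expressions) is an assumption, not the actual Sparkless queue format.

```python
from typing import Any, List, Tuple

# Hypothetical subset of operation names; "groupBy" is the proposed addition.
SUPPORTED_OPERATIONS = {"filter", "select", "groupBy"}


class FakeGrouped:
    """Stand-in for the object returned by group_by(); records the call."""

    def __init__(self, keys: List[str]) -> None:
        self.keys = keys

    def agg(self, *exprs: Any) -> Tuple[str, List[str], Tuple[Any, ...]]:
        return ("grouped", self.keys, exprs)


class FakeRobinFrame:
    """Stand-in for a robin_sparkless DataFrame, for illustration only."""

    def group_by(self, keys: List[str]) -> FakeGrouped:
        return FakeGrouped(keys)


def materialize(df: Any, operations: List[Tuple[str, Any]]) -> Any:
    """Apply queued operations in order; only the groupBy branch is sketched."""
    for name, payload in operations:
        if name not in SUPPORTED_OPERATIONS:
            raise ValueError(f"unsupported operation: {name}")
        if name == "groupBy":
            # Assumed payload shape: (key column names, aggregation expressions).
            keys, agg_exprs = payload
            df = df.group_by(list(keys)).agg(*agg_exprs)
        else:
            # The other operations are outside the scope of this sketch.
            raise NotImplementedError(name)
    return df


result = materialize(FakeRobinFrame(), [("groupBy", (["a"], ["count"]))])
```

With the stand-in frame, `result` simply records which keys and aggregations would have been forwarded to Robin.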
Document Robin-supported operations and limits
Status: Done. backend_selection.md has a “Robin backend” section; robin_compatibility_recommendations.md lists supported/unsupported ops and the worker-crash workaround.
Medium priority
Join result: 0 rows
Status: Not changed. Conversion from `collect()` to `List[Row]` (and `_right` merge) is in place; if Robin still returns 0 rows for some joins, the cause is likely in robin-sparkless or in the join keys we pass. Debug with a minimal repro and report upstream if needed.
Files: `sparkless/backend/robin/materializer.py` (collect → merged → Row).
Extend withColumn expression translation
Status: Done. `_expression_to_robin()` now supports: alias (e.g. `col.alias("x")`), commutative ops with `Literal` on the left (e.g. `lit(2) + col("x")`), and the same comparison/arithmetic ops as before.
Files: `sparkless/backend/robin/materializer.py`.
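One way to handle a literal on the left of a commutative operator is to swap the operands before translating. The sketch below illustrates that idea on a tiny hypothetical expression tree (`Col`, `Lit`, `BinOp` are invented for the example); it is not the actual `_expression_to_robin()` implementation, which may handle the case differently.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: object


@dataclass(frozen=True)
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Col, Lit, BinOp]

# Operators whose operands can be swapped without changing the result.
COMMUTATIVE = {"+", "*"}


def normalize(expr: Expr) -> Expr:
    """Move a left-hand literal to the right for commutative operators."""
    if isinstance(expr, BinOp):
        left, right = normalize(expr.left), normalize(expr.right)
        if expr.op in COMMUTATIVE and isinstance(left, Lit) and not isinstance(right, Lit):
            left, right = right, left
        return BinOp(expr.op, left, right)
    return expr


swapped = normalize(BinOp("+", Lit(2), Col("x")))  # literal moves to the right
kept = normalize(BinOp("-", Lit(2), Col("x")))     # subtraction is left untouched
```

Non-commutative operators such as `-` are deliberately left alone, since swapping their operands would change the result.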
Low priority / optional
Robin worker-crash note in backend_selection
Why: Users who see “node down” or INTERNALERROR in Robin mode need a quick pointer.
What: In backend_selection “Troubleshooting” or “Robin backend”, add one line: “If you see worker crashes or INTERNALERROR when running with many workers, use fewer workers (e.g. `-n 4`) or serial (`-n 0`); see robin_mode_worker_crash_investigation.md.”
Pure Robin mode (no Polars fallback)
Status: Done. When Robin is selected, Sparkless runs in pure Robin mode only. Unsupported operations raise `SparkUnsupportedOperationError`. Use the Polars backend for full operation coverage.
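The fail-fast behaviour of pure Robin mode can be pictured as a check of the queued operation names against the supported set. This is a minimal sketch under assumptions: the exception class here is a local stand-in with the same name as the real one, `check_operations` is a hypothetical helper, and the supported names are taken from the quick reference table in this document.

```python
from typing import FrozenSet, Iterable


class SparkUnsupportedOperationError(Exception):
    """Local stand-in for the Sparkless exception of the same name."""


# Operation names listed as supported in the quick reference table.
ROBIN_SUPPORTED: FrozenSet[str] = frozenset(
    {"filter", "select", "limit", "orderBy", "withColumn",
     "join", "union", "distinct", "drop"}
)


def check_operations(names: Iterable[str],
                     supported: FrozenSet[str] = ROBIN_SUPPORTED) -> None:
    """Raise if any queued operation is unsupported; there is no fallback."""
    unsupported = [n for n in names if n not in supported]
    if unsupported:
        raise SparkUnsupportedOperationError(
            f"Robin backend does not support: {', '.join(unsupported)}"
        )


check_operations(["filter", "limit"])  # passes silently
```

A queue containing an operation outside the set (for example `groupBy`, if it were ever queued) would raise instead of silently falling back to Polars.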
What to leave to robin-sparkless
select() / with_column() with Column expressions – Resolved in robin-sparkless 0.4.0 (issue #182). Sparkless requires 0.4.0+ and uses these APIs.
Filter Column–Column, filter(bool), lit(date/datetime), Window API – Resolved in 0.4.0 (issues #184–#187).
select() with some expressions (e.g. `regexp_extract_all`) – robin-sparkless issue #176 if still needed.
`join(on="col")` – single-column string; we pass list; robin-sparkless issue #175.
Worker crashes under pytest-xdist – robin-sparkless issue #178 (fork-safety / stability).
No Sparkless code change is required for the remaining items; we work around or skip where needed and track upstream fixes.
Quick reference: current Robin materializer support
| Operation | Supported | Notes |
|---|---|---|
| filter | Yes | Simple col op literal; AND/OR; isin (translated to OR of equality). Method API (.gt, .lt, …). |
| select | Yes (0.4.0+) | Column names (strings) and Column expressions. |
| limit | Yes | Non-negative int. |
| orderBy | Yes | Column names + optional ascending. |
| withColumn | Yes (0.4.0+) | Expressions translatable by |
| join | Yes | on= list or extracted from col==col; how= inner/left/right/outer/semi/anti. |
| union | Yes | |
| distinct | Yes | Deduplicate rows. |
| drop | Yes | Drop column(s) by name. |
| groupBy + agg | No | Recommended to add (high priority). |
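The filter row mentions that isin is translated to an OR of equality checks. As an illustration of that rewrite only, the sketch below builds the OR chain over a hypothetical predicate tree (`Eq`, `Or`, and `isin_to_or` are invented names, not the materializer's actual types).

```python
from dataclasses import dataclass
from functools import reduce
from typing import Sequence, Union


@dataclass(frozen=True)
class Eq:
    column: str
    value: object


@dataclass(frozen=True)
class Or:
    left: "Pred"
    right: "Pred"


Pred = Union[Eq, Or]


def isin_to_or(column: str, values: Sequence[object]) -> Pred:
    """Rewrite column.isin(values) as a left-nested chain of equality checks."""
    if not values:
        # How an empty isin should behave is left open in this sketch.
        raise ValueError("isin needs at least one value in this sketch")
    terms = [Eq(column, v) for v in values]
    return reduce(Or, terms)


pred = isin_to_or("a", [1, 2, 3])
```

For three values this yields `Or(Or(Eq("a", 1), Eq("a", 2)), Eq("a", 3))`; a single value collapses to a plain `Eq`.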