Sparkless changes to work better with robin-sparkless

This document lists concrete Sparkless-side changes that improve Robin backend compatibility. It draws on robin_sparkless_ownership_analysis.md, robin_improvement_plan.md, and the worker-crash investigation.


Already done

| Change | Notes |
| --- | --- |
| Filter: use method API (.gt(), .lt(), etc.) | _simple_filter_to_robin() uses robin Column methods instead of Python operators. |
| Filter: AND/OR | ColumnOperation("&", ...) and ColumnOperation("\|", ...) supported recursively. |
| Join: on= expression | _join_on_to_column_names() extracts keys from col == col; join supported. |
| Join: semi/anti | can_handle_join accepts left_semi, left_anti; materialize passes through. |
| Robin storage db_path | Parquet teardown ERRORs fixed. |
| Fewer workers default in Robin mode | run_all_tests.sh uses 4 workers when BACKEND=robin to reduce worker crashes. |
| Skip test_regexp_extract_all_* in Robin | Test skipped; robin-sparkless doesn’t support select with that expression (issue #176). |
| Polars: regex in closure | regexp_extract_all compiles regex inside closure to avoid hang under xdist. |
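
The regex-in-closure change can be illustrated without Polars. The following is a minimal sketch, not the Sparkless implementation: make_extract_all is a hypothetical helper name, and the group parameter is an assumption added for illustration. The closure captures only the pattern string and compiles it when called, so no compiled pattern object created in the parent process is shared with the closure. Python's re module caches compiled patterns, so compiling on each call stays cheap.

```python
import re
from typing import Callable, List, Optional


def make_extract_all(
    pattern: str, group: int = 0
) -> Callable[[Optional[str]], Optional[List[str]]]:
    """Build a per-value extractor (hypothetical helper, for illustration only)."""

    def extract_all(value: Optional[str]) -> Optional[List[str]]:
        if value is None:
            return None  # propagate nulls unchanged
        # Compile inside the closure: only the pattern string is captured,
        # never a compiled pattern object built outside the closure.
        compiled = re.compile(pattern)
        return [match.group(group) for match in compiled.finditer(value)]

    return extract_all


extract_digits = make_extract_all(r"\d+")
print(extract_digits("a1b22c333"))
```

A function built this way could be handed to whatever per-element mapping hook the backend offers; which hook Sparkless actually uses is not stated here.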



What to leave to robin-sparkless

  • select() / with_column() with Column expressions – Resolved in robin-sparkless 0.4.0 (issue #182). Sparkless requires 0.4.0+ and uses these APIs.

  • Filter Column–Column, filter(bool), lit(date/datetime), Window API – Resolved in 0.4.0 (issues #184–#187).

  • select() with some expressions (e.g. regexp_extract_all) – robin-sparkless issue #176 if still needed.

  • join(on="col") – the single-column string form is not accepted; Sparkless passes a list instead; robin-sparkless issue #175.

  • Worker crashes under pytest-xdist – robin-sparkless issue #178 (fork-safety / stability).

No Sparkless code change is required for the remaining items; we work around or skip where needed and track upstream fixes.
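
The join(on="col") workaround amounts to normalizing every accepted form of on= into a list of column names before calling the backend. The sketch below is illustrative only: Col, Eq, and join_on_to_names are hypothetical stand-ins, not the Sparkless classes or the actual _join_on_to_column_names() code, and the restriction to same-named keys is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union


@dataclass(frozen=True)
class Col:
    """Stand-in for a column reference (not the Sparkless Column class)."""
    name: str


@dataclass(frozen=True)
class Eq:
    """Stand-in for a `left == right` join condition."""
    left: Col
    right: Col


def join_on_to_names(on: Union[str, Sequence[str], Eq]) -> List[str]:
    """Normalize on= into a list of key names (hypothetical helper)."""
    if isinstance(on, str):
        return [on]  # single column name -> one-element list
    if isinstance(on, Eq):
        # Assumption: both sides must name the same column.
        if on.left.name != on.right.name:
            raise ValueError("this sketch only handles same-named join keys")
        return [on.left.name]
    return list(on)  # already a sequence of names


print(join_on_to_names("id"))
print(join_on_to_names(["id", "region"]))
print(join_on_to_names(Eq(Col("id"), Col("id"))))
```

Checking for str first matters because a Python string is itself a sequence of characters; without that branch, "id" would be split into ["i", "d"].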


Quick reference: current Robin materializer support

| Operation | Supported | Notes |
| --- | --- | --- |
| filter | Yes | Simple col op literal; AND/OR; isin (translated to OR of equality). Method API (.gt, .lt, …). |
| select | Yes (0.4.0+) | Column names (strings) and Column expressions. |
| limit | Yes | Non-negative int. |
| orderBy | Yes | Column names + optional ascending. |
| withColumn | Yes (0.4.0+) | Expressions translatable by _expression_to_robin; robin-sparkless 0.4.0+ accepts Column in with_column(). |
| join | Yes | on= list or extracted from col==col; how= inner/left/right/outer/semi/anti. |
| union | Yes | |
| distinct | Yes | Deduplicate rows. |
| drop | Yes | Drop column(s) by name. |
| groupBy + agg | No | Recommended to add (high priority). |
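
The filter entry combines three translations: comparison operators mapped to Column methods, recursive AND/OR, and isin expanded into an OR of equality checks. The sketch below shows how those pieces can fit together. It is not the _simple_filter_to_robin() implementation: FakeColumn is a stand-in that only records calls as text, the tuple-based filter representation is invented for this example, and every method name other than .gt and .lt (including and_ and or_) is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Tuple

# Python operator -> Column method name. Only .gt and .lt are named in this
# document; the remaining method names are assumed for illustration.
METHOD_FOR_OP = {">": "gt", "<": "lt", ">=": "ge", "<=": "le", "==": "eq", "!=": "ne"}


@dataclass(frozen=True)
class FakeColumn:
    """Stand-in that records calls as text; not the robin-sparkless Column."""
    expr: str

    def _call(self, method: str, value: Any) -> "FakeColumn":
        return FakeColumn(f"{self.expr}.{method}({value!r})")

    def gt(self, value: Any) -> "FakeColumn":
        return self._call("gt", value)

    def lt(self, value: Any) -> "FakeColumn":
        return self._call("lt", value)

    def ge(self, value: Any) -> "FakeColumn":
        return self._call("ge", value)

    def le(self, value: Any) -> "FakeColumn":
        return self._call("le", value)

    def eq(self, value: Any) -> "FakeColumn":
        return self._call("eq", value)

    def ne(self, value: Any) -> "FakeColumn":
        return self._call("ne", value)

    def and_(self, other: "FakeColumn") -> "FakeColumn":
        return FakeColumn(f"({self.expr} AND {other.expr})")

    def or_(self, other: "FakeColumn") -> "FakeColumn":
        return FakeColumn(f"({self.expr} OR {other.expr})")


def translate(node: Tuple[str, Any, Any]) -> FakeColumn:
    """Translate an (op, left, right) filter tuple into method calls."""
    op, left, right = node
    if op in ("&", "|"):
        # AND/OR: translate both operands recursively, then combine.
        lhs, rhs = translate(left), translate(right)
        return lhs.and_(rhs) if op == "&" else lhs.or_(rhs)
    if op == "isin":
        values = list(right)
        if not values:
            raise ValueError("isin with no values is not handled in this sketch")
        # isin -> OR of equality checks, one per value.
        result = FakeColumn(left).eq(values[0])
        for value in values[1:]:
            result = result.or_(FakeColumn(left).eq(value))
        return result
    if op not in METHOD_FOR_OP:
        raise NotImplementedError(f"unsupported operator: {op}")
    # Simple "col op literal": call the method instead of the Python operator.
    return getattr(FakeColumn(left), METHOD_FOR_OP[op])(right)


condition = ("&", (">", "age", 18), ("isin", "state", ["CA", "NY"]))
print(translate(condition).expr)
# prints: (age.gt(18) AND (state.eq('CA') OR state.eq('NY')))
```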