Improving Sparkless for Robin-Sparkless

This document lists concrete ways to improve Sparkless so it works better with the Robin backend (robin-sparkless). It is based on the Robin mode failure analysis and test report.


1. Summary

Area                                               Priority  Owner                        Status / notes
-------------------------------------------------  --------  ---------------------------  --------------
Join: support Column expression as on              High      Sparkless                    Done – materializer accepts col == col and extracts join keys; Robin path no longer raises unsupported. Join result is 0 rows → see "Join result" below.
Join: fix 0-row result (inner/left/right/outer)    High      Sparkless / robin-sparkless  Robin materializer runs join but returns 0 rows; debug robin join(on=..., how=...) and/or our collect() → Row conversion.
Filter: support AND/OR and more expressions        High      Sparkless                    Done – _simple_filter_to_robin handles ColumnOperation("&", ...) and ColumnOperation("|", ...) recursively.
groupBy + agg in Robin materializer                High      Sparkless                    Add groupBy (and agg) to SUPPORTED_OPERATIONS and translate to robin_sparkless group_by + GroupedData aggregations.
Semi/anti join support in Robin                    Medium    Sparkless / robin-sparkless  Done – _can_handle_join and materialize accept left_semi / left_anti; robin may support or raise at runtime.
Column comparison / TypeError in fixtures          Medium    Sparkless                    Avoid comparing or ordering raw Column objects in assertion/sort helpers; resolve to values or use supported expressions.
Parquet/table append semantics for Robin           Medium    Sparkless                    If append/visibility under Robin storage is wrong, fix storage/session logic for the Robin delegate.
Test/fixture expectations (robin in backend list)  Low       Sparkless                    Done – include 'robin' where tests assume backends (e.g. unified infrastructure example).
Robin storage db_path                              Low       Sparkless                    Done – added so teardown no longer raises.
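
The "Column comparison / TypeError in fixtures" item can be illustrated with a minimal sketch. The Column class and the sort_rows helper here are hypothetical stand-ins, not the Sparkless classes or the actual code in tests/fixtures/comparison.py. The point is only that a helper sorts by resolved values looked up by column name, instead of ordering expression objects directly.

```python
from typing import Any, Dict, List, Sequence, Union


class Column:
    """Hypothetical stand-in for an unevaluated column expression."""

    def __init__(self, name: str) -> None:
        self.name = name

    # No ordering methods are defined, so sorted([Column("a"), Column("b")])
    # raises TypeError, which is the failure mode the fixtures should avoid.


def column_name(key: Union[str, Column]) -> str:
    """Resolve a sort key to a plain column name."""
    return key.name if isinstance(key, Column) else key


def sort_rows(
    rows: List[Dict[str, Any]], keys: Sequence[Union[str, Column]]
) -> List[Dict[str, Any]]:
    """Sort row dicts by the values of the given columns, None values first."""
    names = [column_name(k) for k in keys]

    def sort_key(row: Dict[str, Any]) -> tuple:
        # The leading boolean keeps None from being compared with real values.
        return tuple(
            (row[n] is not None, row[n] if row[n] is not None else 0)
            for n in names
        )

    return sorted(rows, key=sort_key)


rows = [{"id": 2, "name": "b"}, {"id": None, "name": "z"}, {"id": 1, "name": "a"}]
ordered = sort_rows(rows, [Column("id")])
print([r["id"] for r in ordered])
```

The helper accepts either a string or a Column-like key, so existing call sites that pass expressions keep working while the comparison itself only ever touches plain values.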


2. Implemented

  • Robin storage db_path – RobinStorageManager exposes db_path; parquet teardown ERRORs are fixed.

  • Arbitrary schema – Materializer uses create_dataframe_from_rows(data, schema) and _spark_type_to_robin_dtype for any schema.

  • orderBy, withColumn, join, union – Materializer supports these; join accepts both string/list and Column expression (col == col) via _join_on_to_column_names().

  • Join on Column expression – _can_handle_join and join materialization use _join_on_to_column_names(on) so df.join(other, df.a == other.a, "inner") is handled by Robin instead of raising unsupported.

  • Filter AND/OR – _simple_filter_to_robin handles ColumnOperation("&", left, right) and ColumnOperation("|", left, right) recursively so combined conditions are translated to Robin.

  • Semi/anti join – _can_handle_join accepts left_semi, left_anti, semi, anti; materialize passes them through to Robin, which may either support them or raise at runtime.

  • Join Row conversion – Materializer uses SchemaManager.project_schema_with_operations to build a final schema and passes it to Row(d, schema=final_schema) so column order and duplicate names after join are correct.

  • Tests/fixtures – Backend list includes 'robin' where needed (e.g. unified infrastructure example).
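
As a rough illustration of the join-key extraction and the recursive AND/OR filter handling listed above, here is a self-contained sketch. Column, Literal and ColumnOperation are hypothetical stand-ins, and the two functions are not the real _join_on_to_column_names or _simple_filter_to_robin. In particular, the filter is translated to a plain Python predicate so the example can run without robin-sparkless, and returning None for keys with different names is an assumption of this sketch.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Column:
    """Hypothetical stand-in for a column reference."""
    name: str


@dataclass
class Literal:
    """Hypothetical stand-in for a constant value."""
    value: Any


@dataclass
class ColumnOperation:
    """Hypothetical stand-in: an operator applied to two operands."""
    operation: str
    left: Any
    right: Any


def join_on_to_column_names(on: Any) -> Optional[List[str]]:
    """Return join key names, or None if `on` cannot be reduced to names."""
    if isinstance(on, str):
        return [on]
    if isinstance(on, (list, tuple)):
        return list(on) if all(isinstance(k, str) for k in on) else None
    if (
        isinstance(on, ColumnOperation)
        and on.operation == "=="
        and isinstance(on.left, Column)
        and isinstance(on.right, Column)
        and on.left.name == on.right.name  # assumption: same-named keys only
    ):
        return [on.left.name]
    return None


_COMPARISONS = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
}

Predicate = Callable[[Dict[str, Any]], bool]


def filter_to_predicate(expr: Any) -> Optional[Predicate]:
    """Recursively translate an expression tree; None means unsupported."""
    if isinstance(expr, ColumnOperation) and expr.operation in ("&", "|"):
        left = filter_to_predicate(expr.left)
        right = filter_to_predicate(expr.right)
        if left is None or right is None:
            return None  # one unsupported branch makes the whole filter unsupported
        if expr.operation == "&":
            return lambda row: left(row) and right(row)
        return lambda row: left(row) or right(row)
    if (
        isinstance(expr, ColumnOperation)
        and expr.operation in _COMPARISONS
        and isinstance(expr.left, Column)
        and isinstance(expr.right, Literal)
    ):
        compare = _COMPARISONS[expr.operation]
        name, value = expr.left.name, expr.right.value
        return lambda row: compare(row[name], value)
    return None


# age > 18 AND (city == "Oslo" OR city == "Rome")
cond = ColumnOperation(
    "&",
    ColumnOperation(">", Column("age"), Literal(18)),
    ColumnOperation(
        "|",
        ColumnOperation("==", Column("city"), Literal("Oslo")),
        ColumnOperation("==", Column("city"), Literal("Rome")),
    ),
)
people = [
    {"name": "Ann", "age": 30, "city": "Oslo"},
    {"name": "Bob", "age": 17, "city": "Rome"},
    {"name": "Cy", "age": 40, "city": "Paris"},
]
pred = filter_to_predicate(cond)
print([p["name"] for p in people if pred(p)])
```

Returning None rather than raising lets the caller decide whether to fall back to another backend or to report a clear unsupported error, which matches the "handled by Robin instead of raising unsupported" behavior described in the list.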



4. Code locations

Change                                      File(s)
------------------------------------------  -------
Join on expression, filter AND/OR, groupBy  sparkless/backend/robin/materializer.py
Column comparison in tests                  tests/fixtures/comparison.py (and any assertion helpers that compare Column objects)
Robin storage                               sparkless/backend/robin/storage.py


5. How to validate

  • Run full suite in Robin mode and compare pass/fail counts to the last report:

    SPARKLESS_TEST_BACKEND=robin SPARKLESS_BACKEND=robin pytest tests/ --ignore=tests/archive -n 10 --dist loadfile -v --tb=short 2>&1 | tee tests/robin_mode_test_results.txt
    
  • Run only parity join tests to confirm join path and row counts:

    SPARKLESS_TEST_BACKEND=robin SPARKLESS_BACKEND=robin pytest tests/parity/dataframe/test_join.py -v --tb=short
    
  • After adding filter AND/OR or groupBy, run filter/groupby parity tests in Robin mode and check that they pass or fail with a clear, non-unsupported error.
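
Comparing pass/fail counts against the last report can be scripted. The sketch below is one possible approach, assuming both reports end with the usual pytest summary line (for example "== 12 failed, 340 passed in 45.67s =="). The function names and the sample report strings are made up for illustration; in practice the text would be read from tests/robin_mode_test_results.txt and from a saved copy of the previous run.

```python
import re
from typing import Dict

# Matches entries such as "12 failed", "340 passed" or "1 error".
_COUNT = re.compile(r"(\d+) (passed|failed|errors?|skipped|xfailed|xpassed)")


def parse_summary(report_text: str) -> Dict[str, int]:
    """Return outcome counts from the last pytest summary line in the text."""
    for line in reversed(report_text.splitlines()):
        # The final summary line starts with "=" and contains " in <time>s".
        if line.startswith("=") and " in " in line:
            counts: Dict[str, int] = {}
            for number, outcome in _COUNT.findall(line):
                key = "error" if outcome.startswith("error") else outcome
                counts[key] = int(number)
            if counts:
                return counts
    return {}


def diff_counts(previous: Dict[str, int], current: Dict[str, int]) -> Dict[str, int]:
    """Return current minus previous for every outcome seen in either run."""
    outcomes = sorted(set(previous) | set(current))
    return {o: current.get(o, 0) - previous.get(o, 0) for o in outcomes}


# Sample report tails (invented numbers, for illustration only).
previous_report = "...\n===== 12 failed, 340 passed, 5 skipped in 45.67s =====\n"
current_report = "...\n===== 7 failed, 345 passed, 5 skipped, 1 error in 40.10s =====\n"

delta = diff_counts(parse_summary(previous_report), parse_summary(current_report))
for outcome, change in delta.items():
    print(f"{outcome}: {change:+d}")
```

A negative number for failed together with a positive number for passed indicates progress; a new error count usually points at fixture or teardown problems rather than at the operation under test.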


6. robin-sparkless (upstream)

Robin-sparkless already exposes the APIs we need (arbitrary schema, filter, select, with_column, order_by, join, union, group_by, GroupedData). No upstream feature request is required for "more operations" or "flexible schema." Concrete bugs (wrong result, wrong row count, missing behavior) should be reported upstream with a minimal repro; see robin_sparkless_issues.md.