Robin Mode Ownership Analysis: Sparkless vs robin-sparkless

Date: 2026-02-06
Reference run: test_results_robin_mode.txt (1,656 failed, 1,050 passed, 22 skipped)
Purpose: Determine whether failures require Sparkless adaptations or robin-sparkless fixes.


Executive Summary

Most failures require Sparkless adaptations. The robin-sparkless package exposes the APIs needed (filter, select, join, group_by, etc.), but uses a method-based comparison API (e.g. col.gt(lit)) instead of Python operators (col > lit). Sparkless currently generates operator-based expressions, which fail. Additionally, Sparkless’ Robin materializer has narrow can_handle rules and supports only a subset of operations.

robin-sparkless issues identified: Column class does not implement __gt__, __lt__, etc.; Python col > lit raises TypeError. Upstream fix would be adding operator overloads for PySpark compatibility.


1. Failure Categories

Category A: SparkUnsupportedOperationError (Sparkless)

Cause: The Robin materializer’s can_handle_operation() returns False, so Sparkless fails fast by raising SparkUnsupportedOperationError.

Examples:

  • test_create_map_with_literals: Operation 'Operations: select' is not supported. A select with Column expressions (e.g. F.create_map(...).alias("map_col")) fails because the materializer only accepts select([str, str, ...]), not Column objects.

  • UDF tests, withField tests, window comparison tests β€” these use operations the materializer does not declare support for.

Owner: Sparkless. Extend can_handle_operation and add translation for:

  • select with Column/expression payloads

  • create_map, withField, UDFs (if robin-sparkless supports them; otherwise document as unsupported)


Category B: Row count mismatch mock=0 (Mixed – Sparkless translation bug confirmed)

Cause: Filter/join returns 0 rows when it should return N.

Root cause (confirmed): Sparkless Robin materializer uses Python operators:

return robin_col > robin_lit   # Fails!

robin-sparkless Column does not implement __gt__; it raises:

TypeError: '>' not supported between instances of 'builtins.Column' and 'builtins.Column'

robin-sparkless expects the method-based API:

robin_col.gt(robin_lit)  # Works

Owner: Sparkless. Change _simple_filter_to_robin() in sparkless/backend/robin/materializer.py to use .gt(), .lt(), .ge(), .le(), .eq(), .ne() instead of >, <, etc.

Optional robin-sparkless improvement: Add __gt__, __lt__, etc. for PySpark-style col > lit usage.
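The operator-to-method change described above can be sketched as a small lookup. This is a minimal illustration, not the actual body of _simple_filter_to_robin(): the helper name compare, the COMPARISON_METHODS table, and the FakeColumn stand-in are all hypothetical, and only the method names (.gt, .lt, .ge, .le, .eq, .ne) come from this report.

```python
# Operator symbols mapped to the method names this report lists for Column.
COMPARISON_METHODS = {
    ">": "gt",
    "<": "lt",
    ">=": "ge",
    "<=": "le",
    "==": "eq",
    "!=": "ne",
}


def compare(robin_col, op, robin_lit):
    """Build a comparison via the method form, e.g. robin_col.gt(robin_lit)."""
    if op not in COMPARISON_METHODS:
        raise ValueError(f"unsupported comparison operator: {op!r}")
    return getattr(robin_col, COMPARISON_METHODS[op])(robin_lit)


class FakeColumn:
    """Hypothetical stand-in: has comparison methods but no operator overloads."""

    def __init__(self, name):
        self.name = name

    def _cmp(self, method, other):
        # Return a readable string so the chosen method is visible.
        return f"{self.name}.{method}({other.name})"

    def gt(self, other):
        return self._cmp("gt", other)

    def lt(self, other):
        return self._cmp("lt", other)

    def ge(self, other):
        return self._cmp("ge", other)

    def le(self, other):
        return self._cmp("le", other)

    def eq(self, other):
        return self._cmp("eq", other)

    def ne(self, other):
        return self._cmp("ne", other)


if __name__ == "__main__":
    print(compare(FakeColumn("age"), ">", FakeColumn("lit_30")))
```

Keeping the mapping in one table means an unknown operator fails with an explicit ValueError rather than silently producing a wrong filter.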


Category C: Operations robin-sparkless may not support

Examples: create_map, UDFs, rlike lookaround, withField, window functions with complex expressions.

Owner: Depends on robin-sparkless API:

  • If robin-sparkless exposes the function → Sparkless must translate to it.

  • If not → either Sparkless documents as unsupported (fail-fast) or robin-sparkless adds support.
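The ownership decision above could be automated with a simple capability probe. The sketch below assumes that robin-sparkless functions are reachable as attributes of some module-like object, which this report does not confirm; classify_operation and the SimpleNamespace stand-in are hypothetical names used only for illustration.

```python
from types import SimpleNamespace


def classify_operation(robin_api, function_name):
    """Return 'translate' if robin_api exposes a callable with this name,
    otherwise 'unsupported' (document as fail-fast or request upstream)."""
    candidate = getattr(robin_api, function_name, None)
    return "translate" if callable(candidate) else "unsupported"


if __name__ == "__main__":
    # Stand-in for the real API object; the exposed names here are made up.
    fake_api = SimpleNamespace(create_map=lambda *args: ("create_map", args))
    for name in ("create_map", "withField"):
        print(name, "->", classify_operation(fake_api, name))
```

Run against the real package, a probe like this would only show whether a name exists. It says nothing about whether the function behaves like its PySpark counterpart, so Category D isolation still applies.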


Category D: AssertionError / wrong result (needs isolation)

Cause: Result shape or values differ from expected.

Owner: Requires per-test isolation:

  • If translation to robin-sparkless is correct → likely robin-sparkless bug.

  • If translation is wrong → Sparkless bug.


2. Action Items

Sparkless (high priority)

| Item | File | Description |
|------|------|-------------|
| 1 | sparkless/backend/robin/materializer.py | Use col.gt(lit), col.lt(lit), etc. instead of col > lit in _simple_filter_to_robin() |
| 2 | sparkless/backend/robin/materializer.py | Broaden select handling for Column expressions where possible |
| 3 | sparkless/backend/robin/materializer.py | Extend can_handle_operation for more op shapes (with clear docs on limits) |

Sparkless (medium priority)

| Item | Description |
|------|-------------|
| 4 | Document which operations are supported vs unsupported for Robin backend |
| 5 | Consider a fallback path: when Robin cannot handle an op, use Polars materializer if available |
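The fallback path in item 5 might look roughly like the sketch below. It assumes each materializer offers can_handle_operation(op_name) taking an operation name and a materialize(data, op_names) method. Both signatures are assumptions, and StubMaterializer and the local SparkUnsupportedOperationError class are stand-ins, not Sparkless code.

```python
class SparkUnsupportedOperationError(Exception):
    """Local stand-in for the error named in Category A."""


def materialize_with_fallback(op_names, data, primary, fallback=None):
    """Use primary if it handles every op; otherwise try fallback; else fail fast."""
    unsupported = [op for op in op_names if not primary.can_handle_operation(op)]
    if not unsupported:
        return primary.materialize(data, op_names)
    if fallback is not None and all(
        fallback.can_handle_operation(op) for op in op_names
    ):
        return fallback.materialize(data, op_names)
    raise SparkUnsupportedOperationError(
        f"Operations not supported: {', '.join(unsupported)}"
    )


class StubMaterializer:
    """Hypothetical materializer that supports a fixed set of operation names."""

    def __init__(self, name, supported):
        self.name = name
        self.supported = set(supported)

    def can_handle_operation(self, op_name):
        return op_name in self.supported

    def materialize(self, data, op_names):
        # Report which backend ran, instead of doing real work.
        return (self.name, list(op_names))


if __name__ == "__main__":
    robin = StubMaterializer("robin", {"filter"})
    polars = StubMaterializer("polars", {"filter", "select"})
    print(materialize_with_fallback(["filter", "select"], None, robin, polars))
```

One design consideration: a silent fallback would hide Robin gaps during parity runs, so a test mode would probably want the fallback disabled (fallback=None) to keep the fail-fast behavior.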

robin-sparkless (optional)

| Item | Description |
|------|-------------|
| 1 | Add __gt__, __lt__, __ge__, __le__, __eq__, __ne__ to Column for PySpark compatibility |
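For illustration, the operator overloads in this item amount to delegating each dunder method to the existing method-based API. The sketch below does this in pure Python with a hypothetical mixin and a DemoColumn stand-in. The builtins.Column type name in the error message suggests the real Column is an extension type, so the actual change would likely be made in that layer instead (an inference, not confirmed here).

```python
class ComparisonOperatorsMixin:
    """Delegate Python comparison operators to gt/lt/ge/le/eq/ne methods."""

    def __gt__(self, other):
        return self.gt(other)

    def __lt__(self, other):
        return self.lt(other)

    def __ge__(self, other):
        return self.ge(other)

    def __le__(self, other):
        return self.le(other)

    def __eq__(self, other):
        return self.eq(other)

    def __ne__(self, other):
        return self.ne(other)

    # Defining __eq__ makes instances unhashable unless __hash__ is restored.
    __hash__ = object.__hash__


class DemoColumn(ComparisonOperatorsMixin):
    """Hypothetical column that records the expression it represents."""

    def __init__(self, expr):
        self.expr = expr

    def _binary(self, symbol, other):
        rhs = other.expr if isinstance(other, DemoColumn) else repr(other)
        return DemoColumn(f"({self.expr} {symbol} {rhs})")

    def gt(self, other):
        return self._binary(">", other)

    def lt(self, other):
        return self._binary("<", other)

    def ge(self, other):
        return self._binary(">=", other)

    def le(self, other):
        return self._binary("<=", other)

    def eq(self, other):
        return self._binary("==", other)

    def ne(self, other):
        return self._binary("!=", other)


if __name__ == "__main__":
    print((DemoColumn("age") > DemoColumn("limit")).expr)
```

Note that once __eq__ returns a column expression rather than a bool, code that relies on plain equality between Column objects (for example membership tests in lists) changes behavior, which is worth checking before adopting this upstream.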


3. Verification

After fixing the filter translation (item 1):

# Should pass filter and join parity
SPARKLESS_TEST_BACKEND=robin python -m pytest \
  tests/parity/dataframe/test_filter.py::TestFilterParity::test_filter_operations \
  tests/parity/dataframe/test_join.py::TestJoinParity::test_inner_join \
  -v --no-cov

4. Test Run Commands

# Full Robin run
SPARKLESS_TEST_BACKEND=robin SPARKLESS_BACKEND=robin python -m pytest \
  tests/ --ignore=tests/archive -n 12 --dist loadfile -v --tb=short \
  2>&1 | tee test_results_robin_mode.txt

# Sample with long tracebacks
SPARKLESS_TEST_BACKEND=robin python scripts/run_robin_failure_sample.py --max 50