# Robin Mode Ownership Analysis: Sparkless vs robin-sparkless **Date:** 2026-02-06 **Reference run:** `test_results_robin_mode.txt` — 1,656 failed, 1,050 passed, 22 skipped **Purpose:** Determine whether failures require Sparkless adaptations or robin-sparkless fixes. --- ## Executive Summary **Most failures require Sparkless adaptations.** The robin-sparkless package exposes the APIs needed (filter, select, join, group_by, etc.), but uses a **method-based comparison API** (e.g. `col.gt(lit)`) instead of Python operators (`col > lit`). Sparkless currently generates operator-based expressions, which fail. Additionally, Sparkless’ Robin materializer has narrow `can_handle` rules and supports only a subset of operations. **robin-sparkless issues identified:** Column class does not implement `__gt__`, `__lt__`, etc.; Python `col > lit` raises `TypeError`. Upstream fix would be adding operator overloads for PySpark compatibility. --- ## 1. Failure Categories ### Category A: SparkUnsupportedOperationError (Sparkless) **Cause:** Robin materializer’s `can_handle_operation()` returns False; Sparkless raises fail-fast. **Examples:** - `test_create_map_with_literals`: `Operation 'Operations: select' is not supported` — select with Column expressions (e.g. `F.create_map(...).alias("map_col")`) fails because materializer only accepts `select([str, str, ...])`, not Column objects. - UDF tests, withField tests, window comparison tests — these use operations the materializer does not declare support for. **Owner:** **Sparkless** — extend `can_handle_operation` and add translation for: - select with Column/expression payloads - create_map, withField, UDFs (if robin-sparkless supports them; otherwise document as unsupported) --- ### Category B: Row count mismatch mock=0 (Mixed – Sparkless translation bug confirmed) **Cause:** Filter/join returns 0 rows when it should return N. **Root cause (confirmed):** Sparkless Robin materializer uses Python operators: ```python return robin_col > robin_lit # Fails! ``` robin-sparkless `Column` does **not** implement `__gt__`; it raises: ``` TypeError: '>' not supported between instances of 'builtins.Column' and 'builtins.Column' ``` robin-sparkless expects the method-based API: ```python robin_col.gt(robin_lit) # Works ``` **Owner:** **Sparkless** — change `_simple_filter_to_robin()` in `sparkless/backend/robin/materializer.py` to use `.gt()`, `.lt()`, `.ge()`, `.le()`, `.eq()`, `.ne()` instead of `>`, `<`, etc. **Optional robin-sparkless improvement:** Add `__gt__`, `__lt__`, etc. for PySpark-style `col > lit` usage. --- ### Category C: Operations robin-sparkless may not support **Examples:** create_map, UDFs, rlike lookaround, withField, window functions with complex expressions. **Owner:** Depends on robin-sparkless API: - If robin-sparkless exposes the function → **Sparkless** must translate to it. - If not → either **Sparkless** documents as unsupported (fail-fast) or **robin-sparkless** adds support. --- ### Category D: AssertionError / wrong result (needs isolation) **Cause:** Result shape or values differ from expected. **Owner:** Requires per-test isolation: - If translation to robin-sparkless is correct → likely **robin-sparkless** bug. - If translation is wrong → **Sparkless** bug. --- ## 2. Action Items ### Sparkless (high priority) | Item | File | Description | |------|------|-------------| | 1 | `sparkless/backend/robin/materializer.py` | Use `col.gt(lit)`, `col.lt(lit)`, etc. instead of `col > lit` in `_simple_filter_to_robin()` | | 2 | `sparkless/backend/robin/materializer.py` | Broaden select handling for Column expressions where possible | | 3 | `sparkless/backend/robin/materializer.py` | Extend `can_handle_operation` for more op shapes (with clear docs on limits) | ### Sparkless (medium priority) | Item | Description | |------|-------------| | 4 | Document which operations are supported vs unsupported for Robin backend | | 5 | Consider a fallback path: when Robin cannot handle an op, use Polars materializer if available | ### robin-sparkless (optional) | Item | Description | |------|-------------| | 1 | Add `__gt__`, `__lt__`, `__ge__`, `__le__`, `__eq__`, `__ne__` to Column for PySpark compatibility | --- ## 3. Verification After fixing the filter translation (item 1): ```bash # Should pass filter and join parity SPARKLESS_TEST_BACKEND=robin python -m pytest \ tests/parity/dataframe/test_filter.py::TestFilterParity::test_filter_operations \ tests/parity/dataframe/test_join.py::TestJoinParity::test_inner_join \ -v --no-cov ``` --- ## 4. Test Run Commands ```bash # Full Robin run SPARKLESS_TEST_BACKEND=robin SPARKLESS_BACKEND=robin python -m pytest \ tests/ --ignore=tests/archive -n 12 --dist loadfile -v --tb=short \ 2>&1 | tee test_results_robin_mode.txt # Sample with long tracebacks SPARKLESS_TEST_BACKEND=robin python scripts/run_robin_failure_sample.py --max 50 ```