# Improving Sparkless for Robin-Sparkless

This document lists concrete ways to improve Sparkless so it works better with the Robin backend (robin-sparkless). It is based on the [Robin mode failure analysis](robin_mode_failure_analysis.md) and [test report](robin_mode_test_report.md).

---

## 1. Summary

| Area | Priority | Owner | Status / notes |
|------|----------|--------|----------------|
| Join: support Column expression as `on` | High | Sparkless | **Done** – materializer accepts `col == col` and extracts join keys; Robin path no longer raises unsupported. Join result is 0 rows → see “Join result” below. |
| Join: fix 0-row result (inner/left/right/outer) | High | Sparkless / robin-sparkless | Robin materializer runs join but returns 0 rows; debug robin `join(on=..., how=...)` and/or our `collect()` → Row conversion. |
| Filter: support AND/OR and more expressions | High | Sparkless | **Done** – `_simple_filter_to_robin` handles `ColumnOperation("&", ...)` and `ColumnOperation("\|", ...)` recursively. |
| groupBy + agg in Robin materializer | High | Sparkless | Add `groupBy` (and agg) to `SUPPORTED_OPERATIONS` and translate to robin_sparkless `group_by` + GroupedData aggregations. |
| Semi/anti join support in Robin | Medium | Sparkless / robin-sparkless | **Done** – `_can_handle_join` and materialize accept `left_semi` / `left_anti`; robin may support or raise at runtime. |
| Column comparison / TypeError in fixtures | Medium | Sparkless | Avoid comparing or ordering raw `Column` objects in assertion/sort helpers; resolve to values or use supported expressions. |
| Parquet/table append semantics for Robin | Medium | Sparkless | If append/visibility under Robin storage is wrong, fix storage/session logic for the Robin delegate. |
| Test/fixture expectations (robin in backend list) | Low | Sparkless | **Done** – include `'robin'` where tests assume backends (e.g. unified infrastructure example). |
| Robin storage `db_path` | Low | Sparkless | **Done** – added so teardown no longer raises. |

---

## 2. Implemented

- **Robin storage `db_path`** – `RobinStorageManager` exposes `db_path`; parquet teardown ERRORs are fixed.
- **Arbitrary schema** – Materializer uses `create_dataframe_from_rows(data, schema)` and `_spark_type_to_robin_dtype` for any schema.
- **orderBy, withColumn, join, union** – Materializer supports these; join accepts both string/list and Column expression (`col == col`) via `_join_on_to_column_names()`.
- **Join on Column expression** – `_can_handle_join` and join materialization use `_join_on_to_column_names(on)` so `df.join(other, df.a == other.a, "inner")` is handled by Robin instead of raising unsupported.
- **Filter AND/OR** – `_simple_filter_to_robin` handles `ColumnOperation("&", left, right)` and `ColumnOperation("|", left, right)` recursively so combined conditions are translated to Robin.
- **Semi/anti join** – `_can_handle_join` accepts `left_semi`, `left_anti`, `semi`, `anti`; materialize passes them through to robin (robin may support or raise at runtime).
- **Join Row conversion** – Materializer uses `SchemaManager.project_schema_with_operations` to build a final schema and passes it to `Row(d, schema=final_schema)` so column order and duplicate names after join are correct.
- **Tests/fixtures** – Backend list includes `'robin'` where needed (e.g. unified infrastructure example).

---

## 3. Recommended next steps (in order)

### 3.1 Fix join result (0 rows)

Robin join path runs but returns 0 rows for inner/left/right/outer.

- **Sparkless:** Check how we convert `df.collect()` to `List[Row]` (column names, schema, duplicate names after join). Ensure we’re not dropping or mis-mapping columns.
- **robin-sparkless:** Verify `join(other, on=["dept_id"], how="inner")` with two DataFrames that both have `dept_id` returns the expected rows; if not, report upstream or adapt our call (e.g. left_on/right_on if the API supports it).

### 3.2 Extend filter translation (AND/OR)

Many failures are `SparkUnsupportedOperationError` for filter.

- In `sparkless/backend/robin/materializer.py`, extend `_simple_filter_to_robin` (or add a wrapper) to handle:
  - `ColumnOperation("&", left, right)` → `_simple_filter_to_robin(left) and _simple_filter_to_robin(right)` (and map to robin’s `&` or chained `.filter()`).
  - `ColumnOperation("|", left, right)` → same for OR.
- Recursively support nested AND/OR so more real-world filters are handled by Robin.

### 3.3 Add groupBy + agg to Robin materializer

- Add `"groupBy"` to `RobinMaterializer.SUPPORTED_OPERATIONS`.
- In the operations queue, groupBy is typically followed by a GroupedData agg; the payload may be `(group_by_columns, agg_exprs)` or similar (see how Polars materializer receives it).
- Translate to robin_sparkless: `df.group_by([...]).agg(...)` (or equivalent GroupedData API). Start with simple aggs (count, sum, min, max, avg).

### 3.4 Semi/anti join

- Check robin_sparkless API for left_semi / left_anti (or equivalent). If present, add `_can_handle_join` for `how in ("left_semi", "left_anti")` and translate in the join branch.
- If absent, keep raising unsupported for semi/anti or document as limitation.

---

## 4. Code locations

| Change | File(s) |
|--------|--------|
| Join on expression, filter AND/OR, groupBy | `sparkless/backend/robin/materializer.py` |
| Column comparison in tests | `tests/fixtures/comparison.py` (and any assertion helpers that compare Column objects) |
| Robin storage | `sparkless/backend/robin/storage.py` |

---

## 5. How to validate

- Run full suite in Robin mode and compare pass/fail counts to the last report:
  ```bash
  SPARKLESS_TEST_BACKEND=robin SPARKLESS_BACKEND=robin pytest tests/ --ignore=tests/archive -n 10 --dist loadfile -v --tb=short 2>&1 | tee tests/robin_mode_test_results.txt
  ```
- Run only parity join tests to confirm join path and row counts:
  ```bash
  SPARKLESS_TEST_BACKEND=robin SPARKLESS_BACKEND=robin pytest tests/parity/dataframe/test_join.py -v --tb=short
  ```
- After adding filter AND/OR or groupBy, run filter/groupby parity tests in Robin mode and check that they pass or fail with a clear, non-unsupported error.

---

## 6. robin-sparkless (upstream)

Robin-sparkless already exposes the APIs we need (arbitrary schema, filter, select, with_column, order_by, join, union, group_by, GroupedData). No upstream feature request is required for “more operations” or “flexible schema.” Concrete bugs (wrong result, wrong row count, missing behavior) should be reported upstream with a minimal repro; see [robin_sparkless_issues.md](robin_sparkless_issues.md).