Robin mode: worker crashes and INTERNALERROR (KeyError WorkerController)

What we saw

When running the test suite in Robin mode with 12 workers (-n 12), the run ended with:

[gw8] node down: Not properly terminated
replacing crashed worker gw8
...
[gw10] node down: Not properly terminated
replacing crashed worker gw10
[gw11] node down: Not properly terminated
replacing crashed worker gw11
INTERNALERROR> KeyError: <WorkerController gw14>
  ...
  File ".../xdist/scheduler/loadscope.py", line 275, in _assign_work_unit
    worker_collection = self.registered_collections[node]
KeyError: <WorkerController gw14>

In short: several workers crashed (“Not properly terminated”), and while replacing them xdist hit a KeyError in its scheduler.

Root cause (two layers)

1. Worker crashes (primary)

“Node down: Not properly terminated” means the worker process died abruptly (no normal Python teardown). Typical causes:

- Segfault or abort in native code (e.g. a Rust/C extension used from a forked worker)
- OOM kill
- Unhandled fatal signal

In Robin mode, the only native/runtime layer in the test process is robin-sparkless (Rust/Python bindings). So the most plausible cause is that some test triggers a code path in robin-sparkless that crashes the process when run inside a pytest-xdist worker (forked subprocess). That could be:

- Fork-safety: Rust/native state not safe after fork
- A bug in robin-sparkless that aborts or segfaults on certain operations
- Resource or threading issues when many workers use the library

We don’t get a Python traceback for the crash because the process is killed before Python can report it.
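
The effect can be reproduced in isolation with a throwaway child process. The sketch below is not part of the Sparkless suite: it assumes a POSIX system and uses os.abort() as a stand-in for a crash in native code, and the helper names run_crashing_child and describe_exit are made up for illustration. A negative return code from subprocess means the child was killed by that signal rather than exiting normally.

```python
import signal
import subprocess
import sys


def run_crashing_child() -> subprocess.CompletedProcess:
    """Start a child interpreter that aborts, standing in for a native crash."""
    # os.abort() raises SIGABRT in the child, so Python-level exception
    # handling and normal interpreter shutdown never run.
    code = "import os; os.abort()"
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a short description."""
    if returncode < 0:
        # On POSIX, subprocess reports death by signal as the negated
        # signal number.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    result = run_crashing_child()
    print(describe_exit(result.returncode))
```

The parent only learns which signal ended the child; which test or call triggered it has to be recovered some other way, which matches what the xdist controller reports for a crashed worker.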

2. KeyError in xdist (secondary)

When a worker crashes, pytest-xdist replaces it and continues. There is a known xdist bug: during that replacement flow, the scheduler can look up registered_collections[node] for a node that is not (or no longer) in the dict, leading to KeyError: <WorkerController gwX> (see pytest-xdist#714). So the INTERNALERROR is a consequence of the worker crashes plus xdist’s handling of crashed/replacement workers, not a bug in our tests or in Robin itself.
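
The failure mode can be illustrated with a toy model. This is not pytest-xdist code: only the attribute name registered_collections is taken from the traceback, and the class ToyScheduler, its methods, and the guarded variant are assumptions made up for this sketch.

```python
from __future__ import annotations


class ToyScheduler:
    """Toy model of a scheduler that tracks test collections per worker node.

    NOT pytest-xdist code; only the dict name mirrors the traceback.
    """

    def __init__(self) -> None:
        self.registered_collections: dict[str, list[str]] = {}

    def add_node_collection(self, node: str, collection: list[str]) -> None:
        # A node becomes known to the scheduler once it reports its collection.
        self.registered_collections[node] = collection

    def assign_work(self, node: str) -> list[str]:
        # Unconditional lookup: raises KeyError for a node that is not
        # (or no longer) in the dict.
        return self.registered_collections[node]

    def assign_work_guarded(self, node: str) -> list[str] | None:
        # Hypothetical defensive variant: treat an unknown node as not ready.
        return self.registered_collections.get(node)


if __name__ == "__main__":
    scheduler = ToyScheduler()
    scheduler.add_node_collection("gw8", ["test_a", "test_b"])
    try:
        # "gw14" stands for a replacement worker that has not registered yet.
        scheduler.assign_work("gw14")
    except KeyError as exc:
        print(f"KeyError: {exc}")
    print(scheduler.assign_work_guarded("gw14"))
```

The point of the toy is only that the KeyError depends on the order of events after a crash, not on anything a test does.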

What we can do

Sparkless side

Run Robin mode with fewer workers or serially
  This reduces concurrency and the chance of hitting crash-prone code paths or stressing robin-sparkless:
  - Fewer workers: SPARKLESS_TEST_BACKEND=robin bash tests/run_all_tests.sh -n 4
  - Serial: SPARKLESS_TEST_BACKEND=robin bash tests/run_all_tests.sh -n 0
    (Serial also avoids xdist entirely, so no KeyError from replacement workers.)

Document the behavior
  In docs/backend_selection or a Robin-specific note: when running with many workers, Robin mode may occasionally show “node down: Not properly terminated” and an INTERNALERROR (KeyError WorkerController). Workaround: use fewer workers or -n 0 for Robin.

Optionally default Robin to fewer workers
  In tests/run_all_tests.sh, when BACKEND=robin, set a lower default worker count (e.g. 4) unless the user overrides it with -n.
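
A related option on the Python side, if the suite is ever run with -n auto, is the pytest_xdist_auto_num_workers hook that pytest-xdist calls to resolve "auto". The sketch below would live in a conftest.py; the constant ROBIN_AUTO_WORKERS and the value 4 are assumptions taken from the example above, and the hook has no effect when an explicit -n N is passed, so it does not replace the change to tests/run_all_tests.sh.

```python
import os

# Hypothetical cap for Robin runs; 4 mirrors the example worker count above.
ROBIN_AUTO_WORKERS = 4


def pytest_xdist_auto_num_workers(config):
    """pytest-xdist hook consulted when the run uses ``-n auto``.

    Returning None leaves the decision to pytest-xdist's own default.
    """
    backend = os.environ.get("SPARKLESS_TEST_BACKEND", "")
    if backend.lower() == "robin":
        return ROBIN_AUTO_WORKERS
    return None
```

An explicit worker count on the command line still takes precedence, which keeps the "unless the user overrides with -n" behavior.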

pytest-xdist

This is an upstream bug (issue #714; there may be a fix in a PR). We can:
- Pin or upgrade to an xdist version that fixes the KeyError once one is released.
- Not block on it; reducing workers or using -n 0 for Robin avoids the replacement path that triggers the bug.

robin-sparkless

We should open an issue. The report should say:
- We run the Sparkless test suite with pytest-xdist (multiple forked workers) in Robin mode.
- Worker processes sometimes crash with “node down: Not properly terminated” (no Python exception, process dies).
- This suggests a crash or abort in native/Rust code when used from forked subprocess workers.
- Ask them to investigate: fork-safety, signal handling, and stability when the library is used from multiprocessing/forked workers.
- Optionally attach: a short log snippet (node down + replacing crashed worker) and the test modules that were running on the crashed workers (e.g. test_column_case_variations, test_string_arithmetic, parity/functions/test_string, parity/sql/test_dml) as potentially triggering tests.
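
To prepare the log snippet, the crashed worker ids can be pulled out of a saved run log. This is a minimal sketch that assumes the log lines have exactly the form shown in the excerpt at the top of this note; the function name crashed_workers and the SAMPLE string are made up for illustration.

```python
from __future__ import annotations

import re

# Matches lines such as "[gw8] node down: Not properly terminated".
NODE_DOWN = re.compile(
    r"^\[(gw\d+)\] node down: Not properly terminated", re.MULTILINE
)

# Sample copied from the log excerpt in this note.
SAMPLE = """\
[gw8] node down: Not properly terminated
replacing crashed worker gw8
[gw10] node down: Not properly terminated
replacing crashed worker gw10
[gw11] node down: Not properly terminated
replacing crashed worker gw11
"""


def crashed_workers(log_text: str) -> list[str]:
    """Return crashed worker ids in order of first appearance, no duplicates."""
    seen: dict[str, None] = {}
    for match in NODE_DOWN.finditer(log_text):
        seen.setdefault(match.group(1), None)
    return list(seen)


if __name__ == "__main__":
    print(crashed_workers(SAMPLE))
```

The resulting ids can then be matched against the per-worker test output to list which modules were running when each worker died.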

Summary

| Cause | Owner | Action |
| --- | --- | --- |
| Worker process crash (“Not properly terminated”) | Likely robin-sparkless (native code in forked workers) | Open robin-sparkless issue; run Robin with fewer workers or `-n 0` in the meantime |
| KeyError: WorkerController in xdist | pytest-xdist (issue #714) | Use fewer workers or serial for Robin to avoid replacement path; track xdist fix |

No change is required inside Sparkless test code; the failures are due to worker crashes and xdist’s handling of them.