Robin mode: worker crashes and INTERNALERROR (KeyError WorkerController)
What we saw
When running the test suite in Robin mode with 12 workers (-n 12), the run ended with:
[gw8] node down: Not properly terminated
replacing crashed worker gw8
...
[gw10] node down: Not properly terminated
replacing crashed worker gw10
[gw11] node down: Not properly terminated
replacing crashed worker gw11
INTERNALERROR> KeyError: <WorkerController gw14>
...
File ".../xdist/scheduler/loadscope.py", line 275, in _assign_work_unit
worker_collection = self.registered_collections[node]
KeyError: <WorkerController gw14>
So: multiple workers crashed (“Not properly terminated”), then xdist tried to replace them and hit a KeyError in the scheduler.
Root cause (two layers)
1. Worker crashes (primary)
The message “node down: Not properly terminated” means the worker process died abruptly, without normal Python teardown. Typical causes:
Segfault or abort in native code – e.g. Rust/C extension used from a forked worker
OOM kill
Unhandled fatal signal
In Robin mode, the only native/runtime layer in the test process is robin-sparkless (Rust/Python bindings). So the most plausible cause is that some test triggers a code path in robin-sparkless that crashes the process when run inside a pytest-xdist worker (forked subprocess). That could be:
Fork-safety: Rust/native state not safe after fork
A bug in robin-sparkless that aborts or segfaults on certain operations
Resource or threading issues when many workers use the library
We don’t get a Python traceback for the crash because the process is killed before Python can report it.
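Because no traceback is available, the exit status of the process is often the only clue. The sketch below is not part of the Sparkless tooling; the function names `describe_status` and `run_and_describe` are hypothetical. It relies on the common shell convention that a status above 128 means the process was terminated by signal number (status - 128).

```shell
#!/usr/bin/env bash
# Decode a shell exit status. By shell convention, a value above 128
# means "terminated by signal (status - 128)".
describe_status() {
  local status=$1
  if [ "$status" -gt 128 ]; then
    echo "killed by signal $(kill -l "$((status - 128))")"
  else
    echo "exited with status $status"
  fi
}

# Run any command and report how it ended.
run_and_describe() {
  local status=0
  "$@" || status=$?
  describe_status "$status"
}

# Stand-in for a crashing test process: a child shell that kills itself.
run_and_describe bash -c 'kill -KILL $$'   # prints: killed by signal KILL
```

To apply this to the suite, one option is `run_and_describe env SPARKLESS_TEST_BACKEND=robin bash tests/run_all_tests.sh -n 0`, assuming the script passes pytest's exit status through unchanged. A result such as SEGV or ABRT would point at native code, while KILL is more consistent with an OOM kill.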
2. KeyError in xdist (secondary)
When a worker crashes, pytest-xdist replaces it and continues. There is a known xdist bug: during that replacement flow, the scheduler can look up registered_collections[node] for a node that is not (or no longer) in the dict, leading to KeyError: <WorkerController gwX> (see pytest-xdist#714). So the INTERNALERROR is a consequence of the worker crashes plus xdist’s handling of crashed/replacement workers, not a bug in our tests or in Robin itself.
What we can do
Sparkless side
Run Robin mode with fewer workers or serial
Reduces concurrency and the chance of hitting crashy paths or stressing robin-sparkless:
Fewer workers:
SPARKLESS_TEST_BACKEND=robin bash tests/run_all_tests.sh -n 4
Serial:
SPARKLESS_TEST_BACKEND=robin bash tests/run_all_tests.sh -n 0
(Serial also avoids xdist entirely, so no KeyError from replacement workers.)
Document the behavior
In docs/backend_selection or a Robin-specific note: when running with many workers, Robin mode may occasionally show “node down: Not properly terminated” and an INTERNALERROR (KeyError WorkerController). Workaround: use fewer workers or -n 0 for Robin.
Optionally default Robin to fewer workers
In tests/run_all_tests.sh, when BACKEND=robin, you could set a lower default worker count (e.g. 4) unless the user overrides with -n.
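The lower default for Robin could be sketched roughly as follows. This is an illustration only: `pick_workers` and its arguments are hypothetical names, and how tests/run_all_tests.sh actually parses -n or chooses its current default is not shown here, so the fallback value is passed in explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical helper for tests/run_all_tests.sh; all names are illustrative.
#   $1 = backend name
#   $2 = worker count requested via -n (empty if the user gave none)
#   $3 = default worker count for other backends
pick_workers() {
  local backend=$1 requested=$2 fallback=$3
  if [ -n "$requested" ]; then
    echo "$requested"   # an explicit -n always wins
  elif [ "$backend" = "robin" ]; then
    echo 4              # lower default for Robin mode
  else
    echo "$fallback"
  fi
}

pick_workers robin "" 12   # prints 4
pick_workers robin 0 12    # prints 0
pick_workers other "" 12   # prints 12
```

Keeping the explicit -n value first means anyone who wants to reproduce the crash with 12 workers can still do so.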
pytest-xdist
This is an upstream bug (issue #714; there may be a fix in a PR). We can:
Pin or upgrade to an xdist version that fixes the KeyError when it’s released.
Not block on it; using -n 0 for Robin avoids the replacement path that triggers the bug, and running with fewer workers makes it less likely to be reached.
robin-sparkless
We should open an issue. The report should say:
We run the Sparkless test suite with pytest-xdist (multiple forked workers) in Robin mode.
Worker processes sometimes crash with “node down: Not properly terminated” (no Python exception, process dies).
This suggests a crash or abort in native/Rust code when used from forked subprocess workers.
Ask them to investigate: fork-safety, signal handling, and stability when the library is used from multiprocessing/forked workers.
Optionally attach: a short log snippet (node down + replacing crashed worker) and the test modules that were running on the crashed workers (e.g. test_column_case_variations, test_string_arithmetic, parity/functions/test_string, parity/sql/test_dml) as potentially triggering tests.
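To prepare the log snippet for the report, the crashed worker ids can be pulled out of a saved run log. The sketch below assumes the pytest output was captured to a file and that crash lines look exactly like the excerpt above; `crashed_workers` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print each worker id that xdist reported as crashed, once, in order of
# first appearance. Reads the log from the file given as $1.
crashed_workers() {
  awk '/^\[gw[0-9]+\] node down: Not properly terminated/ {
         id = substr($0, 2, index($0, "]") - 2)   # text between [ and ]
         if (!seen[id]++) print id
       }' "$1"
}

# Demo on the lines from the excerpt above.
log=$(mktemp)
cat > "$log" <<'EOF'
[gw8] node down: Not properly terminated
replacing crashed worker gw8
[gw10] node down: Not properly terminated
replacing crashed worker gw10
[gw11] node down: Not properly terminated
replacing crashed worker gw11
EOF
crashed_workers "$log"   # prints gw8, gw10, gw11 on separate lines
rm -f "$log"
```

Matching these ids against the per-test lines of a verbose run is one way to narrow down which modules were executing when each worker died.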
Summary
| Cause | Owner | Action |
|---|---|---|
| Worker process crash (“Not properly terminated”) | Likely robin-sparkless (native code in forked workers) | Open robin-sparkless issue; run Robin with fewer workers or -n 0 |
| KeyError: WorkerController in xdist | pytest-xdist (issue #714) | Use fewer workers or serial for Robin to avoid replacement path; track xdist fix |
No change is required inside Sparkless test code; the failures are due to worker crashes and xdist’s handling of them.