# Threading Sparkless's Polars backend is **thread-safe by design**, eliminating threading issues that were present with DuckDB. Polars uses Rayon (Rust's data parallelism library) internally, making it safe for concurrent operations. ## Thread Safety with Polars Polars backend provides native thread safety: 1. **No Connection Locks**: Polars is thread-safe by design - no need for connection locks or thread-local handling 2. **Concurrent Operations**: Multiple threads can safely operate on DataFrames simultaneously 3. **Parallel Execution**: Polars automatically parallelizes operations across threads when beneficial ## Thread-Safe Operations All operations with Polars backend are thread-safe: ```python from sparkless.sql import SparkSession from concurrent.futures import ThreadPoolExecutor spark = SparkSession("MyApp") def create_table_in_thread(schema_name, table_name): """Create a table in a worker thread.""" # No special handling needed - Polars is thread-safe df = spark.createDataFrame([{"id": 1, "name": "Alice"}]) df.write.saveAsTable(f"{schema_name}.{table_name}") # Use ThreadPoolExecutor - no threading issues! with ThreadPoolExecutor(max_workers=8) as executor: futures = [ executor.submit(create_table_in_thread, f"schema_{i}", f"table_{i}") for i in range(10) ] for future in futures: future.result() # All operations succeed without locks ``` ## How It Works Polars provides thread safety through: 1. **Rust-Based Core**: Polars is built on Rust with Rayon for data parallelism 2. **No Shared Mutable State**: Each operation works on immutable DataFrames 3. **Automatic Parallelization**: Polars parallelizes operations internally when safe 4. **No Connection Management**: Unlike SQL databases, Polars doesn't require connection pooling ## Best Practices ### Using with ThreadPoolExecutor When using `ThreadPoolExecutor` or similar parallel execution frameworks: ```python from concurrent.futures import ThreadPoolExecutor def process_data(schema_name, data): """Process data in a worker thread.""" spark = SparkSession("MyApp") df = spark.createDataFrame(data) # No threading concerns - Polars handles it df.write.saveAsTable(f"{schema_name}.results") with ThreadPoolExecutor(max_workers=8) as executor: futures = [ executor.submit(process_data, f"schema_{i}", data_list[i]) for i in range(10) ] for future in futures: future.result() ``` ### Using with pytest-xdist When running tests with `pytest-xdist`: ```bash # Run tests in parallel with 8 workers - no threading issues! pytest -n 8 tests/ # Polars backend is thread-safe - no special configuration needed ``` No special configuration is needed - Sparkless with Polars backend handles threading automatically. ## Comparison with DuckDB Backend ### DuckDB Backend (v2.x and earlier) - **Thread-local connections**: Each thread needed its own connection - **Schema visibility issues**: Schemas created in one thread weren't visible to others - **Connection locks required**: Required `_connection_lock` to prevent race conditions - **Retry logic needed**: Schema creation needed retries with exponential backoff ### Polars Backend (v3.0.0+) - **Thread-safe by design**: No connection management needed - **No schema visibility issues**: Polars doesn't use SQL schemas - **No locks required**: Operations are inherently thread-safe - **No retry logic**: No race conditions to handle ## Performance Considerations Polars threading provides excellent performance: - **Automatic Parallelization**: Polars parallelizes operations across available CPU cores - **No Lock Overhead**: No connection locks means lower overhead - **Better Scalability**: Can safely use many threads without contention ## Example: Parallel Pipeline Execution ```python from sparkless.sql import SparkSession from concurrent.futures import ThreadPoolExecutor def run_pipeline_step(step_id, input_data): """Run a pipeline step in a worker thread.""" spark = SparkSession("Pipeline") # Create input DataFrame df = spark.createDataFrame(input_data) # Process data result = df.select("id", "value").filter("value > 10") # Save to table (thread-safe - no special handling needed) result.write.saveAsTable(f"step_{step_id}.results") return result.count() # Run multiple pipeline steps in parallel - no threading issues! with ThreadPoolExecutor(max_workers=4) as executor: futures = [ executor.submit(run_pipeline_step, i, data[i]) for i in range(10) ] results = [future.result() for future in futures] ``` ## Troubleshooting ### No Threading Issues! With Polars backend, threading issues should be completely eliminated. If you encounter any issues: 1. **Verify Backend**: Ensure you're using Polars backend (default in v3.0.0+) ```python from sparkless.backend.factory import BackendFactory backend_type = BackendFactory.get_backend_type(spark._storage) assert backend_type == "polars" ``` 2. **Check Polars Version**: Ensure you have a recent version of Polars ```bash pip install polars>=0.20.0 ``` 3. **No Special Configuration**: No threading-related configuration needed ## Migration from DuckDB If you're migrating from DuckDB backend and had threading issues: - **No more locks**: Remove any threading workarounds - **No retry logic**: Remove schema creation retries - **Simpler code**: Threading concerns are handled automatically ## See Also - [Configuration Guide](configuration.md) - Configuration options for Sparkless - [Pytest Integration Guide](pytest_integration.md) - Using Sparkless with pytest - [Memory Management Guide](memory_management.md) - Managing memory in parallel contexts - [Migration Guide](../migration_from_v2_to_v3.md) - Migrating from DuckDB to Polars