# Migration from v2 to v3 > **Versioning Note:** The functionality described here now ships as Sparkless `0.0.x`–`0.3.x` in the semver-aligned roadmap. References to “v3.0.0” map directly to the current `0.0.0` baseline release. ## Overview Sparkless v3.0.0 introduces a **complete migration from DuckDB to Polars** as the default backend. This is a **breaking change** that requires attention when upgrading. ## Key Changes ### Backend Migration - **Default Backend**: Changed from DuckDB to Polars - **Storage Format**: Tables now persist as Parquet files instead of DuckDB database files - **Thread Safety**: Polars is thread-safe by design - no more connection locks or threading issues - **No SQL Required**: Polars uses DataFrame operations instead of SQL generation ### Breaking Changes 1. **Default Backend**: New sessions use Polars by default 2. **Storage Format**: Existing DuckDB databases are not automatically migrated 3. **Configuration**: `max_memory` and `allow_disk_spillover` options are ignored for Polars backend 4. **Dependencies**: DuckDB and SQLAlchemy are no longer required (optional for backward compatibility) ## Migration Steps ### 1. Update Dependencies ```bash # Install v3.0.0 pip install sparkless>=3.0.0 # Polars is now a required dependency # DuckDB and SQLAlchemy are optional (only if using DuckDB backend) ``` ### 2. Update Code Most code will work without changes since Polars backend implements the same interfaces: ```python # Before (v2.x) - works the same in v3.0.0 from sparkless.sql import SparkSession spark = SparkSession("MyApp") df = spark.createDataFrame([{"name": "Alice", "age": 25}]) df.show() ``` ### 3. Migrate Existing DuckDB Databases (Optional) If you have existing DuckDB database files, you'll need to migrate them: ```python # Option 1: Use migration script (if provided) from sparkless.scripts.migrate_duckdb_to_parquet import migrate_database migrate_database("old_database.duckdb", "new_storage_path") # Option 2: Manual migration # Export tables from DuckDB and recreate with Polars backend ``` ### 4. Update Backend Configuration (If Needed) If you explicitly configured DuckDB backend, you may need to update: ```python # v2.x spark = SparkSession.builder \ .config("spark.sparkless.backend", "duckdb") \ .config("spark.sparkless.backend.maxMemory", "4GB") \ .getOrCreate() # v3.0.0 - Polars is default, no memory config needed spark = SparkSession("MyApp") # Uses Polars automatically # Or explicitly set Polars spark = SparkSession.builder \ .config("spark.sparkless.backend", "polars") \ .getOrCreate() ``` ### 5. Use DuckDB Backend (If Needed) If you need DuckDB for specific features, you can still use it: ```python # v3.0.0 - Use DuckDB backend explicitly spark = SparkSession.builder \ .config("spark.sparkless.backend", "duckdb") \ .config("spark.sparkless.backend.maxMemory", "4GB") \ .getOrCreate() ``` **Note**: DuckDB backend requires `duckdb` and `duckdb-engine` packages to be installed. ## What Changed Under the Hood ### Storage - **Before**: DuckDB database files (`.duckdb`) - **After**: Parquet files (`{schema}/{table}.parquet`) + JSON schema files (`{table}.schema.json`) ### Query Execution - **Before**: SQL generation via SQLAlchemy → DuckDB - **After**: Polars expressions → Polars DataFrame operations ### Threading - **Before**: Connection locks required (`_connection_lock`) - **After**: Thread-safe by design (Polars uses Rayon internally) ## Performance Improvements - **Thread Safety**: No locking overhead - **Lazy Evaluation**: Polars lazy evaluation provides better optimization - **Memory Efficiency**: Polars handles memory more efficiently - **Faster Operations**: Polars is optimized for DataFrame operations ## Removed Features - `max_memory` configuration option (Polars handles memory automatically) - `allow_disk_spillover` configuration option (not needed with Polars) - SQL-specific optimizations (replaced with Polars optimizations) ## Backward Compatibility - **DuckDB Backend**: Still available as optional backend - **API Compatibility**: All PySpark API methods remain the same - **Storage Interface**: Same `IStorageManager` protocol ## Troubleshooting ### Import Errors If you see `ModuleNotFoundError: No module named 'polars'`: ```bash pip install polars>=0.20.0 ``` ### Threading Issues If you had threading issues with v2.x, they should be resolved in v3.0.0 with Polars. ### Storage Migration If you need to migrate existing DuckDB databases, contact support or use the migration script (if available). ## Testing After migration, run your test suite: ```bash pytest tests/ ``` All tests should pass with Polars backend. If you encounter issues, you can temporarily use DuckDB backend: ```python spark = SparkSession.builder \ .config("spark.sparkless.backend", "duckdb") \ .getOrCreate() ``` ## Questions? If you encounter issues during migration, please: 1. Check this guide 2. Review the [changelog](../CHANGELOG.md) 3. Open an issue on GitHub ## Summary - ✅ **Default backend**: Polars (thread-safe, high-performance) - ✅ **Storage format**: Parquet files - ✅ **Breaking changes**: Default backend, storage format - ✅ **Backward compatibility**: DuckDB backend still available - ✅ **API compatibility**: All PySpark APIs remain the same