Migration from v2 to v3

Versioning Note: The functionality described here now ships as Sparkless 0.0.x–0.3.x in the semver-aligned roadmap. References to “v3.0.0” map directly to the current 0.0.0 baseline release.

Overview

Sparkless v3.0.0 replaces DuckDB with Polars as the default backend. This is a breaking change: review the steps below before upgrading.

Key Changes

Backend Migration

Default Backend: Changed from DuckDB to Polars
Storage Format: Tables now persist as Parquet files instead of DuckDB database files
Thread Safety: Polars is thread-safe by design, so connection locks are no longer needed
No SQL Required: The Polars backend runs DataFrame operations directly instead of generating SQL

Breaking Changes

Default Backend: New sessions use Polars by default
Storage Format: Existing DuckDB databases are not automatically migrated
Configuration: The max_memory and allow_disk_spillover options are ignored by the Polars backend
Dependencies: DuckDB and SQLAlchemy are no longer required; they remain optional and are only needed for the DuckDB backend

Migration Steps

1. Update Dependencies

# Install v3.0.0 (quote the requirement so the shell does not treat ">" as a redirect)
pip install "sparkless>=3.0.0"

# Polars is now a required dependency
# DuckDB and SQLAlchemy are optional (only if using DuckDB backend)

2. Update Code

Most code will work without changes because the Polars backend implements the same interfaces:

# Before (v2.x) - works the same in v3.0.0
from sparkless.sql import SparkSession

spark = SparkSession("MyApp")
df = spark.createDataFrame([{"name": "Alice", "age": 25}])
df.show()

3. Migrate Existing DuckDB Databases (Optional)

Existing DuckDB database files are not migrated automatically. If you need their data under the Polars backend, migrate them yourself:

# Option 1: Use migration script (if provided)
from sparkless.scripts.migrate_duckdb_to_parquet import migrate_database
migrate_database("old_database.duckdb", "new_storage_path")

# Option 2: Manual migration
# Export tables from DuckDB and recreate with Polars backend
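
Option 2 could look roughly like the sketch below. It is not the project's migration script. It assumes the duckdb Python package is installed and that the target layout is {schema}/{table}.parquet under a storage root, as described in the Storage section. The helper names parquet_target, quote_ident and export_duckdb_tables are hypothetical. The sketch does not write the JSON schema files, whose format is not documented here, so the exported tables may still need to be recreated through a Polars-backed session.

```python
from pathlib import Path


def parquet_target(root, schema, table):
    """Return the assumed Parquet path for a table: {root}/{schema}/{table}.parquet."""
    return Path(root) / schema / f"{table}.parquet"


def quote_ident(name):
    """Quote a SQL identifier for DuckDB, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def export_duckdb_tables(db_path, root):
    """Export every base table in a DuckDB file to Parquet; return the written paths."""
    import duckdb  # imported lazily: only needed when actually migrating

    written = []
    con = duckdb.connect(str(db_path), read_only=True)
    try:
        # List user tables (not views) together with their schema names.
        rows = con.execute(
            "SELECT table_schema, table_name FROM information_schema.tables "
            "WHERE table_type = 'BASE TABLE'"
        ).fetchall()
        for schema, table in rows:
            target = parquet_target(root, schema, table)
            target.parent.mkdir(parents=True, exist_ok=True)
            escaped = str(target).replace("'", "''")  # escape quotes in the path literal
            con.execute(
                f"COPY {quote_ident(schema)}.{quote_ident(table)} "
                f"TO '{escaped}' (FORMAT PARQUET)"
            )
            written.append(target)
    finally:
        con.close()
    return written
```

Calling export_duckdb_tables("old_database.duckdb", "new_storage_path") would write one Parquet file per table. Opening the database read-only keeps the original file untouched if the export fails partway.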

4. Update Backend Configuration (If Needed)

If you explicitly configured the DuckDB backend, you may need to update your session setup:

# v2.x
spark = SparkSession.builder \
    .config("spark.sparkless.backend", "duckdb") \
    .config("spark.sparkless.backend.maxMemory", "4GB") \
    .getOrCreate()

# v3.0.0 - Polars is default, no memory config needed
spark = SparkSession("MyApp")  # Uses Polars automatically

# Or explicitly set Polars
spark = SparkSession.builder \
    .config("spark.sparkless.backend", "polars") \
    .getOrCreate()
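
When the same code has to run against either backend, the configuration can be built in one place. The sketch below is illustrative only: backend_config and apply_config are hypothetical helpers, not part of Sparkless. It uses only the two configuration keys shown above, and leaves out the memory setting for Polars because that backend ignores it.

```python
def backend_config(backend="polars", max_memory=None):
    """Build session config entries for the chosen backend.

    The memory setting is only emitted for DuckDB, since the Polars
    backend ignores it.
    """
    if backend not in ("polars", "duckdb"):
        raise ValueError(f"unsupported backend: {backend!r}")
    config = {"spark.sparkless.backend": backend}
    if backend == "duckdb" and max_memory is not None:
        config["spark.sparkless.backend.maxMemory"] = max_memory
    return config


def apply_config(builder, config):
    """Apply each key/value pair to a SparkSession-style builder and return it."""
    for key, value in config.items():
        builder = builder.config(key, value)
    return builder
```

With these helpers, apply_config(SparkSession.builder, backend_config("duckdb", "4GB")).getOrCreate() would reproduce the v2.x setup, and backend_config() alone yields the Polars default.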

5. Use DuckDB Backend (If Needed)

If you need DuckDB for specific features, you can still use it:

# v3.0.0 - Use DuckDB backend explicitly
spark = SparkSession.builder \
    .config("spark.sparkless.backend", "duckdb") \
    .config("spark.sparkless.backend.maxMemory", "4GB") \
    .getOrCreate()

Note: DuckDB backend requires duckdb and duckdb-engine packages to be installed.
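
A quick pre-flight check can report missing packages before a session is created. This is a minimal sketch using only the standard library. BACKEND_MODULES, missing_modules and check_backend are hypothetical names, and the mapping assumes the duckdb-engine package installs the importable module duckdb_engine.

```python
import importlib.util

# Importable module names per backend (assumed mapping, not taken from Sparkless).
BACKEND_MODULES = {
    "polars": ["polars"],
    "duckdb": ["duckdb", "duckdb_engine"],
}


def missing_modules(modules):
    """Return the names from `modules` that cannot be found for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def check_backend(backend):
    """Raise ImportError listing any modules the chosen backend still needs."""
    missing = missing_modules(BACKEND_MODULES[backend])
    if missing:
        raise ImportError(
            f"{backend} backend is missing modules: {', '.join(missing)}"
        )
```

Calling check_backend("duckdb") at startup surfaces a single clear error instead of a failure deep inside session creation.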

What Changed Under the Hood

Storage

Before: DuckDB database files (.duckdb)
After: Parquet files ({schema}/{table}.parquet) + JSON schema files ({table}.schema.json)
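
Because tables are now plain files, the contents of a storage directory can be inspected without opening a session. The sketch below assumes the storage root contains one subdirectory per schema, with each table stored as {schema}/{table}.parquet. list_parquet_tables is a hypothetical helper.

```python
from pathlib import Path


def list_parquet_tables(root):
    """Return sorted (schema, table) pairs found as {root}/{schema}/{table}.parquet."""
    return sorted((p.parent.name, p.stem) for p in Path(root).glob("*/*.parquet"))
```

This can be useful after a migration, to confirm that every expected table was written.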

Query Execution

Before: SQL generation via SQLAlchemy → DuckDB
After: Polars expressions → Polars DataFrame operations

Threading

Before: Connection locks required (_connection_lock)
After: Thread-safe by design (Polars uses Rayon internally)

Performance Improvements

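Thread Safety: No connection-lock overhead
Lazy Evaluation: Polars can optimize a lazy query as a whole before running it (for example, by pushing filters and column selection down to the scan)
Memory Efficiency: Polars stores data in the columnar Apache Arrow format
Faster Operations: Polars is optimized for DataFrame operations and runs them in parallel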

Removed Features

max_memory configuration option: has no effect on the Polars backend (Polars handles memory automatically)
allow_disk_spillover configuration option: has no effect on the Polars backend (not needed with Polars)
SQL-specific optimizations (replaced with Polars optimizations)

Backward Compatibility

DuckDB Backend: Still available as optional backend
API Compatibility: All PySpark API methods remain the same
Storage Interface: Same IStorageManager protocol

Troubleshooting

Import Errors

If you see ModuleNotFoundError: No module named 'polars':

pip install "polars>=0.20.0"

Threading Issues

If you had threading issues with v2.x, they should be resolved in v3.0.0 with Polars.
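
To confirm this for your own workload, a small concurrency smoke test can help. The sketch below uses only the standard library, and run_concurrently is a hypothetical helper. The task in the trailing comment assumes a session named spark and the PySpark-style createDataFrame and count methods.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(task, workers=8, repeats=4):
    """Call task(i) from several threads; return (results, errors)."""
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task, i) for i in range(workers * repeats)]
        for future in futures:
            try:
                results.append(future.result())
            except Exception as exc:  # collect failures instead of stopping at the first
                errors.append(exc)
    return results, errors


# Example task (assumes an existing Sparkless session named `spark`):
#   def task(i):
#       return spark.createDataFrame([{"id": i}]).count()
#   results, errors = run_concurrently(task)
```

An empty errors list after repeated runs is a reasonable sign that the locking problems from v2.x are gone. It is not a proof of thread safety.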

Storage Migration

If you need to migrate existing DuckDB databases, follow step 3 under Migration Steps: use the migration script (if available) or export the tables manually. If that fails, contact support.

Testing

After migration, run your test suite:

pytest tests/

All tests should pass with the Polars backend. If you encounter issues, you can temporarily switch back to the DuckDB backend:

spark = SparkSession.builder \
    .config("spark.sparkless.backend", "duckdb") \
    .getOrCreate()
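
To switch backends for a test run without editing code, the backend name can be read from the environment. This is a sketch under stated assumptions: the variable name SPARKLESS_TEST_BACKEND and the helper backend_from_env are hypothetical and are not recognized by Sparkless itself.

```python
import os


def backend_from_env(environ=None, default="polars"):
    """Read the test backend from SPARKLESS_TEST_BACKEND (hypothetical variable)."""
    env = os.environ if environ is None else environ
    backend = env.get("SPARKLESS_TEST_BACKEND", default).strip().lower()
    if backend not in ("polars", "duckdb"):
        raise ValueError(f"unsupported backend: {backend!r}")
    return backend
```

A shared test fixture could pass backend_from_env() as the value of spark.sparkless.backend when building the session, so that running the suite with SPARKLESS_TEST_BACKEND=duckdb exercises the fallback path.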

Questions?

If you encounter issues during migration, please:

Check this guide
Review the changelog
Open an issue on GitHub

Summary

✅ Default backend: Polars (thread-safe, high-performance)
✅ Storage format: Parquet files
✅ Breaking changes: Default backend, storage format
✅ Backward compatibility: DuckDB backend still available
✅ API compatibility: All PySpark APIs remain the same