Migration from v2 to v3

Versioning Note: The functionality described here now ships as Sparkless 0.0.x0.3.x in the semver-aligned roadmap. References to “v3.0.0” map directly to the current 0.0.0 baseline release.

Overview

Sparkless v3.0.0 introduces a complete migration from DuckDB to Polars as the default backend. This is a breaking change that requires attention when upgrading.

Key Changes

Backend Migration

  • Default Backend: Changed from DuckDB to Polars

  • Storage Format: Tables now persist as Parquet files instead of DuckDB database files

  • Thread Safety: Polars is thread-safe by design - no more connection locks or threading issues

  • No SQL Required: Polars uses DataFrame operations instead of SQL generation

Breaking Changes

  1. Default Backend: New sessions use Polars by default

  2. Storage Format: Existing DuckDB databases are not automatically migrated

  3. Configuration: max_memory and allow_disk_spillover options are ignored for Polars backend

  4. Dependencies: DuckDB and SQLAlchemy are no longer required (optional for backward compatibility)

Migration Steps

1. Update Dependencies

# Install v3.0.0
pip install sparkless>=3.0.0

# Polars is now a required dependency
# DuckDB and SQLAlchemy are optional (only if using DuckDB backend)

2. Update Code

Most code will work without changes since Polars backend implements the same interfaces:

# Before (v2.x) - works the same in v3.0.0
from sparkless.sql import SparkSession

spark = SparkSession("MyApp")
df = spark.createDataFrame([{"name": "Alice", "age": 25}])
df.show()

3. Migrate Existing DuckDB Databases (Optional)

If you have existing DuckDB database files, you’ll need to migrate them:

# Option 1: Use migration script (if provided)
from sparkless.scripts.migrate_duckdb_to_parquet import migrate_database
migrate_database("old_database.duckdb", "new_storage_path")

# Option 2: Manual migration
# Export tables from DuckDB and recreate with Polars backend

4. Update Backend Configuration (If Needed)

If you explicitly configured DuckDB backend, you may need to update:

# v2.x
spark = SparkSession.builder \
    .config("spark.sparkless.backend", "duckdb") \
    .config("spark.sparkless.backend.maxMemory", "4GB") \
    .getOrCreate()

# v3.0.0 - Polars is default, no memory config needed
spark = SparkSession("MyApp")  # Uses Polars automatically

# Or explicitly set Polars
spark = SparkSession.builder \
    .config("spark.sparkless.backend", "polars") \
    .getOrCreate()

5. Use DuckDB Backend (If Needed)

If you need DuckDB for specific features, you can still use it:

# v3.0.0 - Use DuckDB backend explicitly
spark = SparkSession.builder \
    .config("spark.sparkless.backend", "duckdb") \
    .config("spark.sparkless.backend.maxMemory", "4GB") \
    .getOrCreate()

Note: DuckDB backend requires duckdb and duckdb-engine packages to be installed.

What Changed Under the Hood

Storage

  • Before: DuckDB database files (.duckdb)

  • After: Parquet files ({schema}/{table}.parquet) + JSON schema files ({table}.schema.json)

Query Execution

  • Before: SQL generation via SQLAlchemy → DuckDB

  • After: Polars expressions → Polars DataFrame operations

Threading

  • Before: Connection locks required (_connection_lock)

  • After: Thread-safe by design (Polars uses Rayon internally)

Performance Improvements

  • Thread Safety: No locking overhead

  • Lazy Evaluation: Polars lazy evaluation provides better optimization

  • Memory Efficiency: Polars handles memory more efficiently

  • Faster Operations: Polars is optimized for DataFrame operations

Removed Features

  • max_memory configuration option (Polars handles memory automatically)

  • allow_disk_spillover configuration option (not needed with Polars)

  • SQL-specific optimizations (replaced with Polars optimizations)

Backward Compatibility

  • DuckDB Backend: Still available as optional backend

  • API Compatibility: All PySpark API methods remain the same

  • Storage Interface: Same IStorageManager protocol

Troubleshooting

Import Errors

If you see ModuleNotFoundError: No module named 'polars':

pip install polars>=0.20.0

Threading Issues

If you had threading issues with v2.x, they should be resolved in v3.0.0 with Polars.

Storage Migration

If you need to migrate existing DuckDB databases, contact support or use the migration script (if available).

Testing

After migration, run your test suite:

pytest tests/

All tests should pass with Polars backend. If you encounter issues, you can temporarily use DuckDB backend:

spark = SparkSession.builder \
    .config("spark.sparkless.backend", "duckdb") \
    .getOrCreate()

Questions?

If you encounter issues during migration, please:

  1. Check this guide

  2. Review the changelog

  3. Open an issue on GitHub

Summary

  • Default backend: Polars (thread-safe, high-performance)

  • Storage format: Parquet files

  • Breaking changes: Default backend, storage format

  • Backward compatibility: DuckDB backend still available

  • API compatibility: All PySpark APIs remain the same