# Configuration Sparkless configuration is managed via SparkSession constructor options and the session builder. ## Basic Configuration ```python from sparkless.sql import SparkSession spark = SparkSession( validation_mode="relaxed", # strict | relaxed | minimal enable_type_coercion=True, ) ``` Key settings: - validation_mode: controls strictness of schema/data checks - enable_type_coercion: attempts to coerce types during DataFrame creation ## Case Sensitivity Configuration Sparkless supports PySpark-compatible case sensitivity configuration via `spark.sql.caseSensitive`: ```python # Default: case-insensitive (matches PySpark default) spark = SparkSession("MyApp") assert spark.conf.is_case_sensitive() == False # Enable case-sensitive mode spark = SparkSession.builder \ .config("spark.sql.caseSensitive", "true") \ .getOrCreate() assert spark.conf.is_case_sensitive() == True ``` **Default Behavior (case-insensitive):** - Column names can be referenced in any case (e.g., `df.select("name")` matches `"Name"` column) - Output preserves original column names from schema - Matches PySpark's default behavior **Case-Sensitive Mode:** - Column names must match exactly (case-sensitive) - Useful when working with data sources that have case-sensitive schemas - Can be enabled via `spark.conf.set("spark.sql.caseSensitive", "true")` or builder config **Ambiguity Detection:** - When multiple columns differ only by case (e.g., `["Name", "name"]`) and case-insensitive mode is enabled, resolution raises `AnalysisException` due to ambiguity - This matches PySpark behavior Example: ```python from sparkless.sql import SparkSession # Case-insensitive (default) spark = SparkSession("TestApp") df = spark.createDataFrame([{"Name": "Alice", "Age": 25}]) result = df.select("name").collect() # Works - resolves to "Name" assert result[0]["Name"] == "Alice" # Output uses original case "Name" # Case-sensitive spark.conf.set("spark.sql.caseSensitive", "true") result = df.select("Name").collect() # Must use exact case # df.select("name") would raise column not found error ``` ## Backend Configuration (v3.0.0+) ### Default Backend (Polars) ```python # Polars is the default backend in v3.0.0+ spark = SparkSession("MyApp") ``` ### Explicit Backend Selection ```python # Use Polars explicitly spark = SparkSession.builder \ .config("spark.sparkless.backend", "polars") \ .getOrCreate() # Use DuckDB backend (legacy, requires duckdb package) spark = SparkSession.builder \ .config("spark.sparkless.backend", "duckdb") \ .config("spark.sparkless.backend.maxMemory", "4GB") \ .config("spark.sparkless.backend.allowDiskSpillover", True) \ .getOrCreate() # Use memory backend spark = SparkSession.builder \ .config("spark.sparkless.backend", "memory") \ .getOrCreate() ``` ### Backend-Specific Options **Polars Backend (default):** - No configuration needed - Polars handles memory and performance automatically - Thread-safe by design - Uses Parquet files for persistence **DuckDB Backend (legacy):** - `spark.sparkless.backend.maxMemory`: Maximum memory (e.g., "1GB", "4GB") - `spark.sparkless.backend.allowDiskSpillover`: Allow disk spillover when memory is full **Note**: `maxMemory` and `allowDiskSpillover` options are ignored for Polars backend. ## Performance knobs You can tune mock behaviour per pipeline using the following. - **Lazy vs eager evaluation** – `SparkSession(..., enable_lazy_evaluation=True)` (default) defers execution until an action (`collect`, `show`, `count`, etc.). Set to `False` for legacy eager behaviour. - **Logical plan path** – Set `spark.conf.set("spark.sparkless.useLogicalPlan", "true")` (or builder config) to use the serialized logical plan path when the backend supports it (e.g. Robin). This can change execution strategy and performance. - **Backend selection** – The Polars backend is the default and is optimized for in-memory pipelines. For very large or I/O-bound workloads, consider splitting data or using the appropriate backend (see [Backend selection](../backend_selection.md)). - **Profiling** – Optional profiling utilities are documented in [Profiling](../performance/profiling.md). Use them to identify hot paths before tuning.