# Getting Started


> **Compatibility Snapshot:** This guide targets Sparkless `3.31.0`, which provides parity with PySpark 3.2–3.5 and ships with 2400+ passing regression tests.

## Installation

Install Sparkless using pip:

```bash
pip install sparkless
```

For development with testing tools:

```bash
pip install sparkless[dev]
```

## Quick Start

### Basic Example

```python
from sparkless.sql import SparkSession, functions as F

# Create session
spark = SparkSession("MyApp")

# Create DataFrame
data = [
    {"id": 1, "name": "Alice", "age": 25},
    {"id": 2, "name": "Bob", "age": 30},
]
df = spark.createDataFrame(data)

# Operations work just like PySpark
result = df.filter(F.col("age") > 25).select("name")
print(result.collect())  # [Row(name='Bob')]
```

### Drop-in PySpark Replacement

Sparkless is designed to be a drop-in replacement for PySpark; only the import changes:

```python
# Before (PySpark)
from pyspark.sql import SparkSession

# After (Sparkless)
from sparkless.sql import SparkSession
```

That's it! Your existing PySpark code works unchanged.
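
If the same code needs to run against either library (for example Sparkless in unit tests and PySpark in integration tests), the import swap can be centralized. The sketch below is one possible approach, not something Sparkless ships: the `SPARK_BACKEND` environment variable and both helper functions are hypothetical names introduced for illustration.

```python
import importlib
import os
from types import ModuleType
from typing import Optional

# Hypothetical: the two package names used in the import swap above.
SUPPORTED_BACKENDS = ("sparkless", "pyspark")


def sql_module_name(backend: str) -> str:
    """Map a backend name to its `.sql` module path, rejecting unknown names."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unknown backend {backend!r}; expected one of {SUPPORTED_BACKENDS}")
    return f"{backend}.sql"


def load_sql_module(backend: Optional[str] = None) -> ModuleType:
    """Import `<backend>.sql`, defaulting to the SPARK_BACKEND env var (assumed name)."""
    chosen = backend or os.environ.get("SPARK_BACKEND", "sparkless")
    return importlib.import_module(sql_module_name(chosen))
```

With this in place, `SparkSession = load_sql_module().SparkSession` picks the class from whichever library is selected, provided that library is installed.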

## Core Features

### DataFrame Operations

```python
from sparkless.sql import SparkSession, functions as F

spark = SparkSession("Example")
data = [
    {"id": 1, "name": "Alice", "dept": "Engineering", "salary": 80000, "manager_id": 3},
    {"id": 2, "name": "Bob", "dept": "Sales", "salary": 75000, "manager_id": None},
    {"id": 3, "name": "Charlie", "dept": "Engineering", "salary": 90000, "manager_id": 3},
]
df = spark.createDataFrame(data)

# Filter
high_earners = df.filter(F.col("salary") > 75000)

# Null-safe equality (for comparing columns that may contain NULL)
# NULL <=> NULL returns True, NULL <=> non-NULL returns False
# Here: rows whose manager_id equals their own id (Charlie)
self_managed = df.filter(F.col("id").eqNullSafe(F.col("manager_id")))

# Select
names = df.select("name", "dept")

# Aggregations
dept_avg = df.groupBy("dept").avg("salary")
```
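
To make the null-safe equality comment above concrete, here is a small pure-Python sketch of the two comparison semantics, with `None` standing in for SQL `NULL`. The function names `sql_equals` and `eq_null_safe` are illustrative only; this is not how Sparkless or PySpark implement the operators.

```python
from typing import Any, Optional


def sql_equals(left: Any, right: Any) -> Optional[bool]:
    """Ordinary SQL `=`: any comparison involving NULL yields NULL (None)."""
    if left is None or right is None:
        return None
    return left == right


def eq_null_safe(left: Any, right: Any) -> bool:
    """Null-safe `<=>`: always returns True or False, never NULL."""
    if left is None or right is None:
        return left is None and right is None
    return left == right


# Compare both operators on the four interesting cases.
pairs = [(1, 1), (1, 2), (None, 1), (None, None)]
comparison = [(l, r, sql_equals(l, r), eq_null_safe(l, r)) for l, r in pairs]
for left, right, plain, null_safe in comparison:
    print(f"{left!r} vs {right!r}: '=' gives {plain!r}, '<=>' gives {null_safe!r}")
```

Because a filter keeps only rows where the condition is true, a plain `=` comparison silently drops rows in which either side is `NULL`, while the null-safe form keeps rows where both sides are `NULL`.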

### Window Functions

```python
from sparkless.sql import Window, functions as F

# Ranking within departments
window_spec = Window.partitionBy("dept").orderBy(F.desc("salary"))
ranked = df.select(
    "name",
    "dept",
    "salary",
    F.row_number().over(window_spec).alias("rank")
)
```
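
As a rough mental model of what `row_number()` over this window computes, the pure-Python sketch below partitions the three sample employees by `dept`, orders each partition by descending `salary`, and numbers the rows from 1. The helper `row_number_by_dept` is a hypothetical name; this illustrates the semantics only, not the engine's implementation.

```python
from collections import defaultdict
from typing import Any, Dict, List

employees = [
    {"name": "Alice", "dept": "Engineering", "salary": 80000},
    {"name": "Bob", "dept": "Sales", "salary": 75000},
    {"name": "Charlie", "dept": "Engineering", "salary": 90000},
]


def row_number_by_dept(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Number rows 1..n within each dept, highest salary first."""
    partitions: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        partitions[row["dept"]].append(row)  # partitionBy("dept")

    ranked = []
    # Partitions are visited in sorted order only to make this sketch deterministic.
    for dept in sorted(partitions):
        ordered = sorted(partitions[dept], key=lambda r: r["salary"], reverse=True)
        for position, row in enumerate(ordered, start=1):  # row_number()
            ranked.append({**row, "rank": position})
    return ranked


ranked_rows = row_number_by_dept(employees)
for row in ranked_rows:
    print(row["dept"], row["name"], row["rank"])
```

Charlie gets rank 1 and Alice rank 2 within Engineering, and Bob gets rank 1 within Sales, because numbering restarts in every partition.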

### SQL Queries

```python
# Register the DataFrame as a temporary view so it can be queried with spark.sql()
df.createOrReplaceTempView("employees")
```

### Storage Management

Sparkless provides two ways to manage databases and tables:

**Option 1: SQL Commands (PySpark-Compatible - Recommended)**
```python
# Works in both sparkless and PySpark
spark.sql("CREATE DATABASE IF NOT EXISTS test_db")
spark.sql("CREATE TABLE test_db.users (name STRING, age INT)")
spark.sql("INSERT INTO test_db.users VALUES ('Alice', 25), ('Bob', 30)")
```

**Option 2: Storage API (Sparkless-Specific)**
```python
# Convenient API, but sparkless-specific
from sparkless.sql.types import StructType, StructField, StringType, IntegerType

spark._storage.create_schema("test_db")
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
spark._storage.create_table("test_db", "users", schema)
spark._storage.insert_data("test_db", "users", [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 30}
])
```

**Note:** For maximum compatibility with PySpark, use SQL commands. The `spark._storage` API is a Sparkless convenience feature that doesn't exist in PySpark.

### Run SQL Queries

```python
# Query the "employees" temporary view registered with createOrReplaceTempView
result = spark.sql("SELECT name, salary FROM employees WHERE salary > 80000")
result.show()
```

## Testing with Sparkless

### Unit Test Example

```python
import pytest
from sparkless.sql import SparkSession, functions as F

def test_data_transformation():
    """Test DataFrame transformation logic."""
    spark = SparkSession("TestApp")
    
    # Test data
    data = [{"value": 10}, {"value": 20}, {"value": 30}]
    df = spark.createDataFrame(data)
    
    # Apply transformation
    result = df.filter(F.col("value") > 15)
    
    # Assertions
    assert result.count() == 2
    rows = result.collect()
    assert rows[0]["value"] == 20
    assert rows[1]["value"] == 30

def test_aggregation():
    """Test aggregation logic."""
    spark = SparkSession("TestApp")
    
    data = [
        {"category": "A", "value": 100},
        {"category": "A", "value": 200},
        {"category": "B", "value": 150},
    ]
    df = spark.createDataFrame(data)
    
    # Group and aggregate
    result = df.groupBy("category").sum("value")
    
    # Verify results
    assert result.count() == 2
```

## Lazy Evaluation

Sparkless mirrors PySpark's lazy evaluation model:

```python
# Transformations are queued (not executed)
filtered = df.filter(F.col("age") > 25)
selected = filtered.select("name")
# No execution yet!

# Actions trigger execution
rows = selected.collect()  # ← Executes now
count = selected.count()   # ← Executes now
```

Control evaluation mode:

```python
# Lazy (default, recommended)
spark = SparkSession("App", enable_lazy_evaluation=True)

# Eager (for legacy tests)
spark = SparkSession("App", enable_lazy_evaluation=False)
```
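
The difference between the two modes can be illustrated with a toy class in plain Python. This is a minimal sketch under stated assumptions: `TinyFrame` is a hypothetical name, and nothing here reflects how Sparkless actually queues or executes operations.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]
Step = Callable[[List[Row]], List[Row]]


class TinyFrame:
    """Hypothetical illustration of lazy vs. eager mode; not a Sparkless class."""

    def __init__(self, rows: List[Row], lazy: bool = True,
                 pending: Optional[List[Step]] = None) -> None:
        self._rows = list(rows)
        self._lazy = lazy
        self._pending: List[Step] = list(pending or [])

    def _transform(self, step: Step) -> "TinyFrame":
        if self._lazy:
            # Lazy: remember the step, touch no data yet.
            return TinyFrame(self._rows, True, self._pending + [step])
        # Eager: run the step immediately.
        return TinyFrame(step(self._rows), False)

    def filter(self, predicate: Callable[[Row], bool]) -> "TinyFrame":
        return self._transform(lambda rows: [r for r in rows if predicate(r)])

    def select(self, *names: str) -> "TinyFrame":
        return self._transform(lambda rows: [{n: r[n] for n in names} for r in rows])

    def collect(self) -> List[Row]:
        # Action: replay every queued step in order.
        rows = self._rows
        for step in self._pending:
            rows = step(rows)
        return rows


seen: List[str] = []


def older_than_25(row: Row) -> bool:
    seen.append(row["name"])  # record each time the predicate really runs
    return row["age"] > 25


people = [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]

queued = TinyFrame(people).filter(older_than_25).select("name")
ran_before_collect = len(seen)  # nothing has executed yet
rows = queued.collect()         # the predicate runs now, once per row

seen.clear()
TinyFrame(people, lazy=False).filter(older_than_25)
ran_in_eager_mode = len(seen)   # eager mode ran the filter immediately
```

In the lazy case the predicate is not called until `collect()`, which is why errors in a transformation can surface only when an action runs. In the eager case the same error would appear at the line that defines the transformation.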

## Performance

Sparkless provides significant speed improvements over PySpark, most visibly in session startup:

| Operation | PySpark | Sparkless | Speedup |
|-----------|---------|------------|---------|
| Session Creation | 30-45s | 0.1s | 300x |
| Simple Query | 2-5s | 0.01s | 200x |
| Full Test Suite | 5-10min | 30-60s | 10x |
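
The speedup column can be sanity-checked from the timing ranges in the table. The sketch below derives a conservative bound (fastest PySpark time against slowest Sparkless time) and an optimistic bound (the reverse) for each row; it uses only the figures listed above and makes no new measurements. The helper name `speedup_range` is illustrative.

```python
from typing import Dict, Tuple

Range = Tuple[float, float]  # (fastest, slowest) in seconds

# Timings copied from the table above, converted to seconds.
timings: Dict[str, Tuple[Range, Range]] = {
    "Session Creation": ((30, 45), (0.1, 0.1)),
    "Simple Query": ((2, 5), (0.01, 0.01)),
    "Full Test Suite": ((5 * 60, 10 * 60), (30, 60)),
}


def speedup_range(pyspark: Range, sparkless: Range) -> Tuple[int, int]:
    """Return (conservative, optimistic) speedup, rounded to whole multiples."""
    py_fast, py_slow = pyspark
    sl_fast, sl_slow = sparkless
    return round(py_fast / sl_slow), round(py_slow / sl_fast)


ranges = {name: speedup_range(py, sl) for name, (py, sl) in timings.items()}
for name, (low, high) in ranges.items():
    print(f"{name}: {low}x to {high}x")
```

The single figures in the table fall inside these ranges: 300x and 200x are the conservative ends for session creation and simple queries, and 10x sits in the middle of the 5x to 20x range for a full test suite.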

## Advanced: Session-aware literals and schema tracking

Sparkless tracks the **active SparkSession** for functions that depend on it (e.g. `F.col`, `F.lit`, `F.expr`). When you call these, the engine uses the current session for catalog and configuration (e.g. case sensitivity, current database).

**Session-aware helpers** (e.g. `current_catalog`, `current_database`, `current_schema`, `current_user`) and **schema tracking** in the Polars storage backend ensure that operations like `setCurrentDatabase` take effect for subsequent SQL and DataFrame operations. In practice this means two things: create the session before calling `F.col` or `F.lit`, and, when working with multiple databases, set the current one with `spark.catalog.setCurrentDatabase("db_name")`. See [Configuration](guides/configuration.md) and [Troubleshooting](guides/troubleshooting.md).
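
The general "active session" pattern described here can be sketched in a few lines of plain Python. Everything below (`TinySession`, `current_database`, `qualify`) is a hypothetical stand-in written for illustration; it does not reproduce Sparkless internals, and the only assumption borrowed from Spark is that the initial database is named `default`.

```python
from typing import Optional


class TinySession:
    """Hypothetical stand-in for a session; not Sparkless's implementation."""

    _active: Optional["TinySession"] = None

    def __init__(self, app_name: str) -> None:
        self.app_name = app_name
        self.current_database = "default"
        TinySession._active = self  # the newest session becomes the active one

    @classmethod
    def get_active(cls) -> "TinySession":
        if cls._active is None:
            raise RuntimeError("No active session; create one first")
        return cls._active


def current_database() -> str:
    """Session-aware helper: reads state from whichever session is active."""
    return TinySession.get_active().current_database


def qualify(table: str) -> str:
    """Resolve an unqualified table name against the current database."""
    return table if "." in table else f"{current_database()}.{table}"


error_before_session = None
try:
    qualify("users")  # no session exists yet, so the helper cannot resolve
except RuntimeError as exc:
    error_before_session = str(exc)

session = TinySession("App")
before_switch = qualify("users")       # resolved against "default"
session.current_database = "test_db"   # analogous to setCurrentDatabase
after_switch = qualify("users")        # resolved against "test_db"
```

The sketch shows why ordering matters: helpers that consult the active session fail or resolve against the wrong database if they run before the session exists or before the current database is set.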

## Next Steps

- **[API Reference](api_reference.md)** - Complete API documentation
- **[SQL Operations](sql_operations_guide.md)** - SQL query examples
- **[Testing Patterns](testing_patterns.md)** - Test helpers and fixtures
- **[Examples](../examples/)** - More code examples

## Getting Help

- **GitHub**: [github.com/eddiethedean/sparkless](https://github.com/eddiethedean/sparkless)
- **Issues**: [github.com/eddiethedean/sparkless/issues](https://github.com/eddiethedean/sparkless/issues)
- **Documentation**: [docs/](.)