Storage API Guide

⚠️ Important: The spark._storage API is a private, sparkless-specific convenience feature that does not exist in PySpark and should not be used in production code. For code that needs to work with both sparkless and PySpark, use SQL commands or DataFrame operations instead.

This guide explains the two ways to manage databases and tables in sparkless, and when to use each approach.

Overview

Sparkless provides two APIs for managing storage:

PySpark-Compatible APIs (SQL commands) - ✅ Use for compatibility with PySpark
sparkless Convenience APIs (._storage API) - ⚠️ Private sparkless-specific, not available in PySpark

Both work identically in sparkless, but only SQL commands are portable between sparkless and PySpark.

PySpark-Compatible APIs (Recommended for Compatibility)

Use SQL commands when you need code that works with both sparkless and PySpark.

Creating Databases

from sparkless.sql import SparkSession

spark = SparkSession("MyApp")

# Create database
spark.sql("CREATE DATABASE IF NOT EXISTS test_db")

# Use database
spark.sql("USE test_db")

# Drop database
spark.sql("DROP DATABASE IF EXISTS test_db CASCADE")

Creating Tables

# Create table with schema
spark.sql("""
    CREATE TABLE IF NOT EXISTS users (
        id INT,
        name STRING,
        age INT
    )
""")

# Insert data
spark.sql("""
    INSERT INTO users VALUES
    (1, 'Alice', 25),
    (2, 'Bob', 30),
    (3, 'Charlie', 35)
""")

# Query table
result = spark.sql("SELECT * FROM users WHERE age > 25")
result.show()

Using Catalog API

# List databases
databases = spark.catalog.listDatabases()
for db in databases:
    print(db.name)

# List tables
tables = spark.catalog.listTables("test_db")
for table in tables:
    print(table.name)

# Check if table exists
exists = spark.catalog.tableExists("users", "test_db")
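
The catalog calls above can be combined into a small inspection helper. The following is a minimal sketch, not part of sparkless: `describe_catalog` is a hypothetical name, and the `Mock`-based session is a stand-in so the example runs without sparkless or PySpark installed. With a real session, only the function itself would be needed.

```python
from types import SimpleNamespace
from unittest.mock import Mock


def describe_catalog(spark):
    """Return {database_name: [table_name, ...]} using only catalog calls."""
    layout = {}
    for db in spark.catalog.listDatabases():
        tables = spark.catalog.listTables(db.name)
        layout[db.name] = sorted(t.name for t in tables)
    return layout


# Stand-in session (assumption): mimics the two catalog calls used above.
fake_spark = Mock()
fake_spark.catalog.listDatabases.return_value = [SimpleNamespace(name="test_db")]
fake_spark.catalog.listTables.return_value = [
    SimpleNamespace(name="users"),
    SimpleNamespace(name="events"),
]

print(describe_catalog(fake_spark))  # {'test_db': ['events', 'users']}
```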

Benefits

✅ Works identically in PySpark and sparkless
✅ Standard SQL syntax
✅ No code changes needed when switching engines
✅ Familiar to PySpark developers
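
One way to take advantage of this portability is to write setup helpers that accept the session as an argument and call nothing but `spark.sql`. The sketch below assumes that pattern. `create_users_table` and `RecordingSession` are hypothetical names, and the recording class exists only so the example runs without either engine installed.

```python
class RecordingSession:
    """Hypothetical stand-in exposing only sql(); records each statement."""

    def __init__(self):
        self.statements = []

    def sql(self, statement):
        # Collapse whitespace so multi-line statements are easy to compare.
        self.statements.append(" ".join(statement.split()))


def create_users_table(spark, database="test_db"):
    """Set up the users table through spark.sql only, so it is engine-agnostic."""
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {database}")
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS {database}.users (
            id INT,
            name STRING,
            age INT
        )
    """)


session = RecordingSession()
create_users_table(session)
for statement in session.statements:
    print(statement)
```

Because the helper depends only on `sql()`, the same function should accept a sparkless or a PySpark session unchanged.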

sparkless Convenience APIs

Use the ._storage API when writing sparkless-specific test utilities or when you need more convenient programmatic access.

Creating Databases (Schemas)

from sparkless.sql import SparkSession

spark = SparkSession("MyApp")

# Create schema (database)
spark._storage.create_schema("test_db")

# Check if schema exists
exists = spark._storage.schema_exists("test_db")

# List all schemas
schemas = spark._storage.list_schemas()

# Drop schema
spark._storage.drop_schema("test_db")

Creating Tables

from sparkless.sql.types import StructType, StructField, StringType, IntegerType

# Define schema
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Create table
spark._storage.create_table("test_db", "users", schema)

# Insert data
data = [
    {"id": 1, "name": "Alice", "age": 25},
    {"id": 2, "name": "Bob", "age": 30},
    {"id": 3, "name": "Charlie", "age": 35}
]
spark._storage.insert_data("test_db", "users", data)

# Get table as DataFrame
df = spark._storage.get_table("test_db", "users")

Benefits

✅ More Pythonic API
✅ Direct programmatic access
✅ Easier for test setup
⚠️ Not available in PySpark - code won’t work with real PySpark

When to Use Which API

Use SQL Commands (PySpark-Compatible) When:

Writing code that needs to work with both sparkless and PySpark
Following PySpark best practices
Writing production-like code
Sharing code with teams using PySpark
Learning PySpark patterns

Example:

# This code works in both sparkless and PySpark
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
spark.sql("CREATE TABLE analytics.events (timestamp TIMESTAMP, event_type STRING)")

Use `._storage` API (sparkless Convenience) When:

Writing sparkless-specific test utilities
Setting up test fixtures
Needing convenient programmatic access
Writing code that will only ever run with sparkless

Example:

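import pytest
from sparkless.sql.types import StructType, StructField, IntegerType

# This is convenient for tests, but won't work with PySpark.
# Assumes a `spark` fixture that provides a sparkless SparkSession.
@pytest.fixture
def setup_test_data(spark):
    spark._storage.create_schema("test")
    schema = StructType([StructField("id", IntegerType())])
    spark._storage.create_table("test", "data", schema)
    return spark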
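
When the same fixture needs to stay portable, setup and teardown can be expressed with SQL commands instead. The following is a sketch under assumptions: `temporary_database` is a hypothetical helper, and the `Mock` session stands in for a real one so the example runs on its own.

```python
from contextlib import contextmanager
from unittest.mock import Mock


@contextmanager
def temporary_database(spark, name):
    """Create a database for the duration of a with-block, then drop it."""
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {name}")
    try:
        yield name
    finally:
        # CASCADE also removes any tables created inside the block.
        spark.sql(f"DROP DATABASE IF EXISTS {name} CASCADE")


fake_spark = Mock()  # stand-in for a sparkless or PySpark session
with temporary_database(fake_spark, "test") as db:
    fake_spark.sql(f"CREATE TABLE {db}.data (id INT)")

# Show the statements in the order they were issued.
for recorded in fake_spark.sql.call_args_list:
    print(recorded.args[0])
```

Wrapping the `with` block in a pytest fixture that yields inside it would give each test a fresh database and drop it afterwards, even when the test fails.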

Migration Guide

Migrating from `._storage` API to SQL Commands

If you have code that uses the ._storage API and want to make it PySpark-compatible:

Before (sparkless only):

spark._storage.create_schema("test_db")
schema = StructType([StructField("name", StringType())])
spark._storage.create_table("test_db", "users", schema)
spark._storage.insert_data("test_db", "users", [{"name": "Alice"}])

After (PySpark-compatible):

spark.sql("CREATE DATABASE IF NOT EXISTS test_db")
spark.sql("CREATE TABLE test_db.users (name STRING)")
spark.sql("INSERT INTO test_db.users VALUES ('Alice')")
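
For larger fixtures, the dict rows passed to `insert_data` can be turned into an `INSERT` statement mechanically. The helper below is a minimal sketch, not a sparkless or PySpark function: `rows_to_insert_sql` is a hypothetical name, it handles only `None`, booleans, integers, and plain strings, and it rejects strings containing quotes or backslashes rather than guess at the engine's escaping rules.

```python
def rows_to_insert_sql(table, columns, rows):
    """Render dict rows as one INSERT statement with literal values.

    `columns` must list the keys in the table's column order, because the
    generated statement relies on positional VALUES.
    """

    def literal(value):
        if value is None:
            return "NULL"
        if isinstance(value, bool):  # check bool before int (bool is an int)
            return "TRUE" if value else "FALSE"
        if isinstance(value, int):
            return str(value)
        if isinstance(value, str):
            if "'" in value or "\\" in value:
                raise ValueError(f"cannot safely quote {value!r}")
            return f"'{value}'"
        raise TypeError(f"unsupported value type: {type(value).__name__}")

    tuples = [
        "(" + ", ".join(literal(row[col]) for col in columns) + ")"
        for row in rows
    ]
    return f"INSERT INTO {table} VALUES " + ", ".join(tuples)


data = [
    {"id": 1, "name": "Alice", "age": 25},
    {"id": 2, "name": "Bob", "age": 30},
]
print(rows_to_insert_sql("test_db.users", ["id", "name", "age"], data))
# INSERT INTO test_db.users VALUES (1, 'Alice', 25), (2, 'Bob', 30)
```

The resulting string can be passed to `spark.sql(...)`. Values from untrusted sources should not be interpolated into SQL this way.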

Migrating from SQL Commands to `._storage` API

If you want to use the convenience API in sparkless-specific code:

Before (SQL):

spark.sql("CREATE DATABASE IF NOT EXISTS test_db")
spark.sql("CREATE TABLE test_db.users (name STRING, age INT)")
spark.sql("INSERT INTO test_db.users VALUES ('Alice', 25)")

After (convenience API):

spark._storage.create_schema("test_db")
schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])
spark._storage.create_table("test_db", "users", schema)
spark._storage.insert_data("test_db", "users", [{"name": "Alice", "age": 25}])

Best Practices

For Production-Like Code: Always use SQL commands for maximum compatibility
For Test Utilities: Use the ._storage API for convenience in sparkless-specific test helpers
For Learning: Use SQL commands to learn PySpark patterns
For Sharing: Use SQL commands so code works for everyone

Summary

Feature	SQL Commands	`._storage` API
PySpark Compatible	✅ Yes	❌ No
Standard SQL	✅ Yes	❌ No
Programmatic Access	⚠️ Via SQL strings	✅ Direct API
Test Convenience	⚠️ More verbose	✅ More concise
Learning PySpark	✅ Recommended	⚠️ sparkless specific

Recommendation: Use SQL commands for code that needs to work with both engines. Use the ._storage API only for sparkless-specific test utilities.
