Storage API Guide

⚠️ Important: The spark._storage API is a private, sparkless-specific convenience feature that does not exist in PySpark, and it should not be used in production code. For code that must work with both sparkless and PySpark, use SQL commands or DataFrame operations instead.

This guide explains the two ways to manage databases and tables in sparkless, and when to use each approach.

Overview

Sparkless provides two APIs for managing storage:

  1. PySpark-Compatible APIs (SQL commands) - ✅ Use for compatibility with PySpark

  2. sparkless Convenience APIs (._storage API) - ⚠️ Private sparkless-specific, not available in PySpark

Both work identically in sparkless, but only SQL commands are portable between sparkless and PySpark.

sparkless Convenience APIs

Use the ._storage API when writing sparkless-specific test utilities or when you need more convenient programmatic access.

Creating Databases (Schemas)

from sparkless.sql import SparkSession

spark = SparkSession("MyApp")

# Create schema (database)
spark._storage.create_schema("test_db")

# Check if schema exists
exists = spark._storage.schema_exists("test_db")

# List all schemas
schemas = spark._storage.list_schemas()

# Drop schema
spark._storage.drop_schema("test_db")
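
For comparison, the same four database operations can be sketched with SQL only. The helper names below are hypothetical, and the sketch assumes that the session's SQL parser accepts CREATE DATABASE, SHOW DATABASES, and DROP DATABASE and that result rows can be indexed by position, as they can in PySpark; check this against the sparkless version you use.

```python
# Hypothetical helpers: these names are illustrative and are not part of
# sparkless or PySpark. They only rely on spark.sql(...).collect().

def create_database(spark, name):
    """Create the database unless it already exists."""
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {name}")


def list_databases(spark):
    """Return database names from SHOW DATABASES.

    The result column name differs between Spark versions, so the value
    is read by position instead of by name.
    """
    return [row[0] for row in spark.sql("SHOW DATABASES").collect()]


def database_exists(spark, name):
    """Check membership in the SHOW DATABASES result."""
    return name in list_databases(spark)


def drop_database(spark, name):
    """Drop the database; does nothing when it is already gone."""
    spark.sql(f"DROP DATABASE IF EXISTS {name}")
```

Because the helpers only call spark.sql(), they do not depend on which engine created the session.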

Creating Tables

from sparkless.sql.types import StructType, StructField, StringType, IntegerType

# Define schema
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Create table
spark._storage.create_table("test_db", "users", schema)

# Insert data
data = [
    {"id": 1, "name": "Alice", "age": 25},
    {"id": 2, "name": "Bob", "age": 30},
    {"id": 3, "name": "Charlie", "age": 35}
]
spark._storage.insert_data("test_db", "users", data)

# Get table as DataFrame
df = spark._storage.get_table("test_db", "users")
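
When the same row dictionaries need to be loaded through SQL instead, an INSERT statement can be generated from them. The following is a minimal sketch under stated assumptions: build_insert and sql_literal are hypothetical helpers, only None, booleans, numbers, and simple strings are handled, and strings that contain quotes or backslashes are rejected because escaping rules differ between engines. It is meant for trusted test data, not user input.

```python
# Hypothetical helpers for turning row dicts into one INSERT statement.

def sql_literal(value):
    """Render a Python value as a SQL literal (limited set of types)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # bool must be checked before int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    text = str(value)
    if "'" in text or "\\" in text:
        # Escaping is engine-specific, so this sketch refuses to guess
        raise ValueError(f"cannot safely quote {text!r}")
    return f"'{text}'"


def build_insert(database, table, columns, rows):
    """Build a multi-row INSERT; missing keys become NULL."""
    tuples = [
        "(" + ", ".join(sql_literal(row.get(col)) for col in columns) + ")"
        for row in rows
    ]
    return f"INSERT INTO {database}.{table} VALUES " + ", ".join(tuples)


data = [
    {"id": 1, "name": "Alice", "age": 25},
    {"id": 2, "name": "Bob", "age": 30},
]
statement = build_insert("test_db", "users", ["id", "name", "age"], data)
# statement == "INSERT INTO test_db.users VALUES (1, 'Alice', 25), (2, 'Bob', 30)"
```

The resulting string can then be passed to spark.sql(statement). The column order must match the table definition, since the statement does not name the columns.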

Benefits

  • ✅ More Pythonic API

  • ✅ Direct programmatic access

  • ✅ Easier for test setup

  • ⚠️ Not available in PySpark - code won’t work with real PySpark

When to Use Which API

Use SQL Commands (PySpark-Compatible) When:

  • Writing code that needs to work with both sparkless and PySpark

  • Following PySpark best practices

  • Writing production-like code

  • Sharing code with teams using PySpark

  • Learning PySpark patterns

Example:

# This code works in both sparkless and PySpark
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
spark.sql("CREATE TABLE analytics.events (timestamp TIMESTAMP, event_type STRING)")

Use ._storage API (sparkless Convenience) When:

  • Writing sparkless-specific test utilities

  • Setting up test fixtures

  • Needing convenient programmatic access

  • Writing code that will only run with sparkless

Example:

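# This is convenient for tests, but won't work with PySpark
import pytest
from sparkless.sql.types import StructType, StructField, IntegerType

@pytest.fixture
def setup_test_data(spark):
    # Assumes a `spark` fixture that provides a sparkless SparkSession
    spark._storage.create_schema("test")
    schema = StructType([StructField("id", IntegerType())])
    spark._storage.create_table("test", "data", schema)
    return spark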
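
The fixture above creates a schema but never removes it. One possible way to keep tests isolated is a small context manager that drops the schema afterwards. This is a sketch, not part of sparkless: temporary_schema is a hypothetical name, it uses only the create_schema and drop_schema calls shown earlier in this guide, and whether drop_schema also removes the tables inside the schema is an assumption to verify.

```python
from contextlib import contextmanager


@contextmanager
def temporary_schema(spark, name):
    """Create a schema for the duration of a with-block, then drop it.

    Hypothetical helper; relies on the sparkless-only spark._storage API,
    so it will not work with a PySpark session.
    """
    spark._storage.create_schema(name)
    try:
        yield spark
    finally:
        # Runs even when the test body raises, so no schema leaks
        # from one test into the next
        spark._storage.drop_schema(name)
```

Inside a pytest fixture, the body could be written as a with temporary_schema(spark, "test") block that yields spark, so that cleanup happens when the test finishes.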

Migration Guide

Migrating from ._storage API to SQL Commands

If you have code that uses the ._storage API and want to make it PySpark-compatible:

Before (sparkless only):

spark._storage.create_schema("test_db")
schema = StructType([StructField("name", StringType())])
spark._storage.create_table("test_db", "users", schema)
spark._storage.insert_data("test_db", "users", [{"name": "Alice"}])

After (PySpark-compatible):

spark.sql("CREATE DATABASE IF NOT EXISTS test_db")
spark.sql("CREATE TABLE test_db.users (name STRING)")
spark.sql("INSERT INTO test_db.users VALUES ('Alice')")
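
For larger schemas, the CREATE TABLE statement can be derived from an existing StructType instead of being rewritten by hand. The sketch below makes several assumptions: create_table_sql and SQL_TYPE_NAMES are hypothetical names, the schema object exposes .fields with .name and .dataType as PySpark's StructType does, and the type mapping covers only a few common types. Nullability is ignored.

```python
# Hypothetical mapping from type class names to SQL type names.
# Extend it for any other types your schemas use.
SQL_TYPE_NAMES = {
    "StringType": "STRING",
    "IntegerType": "INT",
    "LongType": "BIGINT",
    "DoubleType": "DOUBLE",
    "BooleanType": "BOOLEAN",
    "TimestampType": "TIMESTAMP",
}


def create_table_sql(database, table, schema):
    """Build a CREATE TABLE statement from a StructType-like object."""
    column_defs = []
    for field in schema.fields:
        type_name = type(field.dataType).__name__
        if type_name not in SQL_TYPE_NAMES:
            # Fail loudly instead of emitting a guessed type
            raise ValueError(f"no SQL type mapped for {type_name}")
        column_defs.append(f"{field.name} {SQL_TYPE_NAMES[type_name]}")
    return f"CREATE TABLE {database}.{table} ({', '.join(column_defs)})"
```

With the single-column schema from the "Before" example, create_table_sql("test_db", "users", schema) would produce the same CREATE TABLE test_db.users (name STRING) statement shown in the "After" example.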

Migrating from SQL Commands to ._storage API

If you want to use the convenience API in sparkless-specific code:

Before (SQL):

spark.sql("CREATE DATABASE IF NOT EXISTS test_db")
spark.sql("CREATE TABLE test_db.users (name STRING, age INT)")
spark.sql("INSERT INTO test_db.users VALUES ('Alice', 25)")

After (convenience API):

spark._storage.create_schema("test_db")
schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])
spark._storage.create_table("test_db", "users", schema)
spark._storage.insert_data("test_db", "users", [{"name": "Alice", "age": 25}])

Best Practices

  1. For Production-Like Code: Always use SQL commands for maximum compatibility

  2. For Test Utilities: Use the ._storage API for convenience in sparkless-specific test helpers

  3. For Learning: Use SQL commands to learn PySpark patterns

  4. For Sharing: Use SQL commands so code works for everyone
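
One way to enforce the first practice is a small check that flags any use of the private API in code meant to be portable. The following is a sketch using Python's standard ast module; find_storage_usage is a hypothetical helper, and it only detects direct attribute access named _storage, not access through getattr or aliases.

```python
import ast


def find_storage_usage(source):
    """Return sorted line numbers where an attribute named _storage is read."""
    tree = ast.parse(source)
    return sorted({
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Attribute) and node.attr == "_storage"
    })


sample = (
    'spark.sql("CREATE DATABASE IF NOT EXISTS test_db")\n'
    'spark._storage.create_schema("test_db")\n'
)
print(find_storage_usage(sample))  # [2]
```

Such a check could run over production modules in CI while test helpers are excluded.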

Summary

Feature             | SQL Commands       | ._storage API
--------------------|--------------------|----------------------
PySpark Compatible  | ✅ Yes             | ❌ No
Standard SQL        | ✅ Yes             | ❌ No
Programmatic Access | ⚠️ Via SQL strings | ✅ Direct API
Test Convenience    | ⚠️ More verbose    | ✅ More concise
Learning PySpark    | ✅ Recommended     | ⚠️ sparkless-specific

Recommendation: Use SQL commands for code that needs to work with both engines. Use the ._storage API only for sparkless-specific test utilities.