# Storage API Guide

**⚠️ Important:** The `spark._storage` API is a **private, sparkless-specific convenience feature** that does not exist in PySpark and should not be used in production code. For code that needs to work with both sparkless and PySpark, use SQL commands or DataFrame operations instead.

This guide explains the two ways to manage databases and tables in sparkless, and when to use each approach.

## Overview

Sparkless provides two APIs for managing storage:

1. **PySpark-Compatible APIs** (SQL commands) - ✅ Use for compatibility with PySpark
2. **sparkless Convenience APIs** (`._storage` API) - ⚠️ Private and sparkless-specific, not available in PySpark

Both work identically in sparkless, but **only SQL commands are portable** between sparkless and PySpark.

## PySpark-Compatible APIs (Recommended for Compatibility)

Use SQL commands when you need code that works with both sparkless and PySpark.

### Creating Databases

```python
from sparkless.sql import SparkSession

spark = SparkSession("MyApp")

# Create database
spark.sql("CREATE DATABASE IF NOT EXISTS test_db")

# Use database
spark.sql("USE test_db")

# Drop database
spark.sql("DROP DATABASE IF EXISTS test_db CASCADE")
```

### Creating Tables

```python
# Create table with schema
spark.sql("""
    CREATE TABLE IF NOT EXISTS users (
        id INT,
        name STRING,
        age INT
    )
""")

# Insert data
spark.sql("""
    INSERT INTO users VALUES
    (1, 'Alice', 25),
    (2, 'Bob', 30),
    (3, 'Charlie', 35)
""")

# Query table
result = spark.sql("SELECT * FROM users WHERE age > 25")
result.show()
```

### Using Catalog API

```python
# List databases
databases = spark.catalog.listDatabases()
for db in databases:
    print(db.name)

# List tables
tables = spark.catalog.listTables("test_db")
for table in tables:
    print(table.name)

# Check if table exists
exists = spark.catalog.tableExists("users", "test_db")
```

### Benefits

- ✅ Works identically in PySpark and sparkless
- ✅ Standard SQL syntax
- ✅ No code changes needed when switching engines
- ✅ Familiar to PySpark developers

## sparkless Convenience APIs

Use the `._storage` API when writing sparkless-specific test utilities or when you need more convenient programmatic access.

### Creating Databases (Schemas)

```python
from sparkless.sql import SparkSession

spark = SparkSession("MyApp")

# Create schema (database)
spark._storage.create_schema("test_db")

# Check if schema exists
exists = spark._storage.schema_exists("test_db")

# List all schemas
schemas = spark._storage.list_schemas()

# Drop schema
spark._storage.drop_schema("test_db")
```

### Creating Tables

```python
from sparkless.sql.types import StructType, StructField, StringType, IntegerType

# Define schema
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Create table
spark._storage.create_table("test_db", "users", schema)

# Insert data
data = [
    {"id": 1, "name": "Alice", "age": 25},
    {"id": 2, "name": "Bob", "age": 30},
    {"id": 3, "name": "Charlie", "age": 35}
]
spark._storage.insert_data("test_db", "users", data)

# Get table as DataFrame
df = spark._storage.get_table("test_db", "users")
```

### Benefits

- ✅ More Pythonic API
- ✅ Direct programmatic access
- ✅ Easier for test setup
- ⚠️ **Not available in PySpark** - code won't work with real PySpark

## When to Use Which API

### Use SQL Commands (PySpark-Compatible) When:

- Writing code that needs to work with both sparkless and PySpark
- Following PySpark best practices
- Writing production-like code
- Sharing code with teams using PySpark
- Learning PySpark patterns

**Example:**

```python
# This code works in both sparkless and PySpark
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
spark.sql("CREATE TABLE analytics.events (timestamp TIMESTAMP, event_type STRING)")
```

### Use `._storage` API (sparkless Convenience) When:

- Writing sparkless-specific test utilities
- Setting up test fixtures
- You need convenient programmatic access
- Code will only run with sparkless

**Example:**

```python
# This is convenient for tests, but won't work with PySpark
import pytest
from sparkless.sql.types import StructType, StructField, IntegerType

@pytest.fixture
def setup_test_data(spark):
    spark._storage.create_schema("test")
    schema = StructType([StructField("id", IntegerType())])
    spark._storage.create_table("test", "data", schema)
    return spark
```

## Migration Guide

### Migrating from `._storage` API to SQL Commands

If you have code using the `._storage` API and want to make it PySpark-compatible:

**Before (sparkless only):**

```python
from sparkless.sql.types import StructType, StructField, StringType

spark._storage.create_schema("test_db")
schema = StructType([StructField("name", StringType())])
spark._storage.create_table("test_db", "users", schema)
spark._storage.insert_data("test_db", "users", [{"name": "Alice"}])
```

**After (PySpark-compatible):**

```python
spark.sql("CREATE DATABASE IF NOT EXISTS test_db")
spark.sql("CREATE TABLE test_db.users (name STRING)")
spark.sql("INSERT INTO test_db.users VALUES ('Alice')")
```

### Migrating from SQL Commands to `._storage` API

If you want to use the convenience API in sparkless-specific code:

**Before (SQL):**

```python
spark.sql("CREATE DATABASE IF NOT EXISTS test_db")
spark.sql("CREATE TABLE test_db.users (name STRING, age INT)")
spark.sql("INSERT INTO test_db.users VALUES ('Alice', 25)")
```

**After (convenience API):**

```python
from sparkless.sql.types import StructType, StructField, StringType, IntegerType

spark._storage.create_schema("test_db")
schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])
spark._storage.create_table("test_db", "users", schema)
spark._storage.insert_data("test_db", "users", [{"name": "Alice", "age": 25}])
```

## Best Practices

1. **For Production-Like Code**: Always use SQL commands for maximum compatibility
2. **For Test Utilities**: Use the `._storage` API for convenience in sparkless-specific test helpers
3. **For Learning**: Use SQL commands to learn PySpark patterns
4. **For Sharing**: Use SQL commands so code works for everyone

## Summary

| Feature | SQL Commands | `._storage` API |
|---------|--------------|-----------------|
| PySpark Compatible | ✅ Yes | ❌ No |
| Standard SQL | ✅ Yes | ❌ No |
| Programmatic Access | ⚠️ Via SQL strings | ✅ Direct API |
| Test Convenience | ⚠️ More verbose | ✅ More concise |
| Learning PySpark | ✅ Recommended | ⚠️ sparkless-specific |

**Recommendation**: Use SQL commands for code that needs to work with both engines. Use the `._storage` API for sparkless-specific test utilities.
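
One way to keep test setup concise without depending on `spark._storage` is to generate the portable SQL strings from Python data. The sketch below is illustrative only: `sql_literal`, `create_table_sql`, and `insert_sql` are hypothetical helper names, not part of sparkless or PySpark, and the helpers assume simple identifiers and simple values (strings containing quotes or backslashes are rejected rather than escaped, since escaping rules can differ between engines).

```python
def sql_literal(value):
    """Render a simple Python value as a SQL literal (hypothetical helper)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    text = str(value)
    if "'" in text or "\\" in text:
        # Escaping is engine-dependent, so this sketch refuses such values.
        raise ValueError(f"unsupported characters in string value: {text!r}")
    return f"'{text}'"


def create_table_sql(database, table, columns):
    """Build a CREATE TABLE statement from (name, sql_type) pairs."""
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return f"CREATE TABLE IF NOT EXISTS {database}.{table} ({cols})"


def insert_sql(database, table, columns, rows):
    """Build one INSERT statement from a list of row dicts."""
    names = [name for name, _ in columns]
    tuples = ", ".join(
        "(" + ", ".join(sql_literal(row.get(n)) for n in names) + ")"
        for row in rows
    )
    return f"INSERT INTO {database}.{table} VALUES {tuples}"


columns = [("id", "INT"), ("name", "STRING"), ("age", "INT")]
rows = [
    {"id": 1, "name": "Alice", "age": 25},
    {"id": 2, "name": "Bob", "age": 30},
]
statements = [
    "CREATE DATABASE IF NOT EXISTS test_db",
    create_table_sql("test_db", "users", columns),
    insert_sql("test_db", "users", columns, rows),
]
for statement in statements:
    print(statement)  # with a session available: spark.sql(statement)
```

Because the helpers only produce strings, each statement can be passed to `spark.sql(...)` in either engine, which keeps the fixture data in Python while the storage calls stay portable.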