Sparkless


All modules for which code is available

  • sparkless.backend.polars.expression_translator
  • sparkless.backend.polars.materializer
  • sparkless.backend.polars.operation_executor
  • sparkless.backend.polars.storage
  • sparkless.backend.protocols
  • sparkless.dataframe.dataframe
  • sparkless.dataframe.grouped.base
  • sparkless.dataframe.reader
  • sparkless.dataframe.writer
  • sparkless.functions.aggregate
  • sparkless.functions.array
  • sparkless.functions.base
  • sparkless.functions.bitwise
  • sparkless.functions.conditional
  • sparkless.functions.core.column
  • sparkless.functions.core.literals
  • sparkless.functions.crypto
  • sparkless.functions.datetime
  • sparkless.functions.functions
  • sparkless.functions.json_csv
  • sparkless.functions.map
  • sparkless.functions.math
  • sparkless.functions.string
  • sparkless.functions.udf
  • sparkless.functions.window_execution
  • sparkless.functions.xml
  • sparkless.session.catalog
  • sparkless.session.config.configuration
  • sparkless.session.context
  • sparkless.spark_types
  • sparkless.storage.backends.file
  • sparkless.storage.backends.memory
  • sparkless.storage.manager
  • sparkless.storage.models
  • sparkless.storage.serialization.csv
  • sparkless.storage.serialization.json

© Copyright 2025, Odos Matthews.
