---
name: polars
description: Lightning-fast DataFrame library written in Rust for high-performance data manipulation and analysis. Use when the user wants blazing-fast data transformations, works with large datasets, builds lazy evaluation pipelines, or needs better performance than pandas. Ideal for ETL, data wrangling, aggregations, joins, and reading/writing CSV, Parquet, and JSON files.
---

# Polars

## Overview

Polars is a blazingly fast DataFrame library written in Rust with Python bindings. Built for performance and memory efficiency, Polars leverages parallel execution and lazy evaluation to process data faster than pandas, especially on large datasets.

## When to Use This Skill

Activate when the user:

- Wants to work with DataFrames and needs high performance
- Mentions Polars explicitly or asks for "fast" data processing
- Needs to process large datasets (millions of rows)
- Wants lazy evaluation for query optimization
- Asks for data transformations, filtering, grouping, or aggregations
- Needs to read/write CSV, Parquet, JSON, or other data formats
- Wants to combine with DuckDB for SQL + DataFrame workflows

## Installation

Check if Polars is installed:

```bash
python3 -c "import polars as pl; print(pl.__version__)"
```

If not installed:

```bash
pip3 install polars
```

For full features including Parquet support:

```bash
pip3 install 'polars[pyarrow]'
```

For DuckDB integration:

```bash
pip3 install polars duckdb 'polars[pyarrow]'
```

## Core Capabilities

### 1. Reading Data

Polars can read data from various formats:

```python
import polars as pl

# Read CSV
df = pl.read_csv('data.csv')

# Read Parquet (fast, columnar format)
df = pl.read_parquet('data.parquet')

# Read JSON
df = pl.read_json('data.json')

# Read multiple files via a glob pattern
df = pl.read_csv('data/*.csv')

# Read with lazy evaluation (doesn't load until needed)
lazy_df = pl.scan_csv('large_data.csv')
lazy_df = pl.scan_parquet('data/*.parquet')
```

### 2. Data Selection and Filtering

```python
import polars as pl

df = pl.read_csv('data.csv')

# Select columns
result = df.select(['name', 'age', 'city'])

# Select with expressions
result = df.select([
    pl.col('name'),
    pl.col('age'),
    pl.col('salary').alias('annual_salary')
])

# Filter rows
result = df.filter(pl.col('age') > 25)

# Multiple conditions
result = df.filter(
    (pl.col('age') > 25) & (pl.col('city') == 'NYC')
)

# Filter with string methods
result = df.filter(pl.col('name').str.contains('John'))
```

### 3. Transformations and New Columns

```python
import polars as pl

df = pl.read_csv('sales.csv')

# Add new columns
result = df.with_columns([
    (pl.col('quantity') * pl.col('price')).alias('total'),
    pl.col('product').str.to_uppercase().alias('product_upper'),
    pl.when(pl.col('quantity') > 10)
      .then(pl.lit('bulk'))
      .otherwise(pl.lit('retail'))
      .alias('sale_type')
])

# Modify existing columns
result = df.with_columns([
    pl.col('price').round(2),
    pl.col('date').str.strptime(pl.Date, '%Y-%m-%d')
])

# Rename columns
result = df.rename({'old_name': 'new_name'})
```
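The file-based snippets above are illustrative; the same expressions work on an in-memory frame, which makes for a quick self-contained check. A minimal sketch with made-up data (column names and values are arbitrary):

```python
import polars as pl

# Hypothetical in-memory data standing in for sales.csv
df = pl.DataFrame({
    "product": ["widget", "gadget", "widget"],
    "quantity": [12, 3, 7],
    "price": [9.99, 24.50, 9.99],
})

result = (
    df.filter(pl.col("quantity") > 0)
      .with_columns(
          (pl.col("quantity") * pl.col("price")).alias("total"),
          pl.when(pl.col("quantity") > 10)
            .then(pl.lit("bulk"))
            .otherwise(pl.lit("retail"))
            .alias("sale_type"),
      )
)
print(result)
```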
### 4. Aggregations and Grouping

```python
import polars as pl

df = pl.read_csv('sales.csv')

# Group by and aggregate
result = df.group_by('category').agg([
    pl.col('sales').sum().alias('total_sales'),
    pl.col('sales').mean().alias('avg_sales'),
    pl.col('sales').count().alias('num_sales'),
    pl.col('product').n_unique().alias('unique_products')
])

# Multiple group by columns
result = df.group_by(['category', 'region']).agg([
    pl.col('revenue').sum(),
    pl.col('customer_id').n_unique().alias('unique_customers')
])

# Aggregation without grouping (aliases required so output names don't collide)
stats = df.select([
    pl.col('sales').sum().alias('sum'),
    pl.col('sales').mean().alias('mean'),
    pl.col('sales').median().alias('median'),
    pl.col('sales').std().alias('std'),
    pl.col('sales').min().alias('min'),
    pl.col('sales').max().alias('max')
])
```

### 5. Sorting and Ranking

```python
import polars as pl

df = pl.read_csv('data.csv')

# Sort by single column
result = df.sort('age')

# Sort descending
result = df.sort('salary', descending=True)

# Sort by multiple columns
result = df.sort(['department', 'salary'], descending=[False, True])

# Add rank column
result = df.with_columns([
    pl.col('salary').rank(method='dense').over('department').alias('dept_rank')
])
```

### 6. Joins

```python
import polars as pl

customers = pl.read_csv('customers.csv')
orders = pl.read_csv('orders.csv')

# Inner join
result = customers.join(orders, on='customer_id', how='inner')

# Left join
result = customers.join(orders, on='customer_id', how='left')

# Join on different column names
result = customers.join(
    orders,
    left_on='id',
    right_on='customer_id',
    how='inner'
)

# Join on multiple columns
result = df1.join(df2, on=['col1', 'col2'], how='inner')
```

### 7. Window Functions

```python
import polars as pl

df = pl.read_csv('sales.csv')

# Calculate running total
result = df.with_columns([
    pl.col('sales').cum_sum().over('region').alias('running_total')
])

# Calculate rolling average
result = df.with_columns([
    pl.col('sales').rolling_mean(window_size=7).alias('7_day_avg')
])

# Rank within groups
result = df.with_columns([
    pl.col('sales').rank().over('category').alias('category_rank')
])

# Lag and lead
result = df.with_columns([
    pl.col('sales').shift(1).over('product').alias('prev_sales'),
    pl.col('sales').shift(-1).over('product').alias('next_sales')
])
```

## Lazy Evaluation for Performance

Polars' lazy API optimizes queries before execution:

```python
import polars as pl

# Start with a lazy scan (doesn't load data yet)
lazy_df = (
    pl.scan_csv('large_data.csv')
    .filter(pl.col('date') >= '2024-01-01')
    .select(['customer_id', 'product', 'sales', 'date'])
    .group_by('customer_id')
    .agg([
        pl.col('sales').sum().alias('total_sales'),
        pl.col('product').n_unique().alias('unique_products')
    ])
    .filter(pl.col('total_sales') > 1000)
    .sort('total_sales', descending=True)
)

# Execute the optimized query
result = lazy_df.collect()

# Or get the execution plan
print(lazy_df.explain())
```
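When the final result is itself written to disk, a lazy query can also be streamed straight into a file instead of collected into memory. A minimal sketch reusing the hypothetical `large_data.csv` above (`sink_parquet` executes the plan with Polars' streaming engine; not every plan is fully streamable):

```python
import polars as pl

# Stream the aggregated result to Parquet without materializing it in memory
(
    pl.scan_csv('large_data.csv')
    .filter(pl.col('date') >= '2024-01-01')
    .group_by('customer_id')
    .agg(pl.col('sales').sum().alias('total_sales'))
    .sink_parquet('totals.parquet')  # 'totals.parquet' is an illustrative name
)
```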
## Common Patterns

### Pattern 1: ETL Pipeline

```python
import polars as pl

# Extract and Transform
result = (
    pl.scan_csv('raw_data.csv')
    # Clean data
    .filter(
        (pl.col('amount') > 0) & (pl.col('quantity') > 0)
    )
    # Transform columns
    .with_columns([
        pl.col('date').str.strptime(pl.Date, '%Y-%m-%d'),
        pl.col('product').str.strip_chars().str.to_uppercase(),
        (pl.col('quantity') * pl.col('amount')).alias('total'),
        pl.when(pl.col('quantity') > 10)
          .then(pl.lit('bulk'))
          .otherwise(pl.lit('retail'))
          .alias('order_type')
    ])
    # Aggregate
    .group_by(['date', 'product', 'order_type'])
    .agg([
        pl.col('total').sum().alias('daily_total'),
        pl.col('quantity').sum().alias('daily_quantity'),
        pl.len().alias('num_orders')
    ])
    .collect()
)

# Load (save results)
result.write_parquet('processed_data.parquet')
```

### Pattern 2: Data Exploration

```python
import polars as pl

df = pl.read_csv('data.csv')

# Quick overview
print(df.head())
print(df.describe())
print(df.schema)

# Column statistics (aliases required so output names don't collide)
print(df.select([
    pl.col('age').min().alias('min'),
    pl.col('age').max().alias('max'),
    pl.col('age').mean().alias('mean'),
    pl.col('age').median().alias('median'),
    pl.col('age').std().alias('std')
]))

# Count nulls
print(df.null_count())

# Value counts
print(df['category'].value_counts())

# Unique values
print(df['status'].n_unique())
```

### Pattern 3: Combining with DuckDB

Use Polars for data loading and DuckDB for SQL analytics:

```python
import duckdb
import polars as pl

# Load data with Polars
df = pl.read_parquet('data/*.parquet')

# Use DuckDB for complex SQL
result = duckdb.sql("""
    SELECT
        category,
        DATE_TRUNC('month', date) AS month,
        SUM(revenue) AS monthly_revenue,
        COUNT(DISTINCT customer_id) AS unique_customers
    FROM df
    WHERE date >= '2024-01-01'
    GROUP BY category, month
    ORDER BY month DESC, monthly_revenue DESC
""").pl()  # Convert back to a Polars DataFrame

# Continue with Polars
final = result.with_columns([
    (pl.col('monthly_revenue') / pl.col('unique_customers')).alias('revenue_per_customer')
])
```

### Pattern 4: Writing Data

```python
import polars as pl

df = pl.read_csv('data.csv')

# Write to CSV
df.write_csv('output.csv')

# Write to Parquet (recommended for large data)
df.write_parquet('output.parquet')

# Write to JSON
df.write_json('output.json')

# Write partitioned Parquet files
df.write_parquet('output/', partition_by='date')
```

## Expression Chaining

Polars uses a powerful expression syntax:

```python
import polars as pl

result = df.select([
    # String operations
    pl.col('name').str.to_lowercase().str.strip_chars().alias('clean_name'),

    # Arithmetic
    (pl.col('price') * 1.1).round(2).alias('price_with_tax'),

    # Conditional logic
    pl.when(pl.col('age') < 18)
      .then(pl.lit('minor'))
      .when(pl.col('age') < 65)
      .then(pl.lit('adult'))
      .otherwise(pl.lit('senior'))
      .alias('age_group'),

    # Date operations
    pl.col('date').dt.year().alias('year'),
    pl.col('date').dt.month().alias('month'),

    # List operations
    pl.col('tags').list.len().alias('num_tags'),
])
```

## Performance Tips

1. **Use lazy evaluation** for large datasets - lets Polars optimize the query
2. **Use Parquet format** - columnar, compressed, much faster than CSV
3. **Filter early** - push filters before other operations
4. **Avoid row iteration** - use vectorized operations instead (see the sketch after the example below)
5. **Use expressions** - more efficient than pandas-style column-by-column operations

```python
# Good: Lazy + filter early
result = (
    pl.scan_parquet('large.parquet')
    .filter(pl.col('date') >= '2024-01-01')  # Filter first
    .select(['col1', 'col2', 'col3'])        # Then select
    .collect()
)

# Less efficient: Eager loading
df = pl.read_parquet('large.parquet')
result = df.filter(pl.col('date') >= '2024-01-01').select(['col1', 'col2', 'col3'])
```
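To make tip 4 concrete, here is a minimal sketch (made-up data) contrasting a Python-level loop over `iter_rows()` with the equivalent vectorized expression. Both produce the same number, but the expression runs inside the engine and avoids creating a Python object per row:

```python
import polars as pl

df = pl.DataFrame({"x": range(100_000)})

# Slow: a Python-level loop; every row is materialized as a tuple
loop_total = sum(row[0] * 2 for row in df.iter_rows())

# Fast: one vectorized expression, executed inside the engine
expr_total = df.select((pl.col("x") * 2).sum()).item()

assert loop_total == expr_total
```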
## Polars vs Pandas

Key differences:

- **Immutability**: Polars DataFrames are immutable (operations return new DataFrames)
- **Performance**: Polars is often 5-10x faster than pandas, depending on the workload
- **Lazy evaluation**: Polars can optimize queries before execution
- **Expressions**: Polars uses a composable expression API instead of pandas-style column indexing and assignment
- **Parallelism**: Polars automatically parallelizes operations

```python
# Pandas style
df['new_col'] = df['col1'] * df['col2']

# Polars style
df = df.with_columns([
    (pl.col('col1') * pl.col('col2')).alias('new_col')
])
```

## Integration with DuckDB

For the best of both worlds, combine Polars and DuckDB:

```python
import duckdb
import polars as pl

# Polars: Fast data loading and transformation
df = (
    pl.scan_parquet('data/*.parquet')
    .filter(pl.col('active'))  # boolean column; no need for == True
    .collect()
)

# DuckDB: SQL analytics
result = duckdb.sql("""
    SELECT
        category,
        SUM(amount) AS total,
        AVG(amount) AS average
    FROM df
    GROUP BY category
""").pl()
```

See the `duckdb` skill for more SQL capabilities and the references/api_reference.md file for detailed Polars API documentation.

## Error Handling

```python
import polars as pl

try:
    df = pl.read_csv('data.csv')
except FileNotFoundError:
    print("File not found")
except pl.exceptions.ComputeError as e:
    print(f"Polars compute error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```

## Resources

- **references/api_reference.md**: Detailed Polars API documentation and examples
- Official docs: https://docs.pola.rs/
- API reference: https://docs.pola.rs/api/python/stable/reference/
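## Quick Sanity Check

A minimal end-to-end smoke test, using made-up in-memory data so no files are required; it should run on any recent Polars install and exercises expressions, grouping, and lazy collection:

```python
import polars as pl

print(pl.__version__)

df = pl.DataFrame({
    "category": ["a", "b", "a", "b"],
    "amount": [10.0, 20.0, 30.0, 40.0],
})

# Lazy pipeline over in-memory data; maintain_order keeps output deterministic
out = (
    df.lazy()
    .group_by("category", maintain_order=True)
    .agg(pl.col("amount").sum().alias("total"))
    .collect()
)
print(out)
```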