---
name: polars
description: Use when "Polars", "fast dataframe", "lazy evaluation", "Arrow backend", or asking about "pandas alternative", "parallel dataframe", "large CSV processing", "ETL pipeline", "expression API"
version: 1.0.0
---

# Polars Fast DataFrame Library

Lightning-fast DataFrame library with lazy evaluation and parallel execution.

## When to Use

- Pandas is too slow for your dataset
- Working with 1-100GB datasets that fit in RAM
- Need lazy evaluation for query optimization
- Building ETL pipelines
- Want parallel execution without extra config

---

## Lazy vs Eager Evaluation

| Mode | Function | Executes | Use Case |
|------|----------|----------|----------|
| **Eager** | `read_csv()` | Immediately | Small data, exploration |
| **Lazy** | `scan_csv()` | On `.collect()` | Large data, pipelines |

**Key concept**: Lazy mode builds a query plan that gets optimized before execution. The optimizer applies predicate pushdown (filter early) and projection pushdown (select columns early).

---

## Core Operations

### Data Selection

| Operation | Purpose |
|-----------|---------|
| `select()` | Choose columns |
| `filter()` | Choose rows by condition |
| `with_columns()` | Add/modify columns |
| `drop()` | Remove columns |
| `head(n)` / `tail(n)` | First/last n rows |

### Aggregation

| Operation | Purpose |
|-----------|---------|
| `group_by().agg()` | Group and aggregate |
| `pivot()` | Reshape wide |
| `melt()` | Reshape long |
| `unique()` | Distinct values |

### Joins

| Join Type | Description |
|-----------|-------------|
| **inner** | Matching rows only |
| **left** | All left + matching right |
| **outer** | All rows from both |
| **cross** | Cartesian product |
| **semi** | Left rows with match |
| **anti** | Left rows without match |

---

## Expression API

**Key concept**: Polars uses expressions (`pl.col()`) instead of indexing. Expressions are lazily evaluated and optimized.

### Common Expressions

| Expression | Purpose |
|------------|---------|
| `pl.col("name")` | Reference column |
| `pl.lit(value)` | Literal value |
| `pl.all()` | All columns |
| `pl.exclude(...)` | All except |

### Expression Methods

| Category | Methods |
|----------|---------|
| **Aggregation** | `.sum()`, `.mean()`, `.min()`, `.max()`, `.count()` |
| **String** | `.str.contains()`, `.str.replace()`, `.str.to_lowercase()` |
| **DateTime** | `.dt.year()`, `.dt.month()`, `.dt.day()` |
| **Conditional** | `.when().then().otherwise()` |
| **Window** | `.over()`, `.rolling_mean()`, `.shift()` |

---

## Pandas Migration

| Pandas | Polars |
|--------|--------|
| `df['col']` | `df.select('col')` |
| `df[df['col'] > 5]` | `df.filter(pl.col('col') > 5)` |
| `df['new'] = df['col'] * 2` | `df.with_columns((pl.col('col') * 2).alias('new'))` |
| `df.groupby('col').mean()` | `df.group_by('col').agg(pl.all().mean())` |
| `df.apply(func)` | `df.map_rows(func)` (avoid if possible) |

**Key concept**: Polars prefers explicit operations over implicit indexing. Use `.alias()` to name computed columns.

---

## File I/O

| Format | Read | Write | Notes |
|--------|------|-------|-------|
| **CSV** | `read_csv()` / `scan_csv()` | `write_csv()` | Human readable |
| **Parquet** | `read_parquet()` / `scan_parquet()` | `write_parquet()` | Fast, compressed |
| **JSON** | `read_json()` / `scan_ndjson()` | `write_json()` | Newline-delimited |
| **IPC/Arrow** | `read_ipc()` / `scan_ipc()` | `write_ipc()` | Zero-copy |

**Key concept**: Use Parquet for performance. Use `scan_*` for large files to enable lazy optimization.

---

## Performance Tips

| Tip | Why |
|-----|-----|
| Use lazy mode | Query optimization |
| Use Parquet | Column-oriented, compressed |
| Select columns early | Projection pushdown |
| Filter early | Predicate pushdown |
| Avoid Python UDFs | Breaks parallelism |
| Use expressions | Vectorized operations |
| Set dtypes on read | Avoid inference overhead |

---

## vs Alternatives

| Tool | Best For | Limitations |
|------|----------|-------------|
| **Polars** | 1-100GB, speed critical | Must fit in RAM |
| **Pandas** | Small data, ecosystem | Slow, memory hungry |
| **Dask** | Larger than RAM | More complex API |
| **Spark** | Cluster computing | Infrastructure overhead |
| **DuckDB** | SQL interface | Different API style |

## Resources

- Docs: <https://pola.rs/>
- User Guide: <https://docs.pola.rs/user-guide/>
- Cookbook: <https://docs.pola.rs/user-guide/misc/cookbook/>