import marimo

__generated_with = "0.23.3"
app = marimo.App(width="medium", app_title="Introduction to Polars")


@app.cell
def _():
    import marimo as mo

    return (mo,)


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    # Introduction to Polars

    *Written by Luke Chang*

    Polars is a blazing-fast DataFrame library for Python, written in Rust. It is designed for high-performance data manipulation and offers an expressive, consistent API that makes data wrangling both efficient and enjoyable. Unlike older tools, Polars was built from the ground up to take advantage of modern hardware through multi-threaded execution, lazy evaluation, and memory-efficient columnar storage.

    **Why Polars?**

    - **Speed**: Polars is one of the fastest DataFrame libraries available, routinely outperforming alternatives by 10-100x on large datasets.
    - **Memory efficiency**: Its Apache Arrow-based columnar format minimizes memory usage and avoids unnecessary copies.
    - **Expressive API**: The expression system lets you write concise, readable queries that are easy to compose and optimize.
    - **Lazy evaluation**: Polars can build an optimized query plan before executing, enabling automatic optimizations like predicate pushdown and projection pruning.

    In this tutorial, we will learn the fundamentals of Polars using a faculty salary dataset. By the end, you will be comfortable loading data, transforming columns, filtering rows, grouping and aggregating, and using advanced features like window functions and lazy evaluation.

    For more details, check out the official [Polars documentation](https://docs.pola.rs/).
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    """)
    return


@app.cell
def _():
    import polars as pl
    import numpy as np

    return np, pl


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Polars Objects

    Polars provides two core data structures: **Series** and **DataFrame**. Understanding these is the first step to working with Polars effectively.
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ### Series

    A `Series` is a typed, one-dimensional array. Every element in a Series has the same data type, which is determined at creation time. You can think of it as a single column of data.
    """)
    return


@app.cell
def _(pl):
    # Create a Series from a list of integers
    ages = pl.Series("age", [25, 30, 35, 40, 45])
    ages
    return


@app.cell
def _(pl):
    # Create a Series with an explicit type
    scores = pl.Series("score", [88.5, 92.0, 76.3, 95.1], dtype=pl.Float32)
    scores
    return


@app.cell
def _(pl):
    # String Series
    names = pl.Series("name", ["Alice", "Bob", "Charlie", "Diana"])
    names
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ### DataFrame

    A `DataFrame` is a two-dimensional table composed of multiple named Series (columns). Each column has its own data type, and all columns share the same length.

    A key difference from some other tools: **Polars DataFrames have no row index**. Rows are identified by their position, and any row labeling must be done explicitly through a column. This keeps the API simple and avoids ambiguity.
    """)
    return


@app.cell
def _(pl):
    # Create a DataFrame from a dictionary
    df_example = pl.DataFrame({
        "name": ["Alice", "Bob", "Charlie", "Diana"],
        "age": [25, 30, 35, 40],
        "salary": [50000, 60000, 70000, 80000],
    })
    df_example
    return


@app.cell
def _(np, pl):
    # Create a DataFrame from numpy arrays
    rng = np.random.default_rng(42)
    df_from_numpy = pl.DataFrame({
        "x": rng.normal(0, 1, 5),
        "y": rng.normal(0, 1, 5),
    })
    df_from_numpy
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Loading Data

    Polars can read data from many formats including CSV, Parquet, JSON, and more. We will load a faculty salary dataset that contains information about salaries, departments, years of experience, and other attributes.

    The `pl.read_csv()` function reads a CSV file eagerly (loading everything into memory immediately). Polars can also read directly from URLs.
    """)
    return


@app.cell
def _(pl):
    url = "https://raw.githubusercontent.com/ljchang/dartbrains/master/data/salary/salary.csv"
    df = pl.read_csv(url)
    df
    return df, url


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    For very large files, you can use `pl.scan_csv()` instead to create a **LazyFrame** that defers execution until you call `.collect()`. We will explore lazy evaluation in detail later in this tutorial.
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Inspecting Data

    Before diving into analysis, it is important to understand the structure and contents of your data. Polars provides several handy methods for this.
    """)
    return


@app.cell
def _(df):
    # View the first few rows
    df.head()
    return


@app.cell
def _(df):
    # View the last few rows
    df.tail()
    return


@app.cell
def _(df):
    # Random sample of rows
    df.sample(5, seed=42)
    return


@app.cell
def _(df):
    # Summary statistics for all columns
    df.describe()
    return


@app.cell
def _(df):
    # Column names and their data types
    df.schema
    return


@app.cell
def _(df):
    # Shape of the DataFrame (rows, columns)
    df.shape
    return


@app.cell
def _(df):
    # List of column names
    df.columns
    return


@app.cell
def _(df):
    # Check for missing values in each column
    df.null_count()
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Dealing with Missing Values

    Polars uses `null` to represent missing data. This is distinct from `NaN` (Not a Number), which is a valid floating-point value. The `null` type is consistent across all data types, whether numeric, string, or boolean.
    """)
    return


@app.cell
def _(df):
    # See how many nulls each column has
    df.null_count()
    return


@app.cell
def _(df):
    # Drop all rows that contain any null value
    df_no_nulls = df.drop_nulls()
    print(f"Original rows: {df.shape[0]}, After dropping nulls: {df_no_nulls.shape[0]}")
    df_no_nulls
    return


@app.cell
def _(df, pl):
    # Fill nulls with a specific strategy (e.g., fill numeric nulls with the column mean)
    df_filled = df.with_columns(
        pl.col("years").fill_null(strategy="mean"),
        pl.col("age").fill_null(strategy="mean"),
    )
    df_filled.null_count()
    return


@app.cell
def _(df, pl):
    # Fill nulls with a constant value
    df.with_columns(
        pl.col("years").fill_null(0),
    ).head()
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## The Expression API

    The expression API is the heart of Polars. Expressions describe computations on columns without immediately executing them. They are the building blocks you pass to methods like `select()`, `filter()`, `with_columns()`, and `group_by().agg()`.

    The most common starting point is `pl.col("column_name")`, which references a column. From there, you can chain operations to build up complex transformations.
    """)
    return


@app.cell
def _(df, pl):
    # Reference a single column and compute its mean
    df.select(pl.col("salary").mean())
    return


@app.cell
def _(df, pl):
    # Chain multiple operations: compute several summary statistics at once
    df.select(
        pl.col("salary").mean().alias("mean_salary"),
        pl.col("salary").median().alias("median_salary"),
        pl.col("salary").min().alias("min_salary"),
        pl.col("salary").max().alias("max_salary"),
        pl.col("salary").std().alias("std_salary"),
    )
    return


@app.cell
def _(df, pl):
    # Compute the mean of multiple columns at once
    df.select(pl.col("salary", "years", "age", "publications").mean())
    return


@app.cell
def _(df, pl):
    # Use pl.all() to compute the sum of every numeric column
    df.select(pl.all().sum())
    return


@app.cell
def _(df, pl):
    # Conditional expressions with when/then/otherwise
    df.select(
        pl.col("salary"),
        pl.col("departm"),
        pl.when(pl.col("salary") > 80000)
          .then(pl.lit("high"))
          .otherwise(pl.lit("standard"))
          .alias("salary_tier"),
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Creating New Columns

    The `with_columns()` method adds new columns or transforms existing ones. It always returns a **new** DataFrame, leaving the original unchanged. This immutability makes your code easier to reason about and debug.

    Use `.alias("name")` to give your computed column a name.
    """)
    return


@app.cell
def _(df, pl):
    # Create a column showing salary in thousands
    df_enhanced = df.with_columns(
        (pl.col("salary") / 1000).round(1).alias("salary_thousands"),
    )
    df_enhanced.head()
    return


@app.cell
def _(df, pl):
    # Create multiple new columns at once
    df_multi = df.with_columns(
        (pl.col("salary") > 80000).alias("high_salary"),
        (pl.col("salary") / pl.col("publications")).round(0).alias("salary_per_pub"),
        (pl.col("age") - pl.col("years")).alias("age_at_start"),
    )
    df_multi.head()
    return


@app.cell
def _(df):
    # The original DataFrame is unchanged (immutability)
    print("Original columns:", df.columns)
    df.head(3)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Selecting and Filtering

    `select()` chooses which columns to keep, while `filter()` chooses which rows to keep. Both accept Polars expressions.
    """)
    return


@app.cell
def _(df):
    # Select specific columns by name
    df.select("salary", "departm", "gender")
    return


@app.cell
def _(df, pl):
    # Select using expressions (allows renaming/transforming inline)
    df.select(
        pl.col("departm"),
        (pl.col("salary") / 1000).alias("salary_k"),
    )
    return


@app.cell
def _(df, pl):
    # Filter rows where salary exceeds 80000
    df.filter(pl.col("salary") > 80000)
    return


@app.cell
def _(df, pl):
    # Combine multiple filter conditions with & (and) and | (or)
    df.filter(
        (pl.col("salary") > 70000) & (pl.col("departm") == "neuro")
    )
    return


@app.cell
def _(df, pl):
    # Filter using is_in to match against a set of values
    df.filter(pl.col("departm").is_in(["neuro", "stat", "bio"]))
    return


@app.cell
def _(df):
    # Get unique departments
    df.select("departm").unique()
    return


@app.cell
def _(df):
    # Sort by salary descending
    df.sort("salary", descending=True).head(10)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Renaming Columns

    Use `rename()` to change column names. Pass a dictionary mapping old names to new names.
    """)
    return


@app.cell
def _(df):
    df_renamed = df.rename({
        "departm": "department",
        "years": "years_experience",
    })
    df_renamed.head(3)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Operations

    Polars supports a wide range of operations on columns, including arithmetic, string manipulations, and type casting. Whenever possible, use native Polars expressions rather than custom Python functions for best performance.
    """)
    return


@app.cell
def _(df, pl):
    # Arithmetic: give everyone a 10% raise
    df.select(
        pl.col("departm"),
        pl.col("salary"),
        (pl.col("salary") * 1.10).round(0).cast(pl.Int64).alias("salary_with_raise"),
    )
    return


@app.cell
def _(df, pl):
    # String operations: convert department names to uppercase
    df.select(
        pl.col("departm").str.to_uppercase().alias("department_upper"),
        pl.col("salary"),
    ).head()
    return


@app.cell
def _(df, pl):
    # Cast a column to a different type
    df.select(
        pl.col("salary").cast(pl.Float64).alias("salary_float"),
        pl.col("gender").cast(pl.Utf8).alias("gender_str"),
    ).head()
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    For truly custom logic that cannot be expressed with native Polars expressions, you can use `map_elements()` to apply a Python function element-wise. However, this is significantly slower than native expressions because it bypasses Polars' optimized execution engine.
    """)
    return


@app.cell
def _(df, pl):
    # map_elements example (slow — prefer native expressions when possible)
    df.select(
        pl.col("departm"),
        pl.col("salary").map_elements(
            lambda x: f"${x:,}", return_dtype=pl.Utf8
        ).alias("salary_formatted"),
    ).head()
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Joining Data

    Polars supports several types of joins for combining DataFrames. The syntax is straightforward: call `.join()` on the left DataFrame and pass the right DataFrame along with the join key and type.
    """)
    return


@app.cell
def _(pl):
    # Create two example DataFrames to demonstrate joins
    departments = pl.DataFrame({
        "dept_code": ["bio", "chem", "neuro", "stat", "physics"],
        "full_name": ["Biology", "Chemistry", "Neuroscience", "Statistics", "Physics"],
        "building": ["LSC", "Burke", "Moore", "Kemeny", "Wilder"],
    })

    budgets = pl.DataFrame({
        "dept_code": ["bio", "chem", "neuro", "math", "geol"],
        "annual_budget": [500000, 750000, 900000, 300000, 400000],
    })
    return budgets, departments


@app.cell
def _(budgets, departments):
    # Inner join: only keeps rows where the key exists in both DataFrames
    departments.join(budgets, on="dept_code", how="inner")
    return


@app.cell
def _(budgets, departments):
    # Left join: keeps all rows from the left DataFrame
    departments.join(budgets, on="dept_code", how="left")
    return


@app.cell
def _(budgets, departments):
    # Full outer join: keeps all rows from both DataFrames
    departments.join(budgets, on="dept_code", how="full", coalesce=True)
    return


@app.cell
def _(pl):
    # Vertical stacking (concatenating rows)
    df_a = pl.DataFrame({"name": ["Alice", "Bob"], "score": [90, 85]})
    df_b = pl.DataFrame({"name": ["Charlie", "Diana"], "score": [78, 92]})
    pl.concat([df_a, df_b])
    return


@app.cell
def _(pl):
    # Horizontal stacking (concatenating columns)
    df_left = pl.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})
    df_right = pl.DataFrame({"salary": [50000, 60000], "dept": ["bio", "chem"]})
    pl.concat([df_left, df_right], how="horizontal")
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Grouping and Aggregation

    Grouping is one of the most powerful operations in data analysis. Polars uses `group_by()` to split a DataFrame by one or more columns, then `agg()` to compute summary statistics within each group.
    """)
    return


@app.cell
def _(df, pl):
    # Average salary by department
    df.group_by("departm").agg(
        pl.col("salary").mean().alias("avg_salary"),
    ).sort("avg_salary", descending=True)
    return


@app.cell
def _(df, pl):
    # Multiple aggregations in a single call
    df.group_by("departm").agg(
        pl.col("salary").mean().alias("avg_salary"),
        pl.col("salary").max().alias("max_salary"),
        pl.col("salary").min().alias("min_salary"),
        pl.col("publications").mean().alias("avg_publications"),
        pl.len().alias("count"),
    ).sort("avg_salary", descending=True)
    return


@app.cell
def _(df, pl):
    # Group by multiple columns
    df.group_by("departm", "gender").agg(
        pl.col("salary").mean().alias("avg_salary"),
        pl.len().alias("count"),
    ).sort("departm", "gender")
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Window Functions

    Window functions compute values across groups **without collapsing rows**. In Polars, you use the `over()` expression to define the grouping. This is extremely powerful because it lets you add group-level statistics as new columns while keeping every individual row intact.

    This replaces the common pattern of grouping, computing a statistic, and then merging the result back to the original DataFrame.
    """)
    return


@app.cell
def _(df, pl):
    # Add each department's average salary as a column
    df.with_columns(
        pl.col("salary").mean().over("departm").alias("dept_avg_salary"),
    ).head(10)
    return


@app.cell
def _(df, pl):
    # Each person's salary as a percentage of their department's mean
    df_pct = df.with_columns(
        (pl.col("salary") / pl.col("salary").mean().over("departm") * 100)
        .round(1)
        .alias("pct_of_dept_mean"),
    )
    df_pct.sort("pct_of_dept_mean", descending=True).head(10)
    return


@app.cell
def _(df, pl):
    # Rank salary within each department
    df.with_columns(
        pl.col("salary").rank(descending=True).over("departm").alias("dept_salary_rank"),
    ).sort("departm", "dept_salary_rank").head(15)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Reshaping Data

    Polars provides `unpivot()` to go from wide to long format and `pivot()` to go from long to wide format. These are essential when preparing data for visualization or statistical modeling.
    """)
    return


@app.cell
def _(pl):
    # Create a wide-format example
    df_wide = pl.DataFrame({
        "department": ["bio", "chem", "neuro"],
        "q1_budget": [100, 200, 150],
        "q2_budget": [110, 190, 160],
        "q3_budget": [105, 210, 155],
    })
    df_wide
    return (df_wide,)


@app.cell
def _(df_wide):
    # Unpivot (wide to long): melt the quarterly columns into rows
    df_long = df_wide.unpivot(
        index="department",
        on=["q1_budget", "q2_budget", "q3_budget"],
        variable_name="quarter",
        value_name="budget",
    )
    df_long
    return (df_long,)


@app.cell
def _(df_long):
    # Pivot (long to wide): spread the quarter values back into columns
    df_long.pivot(
        on="quarter",
        index="department",
        values="budget",
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Lazy Evaluation

    One of Polars' most powerful features is **lazy evaluation**. Instead of executing each operation immediately, a LazyFrame records the operations as a query plan. Polars then optimizes this plan before execution, which can dramatically improve performance.

    Key optimizations that Polars applies automatically:

    - **Predicate pushdown**: Filters are pushed as early as possible, reducing the amount of data processed.
    - **Projection pushdown**: Only the columns you actually need are loaded from disk.
    - **Common subexpression elimination**: Repeated computations are calculated once.
    - **Parallel execution**: Independent operations run on multiple CPU cores.

    To use lazy mode, start with `pl.scan_csv()` instead of `pl.read_csv()`, or convert an existing DataFrame with `.lazy()`. When you are ready to execute, call `.collect()`.
    """)
    return


@app.cell
def _(pl, url):
    # Create a LazyFrame by scanning the CSV
    lf = pl.scan_csv(url)
    print(type(lf))
    lf
    return (lf,)


@app.cell
def _(lf, pl):
    # Build a query plan without executing it
    query = (
        lf
        .filter(pl.col("salary") > 60000)
        .group_by("departm")
        .agg(
            pl.col("salary").mean().alias("avg_salary"),
            pl.len().alias("count"),
        )
        .sort("avg_salary", descending=True)
    )

    # View the optimized query plan
    print(query.explain())
    return (query,)


@app.cell
def _(query):
    # Execute the query plan and get a DataFrame
    query.collect()
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    You can also convert an existing DataFrame to a LazyFrame with `.lazy()` and back with `.collect()`. This is useful when you want to chain many operations and let Polars optimize them as a batch.
    """)
    return


@app.cell
def _(df, pl):
    # Convert eager DataFrame to lazy, apply operations, then collect
    result = (
        df.lazy()
        .with_columns(
            (pl.col("salary") / 1000).alias("salary_k"),
        )
        .filter(pl.col("salary_k") > 70)
        .select("departm", "salary_k", "publications")
        .collect()
    )
    result
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Exercises

    Try these exercises to practice what you have learned. Each one uses the salary dataset.
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    **Exercise 1**: Filter the salary data to only include rows where the `departm` is `"neuro"` and salary is above 80,000. How many rows match? What is the average salary of this subset?
    """)
    return


@app.cell
def _():
    # Your code here
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    **Exercise 2**: Group the data by `departm` and `gender`. For each group, calculate the mean salary and the number of people. Sort the result by mean salary in descending order.
    """)
    return


@app.cell
def _():
    # Your code here
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    **Exercise 3**: Using `over()`, create a new column called `pct_of_dept_mean` that shows each person's salary as a percentage of their department's mean salary. Sort by this column in descending order. Who has the highest relative salary compared to their department?
    """)
    return


@app.cell
def _():
    # Your code here
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Summary

    In this tutorial, we covered the core concepts of Polars:

    - **Series and DataFrames**: Polars' fundamental data structures with strict typing and no row index.
    - **Loading and inspecting data**: Reading CSVs, checking shapes, schemas, and null counts.
    - **Missing values**: Using `null_count()`, `drop_nulls()`, and `fill_null()`.
    - **The expression API**: Building computations with `pl.col()`, chaining operations, and using `when/then/otherwise`.
    - **Creating columns**: Adding new columns immutably with `with_columns()` and `.alias()`.
    - **Selecting and filtering**: Choosing columns with `select()` and rows with `filter()`.
    - **Operations**: Arithmetic, string methods, type casting, and `map_elements()`.
    - **Joins**: Combining DataFrames with `join()` and `concat()`.
    - **Grouping and aggregation**: Using `group_by().agg()` for summary statistics.
    - **Window functions**: Computing group-level values without collapsing rows using `over()`.
    - **Reshaping**: Converting between wide and long formats with `unpivot()` and `pivot()`.
    - **Lazy evaluation**: Building optimized query plans with `scan_csv()`, `.lazy()`, and `.collect()`.

    Polars' combination of speed, expressiveness, and lazy evaluation makes it an excellent tool for data analysis in Python. As your datasets grow larger and your queries more complex, these features will become increasingly valuable.
    """)
    return


if __name__ == "__main__":
    app.run()