{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Batch processing\n", "\n", "This tutorial covers the `.many()` method for efficient bulk feature extraction:\n", "\n", "- Plain Python lists of `(t, m, sigma)` tuples\n", "- [nested-pandas](https://nested-pandas.readthedocs.io) with real ZTF survey data\n", "- [PyArrow](https://arrow.apache.org/docs/python/) `List<Struct>` arrays\n", "- [Polars](https://docs.pola.rs) Series\n", "\n", "All Arrow-compatible inputs avoid Python-level iteration and pass data to Rust with zero copies." ] }, { "cell_type": "code", "execution_count": null, "id": "1", "metadata": {}, "outputs": [], "source": "# %pip install light-curve" }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "## Plain list of tuples\n", "\n", "`.many()` accepts a list of `(t, m, sigma)` tuples and returns a 2-D NumPy array of shape\n", "`(N, n_features)`. Multi-threading is enabled by default via the `n_jobs` parameter:" ] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "import light_curve as licu\n", "import numpy as np\n", "\n", "rng = np.random.default_rng(0)\n", "light_curves = [\n", "    (np.sort(rng.random(50)), rng.random(50), rng.random(50) * 0.1)\n", "    for _ in range(1000)\n", "]\n", "\n", "results = licu.Amplitude().many(light_curves)\n", "print(f'Extracted from {len(light_curves)} light curves: shape = {results.shape}')\n", "print(f'Mean amplitude = {results.mean():.4f} mag')" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": "## nested-pandas with ZTF survey data\n\n[nested-pandas](https://nested-pandas.readthedocs.io) extends pandas with nested Arrow column\nsupport, useful for catalog data such as ZTF or Rubin LSST."
}, { "cell_type": "code", "execution_count": null, "id": "5", "metadata": {}, "outputs": [], "source": "# %pip install light-curve nested-pandas s3fs universal-pathlib" }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "import light_curve as licu\n", "import nested_pandas as npd\n", "import numpy as np\n", "import pyarrow as pa\n", "from upath import UPath\n", "\n", "s3_path = UPath(\n", "    \"s3://ipac-irsa-ztf/contributed/dr23/lc/hats/ztf_dr23_lc-hats/dataset/Norder=6/Dir=30000/Npix=34623/part0.snappy.parquet\",\n", "    anon=True,\n", ")\n", "ndf = npd.read_parquet(\n", "    s3_path,\n", "    columns=[\"objectid\", \"lightcurve.hmjd\", \"lightcurve.mag\", \"lightcurve.magerr\"],\n", ")\n", "\n", "ndf = ndf.loc[ndf[\"lightcurve\"].list_lengths > 10]\n", "\n", "ndf[\"lightcurve.t\"] = np.asarray(ndf[\"lightcurve.hmjd\"] - 58000, dtype=np.float32)\n", "\n", "feature = licu.Extractor(licu.Chi2Pvar(), licu.InterPercentileRange(quantile=0.25), licu.LinearFit())\n", "result = feature.many(pa.array(ndf[\"lightcurve\"]), n_jobs=-1,\n", "                      arrow_fields={\"t\": \"t\", \"m\": \"mag\", \"sigma\": \"magerr\"})\n", "\n", "ndf = ndf.assign(**dict(zip(feature.names, result.T)))\n", "ndf.head()" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": "## PyArrow\n\n[PyArrow](https://arrow.apache.org/docs/python/) is the reference Python implementation of Apache Arrow.\nPass a `List<Struct<t, m, band>>` array directly to `.many()` for multiband extraction without sigma."
}, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": "# %pip install light-curve pyarrow" }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": "import light_curve as licu\nimport numpy as np\nimport pyarrow as pa\n\nBANDS = [\"g\", \"r\"]\nrng = np.random.default_rng(42)\nn_lc, n_per_band = 200, 40\n\nstruct_type = pa.struct([\n    (\"t\", pa.float64()),\n    (\"m\", pa.float64()),\n    (\"band\", pa.string()),\n])\n\n\ndef make_lc():\n    rows = []\n    for b in BANDS:\n        t = rng.uniform(0, 100, n_per_band)\n        m = rng.normal(15.0 if b == \"g\" else 15.3, 0.3, n_per_band)\n        rows.extend({\"t\": float(ti), \"m\": float(mi), \"band\": b} for ti, mi in zip(t, m))\n    rows.sort(key=lambda r: r[\"t\"])\n    return rows\n\n\nlcs_arrow = pa.array([make_lc() for _ in range(n_lc)], type=pa.list_(struct_type))\n\nfeature = licu.Extractor(\n    licu.InterPercentileRange(quantile=0.1, bands=BANDS),  # robust amplitude per band\n    licu.AndersonDarlingNormal(bands=BANDS),  # normality test per band\n    licu.ColorOfMaximum(BANDS),  # colour at brightness peak\n    licu.ColorOfMinimum(BANDS),  # colour at brightness trough\n)\nresult = feature.many(\n    lcs_arrow,\n    sorted=True,\n    arrow_fields={\"t\": \"t\", \"m\": \"m\", \"band\": \"band\"},\n)\nprint(f\"shape: {result.shape}\")  # (200, 6)\nprint(\"names:\", feature.names)" }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": "## Polars\n\n[Polars](https://docs.pola.rs) is a fast DataFrame library built on Arrow.\nGroup a flat multiband DataFrame by object and pass the nested Series to `.many()`."
}, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": "# %pip install light-curve polars" }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "import light_curve as licu\n", "import numpy as np\n", "import polars as pl\n", "\n", "BANDS = [\"g\", \"r\"]\n", "rng = np.random.default_rng(42)\n", "n_obj, n_per_band = 200, 40\n", "\n", "object_id = np.repeat(np.arange(n_obj), n_per_band * len(BANDS))\n", "band_col = np.tile(np.repeat(BANDS, n_per_band), n_obj)\n", "t = np.sort(rng.uniform(0, 100, n_obj * n_per_band * len(BANDS)))\n", "m = rng.normal(15.0, 0.3, len(object_id))\n", "sigma = rng.uniform(0.01, 0.1, len(object_id))\n", "\n", "df = pl.DataFrame({\"object_id\": object_id, \"band\": band_col, \"t\": t, \"m\": m, \"sigma\": sigma})\n", "nested = df.group_by(\"object_id\").agg(pl.struct(\"t\", \"m\", \"sigma\", \"band\").alias(\"lc\"))\n", "\n", "feature = licu.Extractor(\n", "    licu.ExcessVariance(bands=BANDS),  # variability excess over noise per band\n", "    licu.StetsonK(bands=BANDS),  # variability index per band\n", "    licu.BeyondNStd(nstd=1.5, bands=BANDS),  # outlier fraction per band\n", "    licu.ColorOfMedian(BANDS),  # colour at median brightness\n", "    licu.ColorSpread(BANDS),  # std dev of per-band means\n", ")\n", "result = feature.many(\n", "    nested[\"lc\"],\n", "    arrow_fields={\"t\": \"t\", \"m\": \"m\", \"sigma\": \"sigma\", \"band\": \"band\"},\n", ")\n", "nested = nested.with_columns(\n", "    [pl.Series(name, result[:, i]) for i, name in enumerate(feature.names)]\n", ")\n", "nested.select([\"object_id\"] + feature.names)\n" ] }, { "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "## Next steps\n", "\n", "- [Feature basics tutorial](basics.ipynb): single features, Extractor, multiband intro\n", "- [Multiband tutorial](../multiband/): per-band and cross-band features\n", "- [Periodogram tutorial](../periodogram/): Lomb–Scargle and period search\n", "- [API reference](../api/): full signatures and equations" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.13" } }, "nbformat": 4, "nbformat_minor": 5 }