{ "cells": [ { "cell_type": "markdown", "id": "9d06f7dd", "metadata": {}, "source": [ "# Similarity search\n", "\n", "Embeddings are dense fixed-length vectors, so cosine distance is a natural proxy for light-curve\n", "similarity. The example below reads a ZTF DR23 [HATS](https://hats.readthedocs.io) pixel directly\n", "from the public S3 bucket using\n", "[`nested-pandas`](https://nested-pandas.readthedocs.io), `universal-pathlib`, and `s3fs`\n", "(install them with `pip install nested-pandas universal-pathlib s3fs`), embeds all well-observed light curves with\n", "Astromer2, and finds the closest neighbour to a given object by cosine distance.\n", "\n", "This pixel contains ~50 k objects passing the quality cuts; embedding takes roughly\n", "12 minutes on an Apple M2 Pro (~15 ms per object)." ] }, { "cell_type": "code", "execution_count": 1, "id": "pip-install-docs-embed-pre-executed-similarity_search-ipynb", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T21:00:26.124397Z", "iopub.status.busy": "2026-06-09T21:00:26.124195Z", "iopub.status.idle": "2026-06-09T21:00:26.129248Z", "shell.execute_reply": "2026-06-09T21:00:26.127644Z" } }, "outputs": [], "source": [ "# %pip install light-curve huggingface_hub onnxruntime nested-pandas universal-pathlib s3fs scipy" ] }, { "cell_type": "code", "execution_count": null, "id": "a5454779-imports", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T21:00:26.131870Z", "iopub.status.busy": "2026-06-09T21:00:26.131674Z", "iopub.status.idle": "2026-06-09T21:00:27.290041Z", "shell.execute_reply": "2026-06-09T21:00:27.288806Z" } }, "outputs": [], "source": "import nested_pandas as npd\nimport numpy as np\nfrom scipy.spatial.distance import cdist\nfrom upath import UPath\n\nfrom light_curve.embed import Astromer2\n\nTARGET_OID = 680213300009232  # ZTF r-band light curve\nMIN_OBS = 1000" }, { "cell_type": "markdown", "id": "step1-load-md", "metadata": {}, "source": [ "## Step 1 — Load data\n", "\n", "Read one HATS pixel of ZTF DR23 directly from the public S3 
bucket, discard flagged detections (keep `catflags == 0`), and retain only\n", "objects with at least 1 000 clean observations." ] }, { "cell_type": "code", "execution_count": 3, "id": "a5454779-load", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T21:00:27.295557Z", "iopub.status.busy": "2026-06-09T21:00:27.295154Z", "iopub.status.idle": "2026-06-09T21:01:26.802982Z", "shell.execute_reply": "2026-06-09T21:01:26.801047Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Objects after quality cuts: 49,166\n" ] } ], "source": [ "nf = npd.read_parquet(\n", "    UPath(\n", "        \"s3://ipac-irsa-ztf/contributed/dr23/lc/hats/ztf_dr23_lc-hats\"\n", "        \"/dataset/Norder=5/Dir=0/Npix=2378/\",\n", "        anon=True,\n", "    )\n", ")\n", "# Drop flagged detections first, then keep objects with at least MIN_OBS points left\n", "nf = nf.query(\"lightcurve.catflags == 0\").query(f\"lightcurve.list_lengths >= {MIN_OBS}\")\n", "print(f\"Objects after quality cuts: {len(nf):,}\")" ] }, { "cell_type": "markdown", "id": "step2-embed-md", "metadata": {}, "source": [ "## Step 2 — Embed with Astromer2\n", "\n", "Load the pretrained model and run it over all light curves with `map_rows`.\n", "Each call returns a 256-dim vector; the results are stored as a nested column\n", "`embedding.value` and then stacked into a matrix for distance computation." 
] }, { "cell_type": "code", "execution_count": 4, "id": "a5454779-embed", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T21:01:26.807749Z", "iopub.status.busy": "2026-06-09T21:01:26.807329Z", "iopub.status.idle": "2026-06-09T22:02:51.157173Z", "shell.execute_reply": "2026-06-09T22:02:51.154827Z" } }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Embedded 49,166 objects.\n" ] } ], "source": "model = Astromer2.from_hf(output=\"mean\", reduction=\"beginning\")\n\n\ndef embed_row(hmjd, mag):\n    # squeeze() drops length-1 axes so each row stores a flat vector\n    return {\"embedding.value\": model(hmjd, mag).squeeze()}\n\n\nnf = nf.map_rows(\n    embed_row,\n    columns=[\"lightcurve.hmjd\", \"lightcurve.mag\"],\n    row_container=\"args\",\n    append_columns=True,\n)\nprint(f\"Embedded {len(nf):,} objects.\")" }, { "cell_type": "markdown", "id": "step3-search-md", "metadata": {}, "source": [ "## Step 3 — Find nearest neighbour\n", "\n", "Stack the embeddings into a matrix and compute cosine distances from the query object\n", "to all others. Cosine distance is scale-invariant: only the direction of the embedding\n", "vector matters, not its magnitude." 
] }, { "cell_type": "code", "execution_count": 5, "id": "a5454779-search", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T22:02:51.160418Z", "iopub.status.busy": "2026-06-09T22:02:51.160187Z", "iopub.status.idle": "2026-06-09T22:02:51.309298Z", "shell.execute_reply": "2026-06-09T22:02:51.307817Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Query: OID 680213300009232\n", "Nearest neighbour: OID 680113300005170, cosine distance 0.000046\n" ] } ], "source": [ "oids = nf[\"objectid\"].to_numpy()\n", "matrix = np.asarray(nf[\"embedding.value\"]).reshape(len(nf), -1)\n", "\n", "query_idx = np.where(oids == TARGET_OID)[0][0]\n", "distances = cdist(matrix[query_idx : query_idx + 1], matrix, metric=\"cosine\")[0]\n", "distances[query_idx] = np.inf # exclude the query itself\n", "\n", "best_idx = np.argmin(distances)\n", "best_oid = oids[best_idx]\n", "print(f\"Query: OID {TARGET_OID}\")\n", "print(f\"Nearest neighbour: OID {best_oid}, cosine distance {distances[best_idx]:.6f}\")\n", "# Nearest neighbour: OID 680113300005170, cosine distance 0.000046\n", "\n", "assert best_oid == 680113300005170\n", "assert distances[best_idx] < 0.001" ] }, { "cell_type": "markdown", "id": "310ad849", "metadata": {}, "source": [ "The nearest neighbour is OID `680113300005170` — the same physical object\n", "([HZ Her / Her X-1](https://en.wikipedia.org/wiki/Hercules_X-1), an X-ray binary) observed in the\n", "*g*-band, recovered automatically from an *r*-band query through embedding similarity.\n", "\n", "See both objects on SNAD Viewer:\n", "[query (r-band)](https://ztf.snad.space/dr23/view/680213300009232) ·\n", "[nearest neighbour (g-band)](https://ztf.snad.space/dr23/view/680113300005170)" ] }, { "cell_type": "markdown", "id": "9ac81d4f", "metadata": {}, "source": [ "## Next steps\n", "\n", "- [Classification](classification.ipynb) — train a classifier on top of embeddings\n", "- [onnxruntime tips](../onnxruntime.md) — thread control on 
shared HPC nodes, GPU/CUDA setup\n", "- [API reference](../api.md)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.14.3" } }, "nbformat": 4, "nbformat_minor": 5 }