{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Roman Microlensing Data Challenge 2026: Workflow\n", "***\n", "## Learning Goals\n", "\n", "By the end of this tutorial, you, the participant, will be able to:\n", "\n", "- Load official Data Challenge light curve data from cloud storage.\n", "- Install required software in the Roman Research Nexus (RRN).\n", "- Initialize a submission project using the `microlens-submit` tool.\n", "- Perform a microlensing model fit using `MulensModel`.\n", "- Package your model results, plots, and notes into a standardized solution file.\n", "- Validate and export your final submission for evaluation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "This notebook provides a complete, end-to-end workflow for submitting an entry to the Roman Microlensing Data Challenge 2026 (RMDC26). Following these steps is mandatory for a valid submission.\n", "\n", "The process involves accessing data directly from the cloud, performing your analysis within the RRN, and using the `microlens-submit` package to standardize your results. This helps ensure a level playing field and allows the evaluation committee to process all submissions efficiently.\n", "\n", "### Important Links\n", "- **microlens-submit documentation:** [Read the Docs](https://microlens-submit.readthedocs.io/en/latest/)\n", "- **RRN teams & servers:** For information on setting up a collaborative team server, please see the teams page\n", " [Working on a Team](../../../../markdown/teams.md)\n", "- **RRN software guide:** [Installing Extra Software](../../../../markdown/software.md)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. The RGES mamba environment and JupyterLab kernel\n", "\n", "The mamba environment built to run these notebooks is named `rges-pit-dc`. It is activated by default in terminal sessions. 
In notebooks, selecting the `RGES PIT Nexus` kernel runs your code in the `rges-pit-dc` mamba environment." ] }, { "cell_type": "markdown", "metadata": { "tags": [ "nexus-only" ] }, "source": [ "\n", "> ### Running on the Roman Research Nexus\n", "> If you are following this notebook on the Roman Research Nexus (RRN), the required packages are preinstalled in the RGES PIT kernel:\n", "> 1. In notebooks, open the **Kernel > Change Kernel** menu and select `RGES PIT Nexus`, or click the kernel status display (the activity circle at the top right) and select it there.\n", "> 1. In a terminal, if needed, run `mamba activate rges-pit-dc` to select the correct mamba environment.\n", "> 1. Store any files you generate in your home directory or team space; the `/roman/nexus-notebooks/notebooks` directory is read-only." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "> ### Running outside the Roman Research Nexus\n", "> If you are running this notebook on Google Colab, locally, or in any environment where the `RGES PIT Nexus` kernel is not pre-configured, run the cell below to install the required packages." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# LOCAL ENVIRONMENT / KERNEL ISSUES\n", "# Uncomment and run if you are NOT on the Roman Research Nexus or are having issues with the kernel. \n", "# This will install the necessary packages in your local environment. If you are on the Nexus, these \n", "# packages should already be available and you can ignore this step.\n", "\n", "#%pip install --quiet MulensModel microlens_submit emcee corner matplotlib numpy pandas pyarrow huggingface_hub pyLIMA\n", "\n", "# Run one of these commands to update microlens-submit in the active kernel. 
You will need to restart \n", "# the kernel after running this command for the changes to take effect.\n", "\n", "# Conda:\n", "#%conda update -c conda-forge microlens_submit\n", "# Pip:\n", "%pip install -U microlens_submit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Imports\n", "\n", "Now we import all the libraries we'll need for this workflow." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# system imports\n", "import os\n", "import time\n", "import sys\n", "from pathlib import Path\n", "\n", "# data access imports\n", "from huggingface_hub import hf_hub_download\n", "import pandas as pd\n", "\n", "# display imports\n", "from IPython.display import HTML\n", "\n", "# multiprocessing imports\n", "import multiprocessing as mp\n", "from multiprocessing.pool import ThreadPool\n", "\n", "# data challenge imports\n", "import microlens_submit\n", "import MulensModel\n", "import emcee\n", "from pyLIMA.pyLIMASS import SourceLensProbabilities\n", "\n", "# data analysis/visualization imports\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Data Access\n", "\n", "We will now load a light curve directly from the challenge's public directory. You do not need to download anything for processing on the Nexus; the data is local.\n", "\n", "> Future data releases from the Roman Space Telescope will likely be streamed from S3 buckets. You can use `s3fs` to stream the data directly into memory, as shown in the `data_discovery_and_access.md` guide." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "TIER = \"test\"  # one of \"test\", \"Experienced\", \"Beginner\"\n", "EVENT_ID = \"data_challenge_0_129_335\"  # we didn't have a canonical naming system yet for the test set\n", "FILENAME = f\"{EVENT_ID}.det.lc\"  # AAS workshop test event light curve file name\n", "\n", "# Example for Beginner Tier data access\n", "#TIER = \"Beginner\"\n", "#EVENT_ID = \"RMDC26_000001\"  # example event ID for the Experienced and Beginner Tiers.\n", "# You will need to change this to the event you want to work with.\n", "# Valid event IDs are from RMDC26_000001 to RMDC26_000188 for the Beginner Tier." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Alternate data access for local / non-Nexus use\n", "\n", "# Uncomment the following lines to test a different tier locally:\n", "#TIER = \"Beginner\"\n", "#EVENT_ID = \"RMDC26_000001\"\n", "\n", "REPO_ID = f\"RGES-PIT/{TIER}\"\n", "\n", "if TIER == \"test\":\n", "    local_data_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type=\"dataset\")\n", "    lc_data = pd.read_csv(local_data_path, sep=r'\\s+', comment='#', header=0)\n", "elif TIER in [\"Beginner\", \"Experienced\"]:\n", "    FILENAME = f\"RMDC26_{TIER}_Tier_test.parquet\"  # \"_test\" tells Hugging Face this is not a machine learning training set.\n", "    local_data_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type=\"dataset\")\n", "    lc_data_all = pd.read_parquet(local_data_path)\n", "    if EVENT_ID not in lc_data_all[\"name\"].unique():\n", "        raise ValueError(f\"Event ID {EVENT_ID} not found in the Hugging Face {TIER} tier dataset.\")\n", "    lc_data = lc_data_all[lc_data_all[\"name\"] == EVENT_ID].copy()\n", "else:\n", "    raise ValueError(f\"Unsupported tier for alternate data access: {TIER}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# NEXUS-ONLY\n", "# This is 
the official, secure path to the data challenge files on the Nexus.\n", "# This cell runs after the alternate-access cell so that, on the Nexus, the local files take precedence.\n", "DATA_DIR = \"/data/data-challenge/rges\"\n", "\n", "if TIER == \"test\":\n", "    DATA_URI = f\"{DATA_DIR}/{FILENAME}\"  # includes the AAS test set as individual .lc files\n", "    # Load the data with header and band column\n", "    lc_data = pd.read_csv(DATA_URI, sep=r'\\s+', comment='#', header=0)\n", "elif TIER == \"Experienced\" or TIER == \"Beginner\":\n", "    DATA_URI = f\"{DATA_DIR}/RMDC26_{TIER}_Tier.parquet\"\n", "\n", "if TIER in [\"Experienced\", \"Beginner\"]:\n", "    if not os.path.exists(DATA_URI):\n", "        raise FileNotFoundError(f\"Data file {DATA_URI} not found. Data for the {TIER} Tier may not yet be released.\")\n", "    lc_data_all = pd.read_parquet(DATA_URI)  # this line requires the pyarrow package, which is preinstalled in the Nexus environment.\n", "    if EVENT_ID not in lc_data_all[\"name\"].unique():\n", "        raise ValueError(f\"Event ID {EVENT_ID} not found in the {TIER} Tier dataset.\")\n", "    lc_data = lc_data_all[lc_data_all[\"name\"] == EVENT_ID]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This cell standardizes the dataframe keys. 
You have our sincere apologies for any confusion this causes\n", "keys = {}\n", "\n", "if TIER in [\"Experienced\", \"Beginner\"]:\n", "    keys[\"time\"] = \"bjd\"\n", "    keys[\"flux\"] = None  # not provided in the Experienced and Beginner Tiers\n", "    keys[\"flux_err\"] = None  # not provided in the Experienced and Beginner Tiers\n", "    keys[\"band\"] = \"filt\"\n", "elif TIER == \"test\":\n", "    keys[\"time\"] = \"BJD\"\n", "    keys[\"flux\"] = \"measured_relative_flux\"\n", "    keys[\"flux_err\"] = \"measured_relative_flux_error\"\n", "    keys[\"band\"] = \"observatory_code\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll make a plot now to check that whichever data retrieval method you are using is working as expected." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get unique bands (`observatory_code` or `filt`)\n", "bands = np.sort(np.unique(lc_data[keys[\"band\"]].to_numpy()))\n", "print(f\"Bands found: {bands}\")\n", "\n", "# calculate magnitude\n", "def gulls_flux_to_mag(df, fs, Obssrcmag, bands, keys):\n", "    \"\"\"Calculate mag and mag_err for Gulls lightcurves (baseline-relative flux).\"\"\"\n", "    # gulls uses a relative flux such that baseline is always 1 and the flux-system zeropoint changes for each event\n", "    df = df.copy()\n", "\n", "    if keys[\"flux\"] is not None and keys[\"flux_err\"] is not None:  # only true for the \"test\" tier data\n", "        f = df[keys[\"flux\"]].to_numpy()\n", "        ferr = df[keys[\"flux_err\"]].to_numpy()\n", "        obs = df[keys[\"band\"]].to_numpy()\n", "\n", "        # Per-band zero-point (fs and Obssrcmag must be aligned with `bands` order)\n", "        m0_per_band = 2.5 * np.log10(fs) + Obssrcmag\n", "        m0_map = dict(zip(bands, m0_per_band))\n", "\n", "        # Map each row's band -> m0\n", "        m0 = np.vectorize(m0_map.get)(obs)\n", "\n", "        # Magnitude + error propagation\n", "        mag = m0 - 2.5 * np.log10(f)\n", "        mag_err = (2.5 / np.log(10.0)) * (ferr / f)\n", "\n", "        df[\"mag\"] = mag\n", "        df[\"mag_err\"] = mag_err\n", "    return df\n", "\n", "\n", "# from the data file comments we know that: #fs:\n", "fs = np.array([0.0858942, 0.991581, 0.000266992])\n", "Obssrcmag = np.array([23.6466, 24.2485, 23.7078])\n", "\n", "lc_data = gulls_flux_to_mag(lc_data, fs, Obssrcmag, bands, keys)\n", "\n", "# Plot each band separately\n", "plt.figure(figsize=(8, 5))\n", "for band in bands:\n", "    mask = lc_data[keys[\"band\"]] == band\n", "    plt.errorbar(\n", "        lc_data[keys[\"time\"]][mask], lc_data[\"mag\"][mask], yerr=lc_data[\"mag_err\"][mask],\n", "        fmt=\".\", label=f\"Band {band}\", alpha=0.7\n", "    )\n", "\n", "plt.gca().invert_yaxis()\n", "plt.title(f\"{EVENT_ID} Light Curve by Band\")\n", "plt.xlabel(\"BJD\")\n", "plt.ylabel(\"Magnitude\")\n", "plt.legend(title=\"Band\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Initialize Your Submission Project\n", "\n", "To initialize a submission object in Python, use `microlens_submit.load(project_path)`.\n", "If the directory at `project_path` does not exist, it will be created with the correct structure and a blank submission.\n", "You can then set the required metadata (like `team_name`, `tier`, etc.) on the returned `Submission` object, and call `.save()` to persist it." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define your project directory in your persistent team space\n", "#TEAM_NAME = \"The Transiting Poachers\"  # Nexus team name\n", "TEAM_NAME = \"rges-pit\"\n", "SUBMISSION_NAME = \"test_submission_1\"\n", "project_path = Path(f\"/teams/{TEAM_NAME}/{SUBMISSION_NAME}\")\n", "\n", "# To locate your submission project in the current working directory, use:\n", "# project_path = Path.cwd() / TEAM_NAME\n", "# To locate your submission project in your home directory, use:\n", "# project_path = Path.home() / TEAM_NAME" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The following lines are for automated testing and should be ignored by users. \n", "# They ensure that the project directory exists and has write access before running the workflow.\n", "\n", "# try writing to the project directory; if we can't, fall back to the home directory\n", "try:\n", "    # attempt to write to the project directory to check if we have access\n", "    test_file = project_path / \"test_write_access.txt\"\n", "    project_path.mkdir(parents=True, exist_ok=True)\n", "    with open(test_file, \"w\") as f:\n", "        f.write(\"Testing write access to project directory.\")\n", "except PermissionError:\n", "    # if we can't access it, change the root directory to the home directory and create the project directory\n", "    print(f\"Cannot access {project_path}. Changing to home directory and creating project directory.\")\n", "    project_path = Path.home() / TEAM_NAME / SUBMISSION_NAME\n", "\n", "    # if this fails, it will raise an error and stop the workflow, which is what we want for testing purposes.\n", "    test_file = project_path / \"test_write_access.txt\"\n", "    project_path.mkdir(parents=True, exist_ok=True)\n", "    with open(test_file, \"w\") as f:\n", "        f.write(\"Testing write access to project directory.\")\n", "\n", "# remove the test file after confirming write access\n", "print(f\"project path: {project_path}\")\n", "if test_file.exists():\n", "    test_file.unlink()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Now, create/load the project into our session\n", "submission = microlens_submit.load(project_path)  # creates the project directory if it doesn't exist\n", "submission.team_name = TEAM_NAME  # assuming your Nexus team name is the same as your data-challenge team name\n", "submission.tier = TIER\n", "print(f\"\\nProject for '{submission.team_name}' loaded successfully.\")\n", "\n", "# You can expect saving at this point to result in warnings about missing info\n", "submission.save()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the project directory already exists, `load()` loads the existing project rather than overwriting it. If the directory exists but contains no submission content, that content is created when you run `submission.save()`.\n", "\n", "Once the project is loaded and you have initialized your linked git repository, set the `repo_url` attribute on your `Submission`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# initialize your git repo\n", "#!git init\n", "#!git add .\n", "#!git commit -m \"Initial commit\"\n", "#!git branch -m main\n", "#!git remote add origin https://github.com/yourusername/your-repo.git\n", "#!git push -u origin main\n", "\n", "# set the repo url in the submission object\n", "submission.repo_url = \"https://github.com/yourusername/your-repo.git\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> You may leave this repository private during the data challenge, but we ask that you make it public once submissions close, to ensure it is accessible to evaluators." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to auto-populate your hardware info and are using the Nexus (like the CLI `nexus-init` command does), you can call the method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# add your hardware info\n", "submission.autofill_nexus_info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will attempt to detect and fill in hardware details if running in the Roman Science Platform environment, but you can always set or override any values manually.\n", "\n", "```python\n", "# add your hardware info\n", "submission.hardware_info = {\n", " \"cpu_model\": \"Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz\",\n", " \"num_cores\": 16,\n", " \"memory_gb\": 64,\n", " \"platform\": \"Linux-6.12.63-84.121.amzn2023.x86_64-x86_64-with-glibc2.39\",\n", " \"os\": \"Linux\",\n", " # ...any other relevant info\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A valid submission will include the attributes:\n", "* `team_name` to know who the submission belongs to\n", "* `tier` to validate event IDs\n", "* `repo_url` for reproducibility and evaluation\n", "* `hardware_info` for benchmarking purposes\n", "\n", "You can continue to edit your project 
without all of this information, but the project will not be \"valid\" and you will not be able to submit it. You can check the submission validity of your entire project at any time using `submission.run_validation()`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> You can override the hardware info for an individual solution if you use different machines or servers while working on the data challenge.\n", ">\n", "> e.g.\n", "> ```python\n", "> solution.autofill_hardware_info()\n", "> # or\n", "> solution.hardware_info = {\n", ">     \"cpu_details\": \"Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz\",\n", ">     \"platform\": \"Linux-6.12.63-84.121.amzn2023.x86_64-x86_64-with-glibc2.39\",\n", ">     \"memory_gb\": 15.34,\n", ">     \"nexus_image\": \"...amazonaws.com/roman:rges-nexus\",\n", ">     \"os\": \"linux\",\n", "> }\n", "> submission.save()\n", "> ```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Alternative: Command Line Interface (CLI)\n", "\n", "If you prefer using the command line or want to automate parts of your workflow, you can accomplish the same tasks using the `microlens-submit` CLI. The CLI is particularly useful for batch processing and automation.\n", "\n", "### Nexus-Specific CLI Commands\n", "\n", "The following commands are specifically designed for the Roman Research Nexus environment:\n", "\n", "#### 1. Initialize Project with Nexus Hardware Info\n", "\n", "The `nexus-init` command automatically detects and records your Nexus environment details:\n", "\n", "```bash\n", "# Initialize project with automatic Nexus hardware detection\n", "microlens-submit nexus-init --team-name \"Your Team\" --tier \"2018-test\" ./my_dc2_submission\n", "\n", "# This command will automatically detect:\n", "# - CPU model from /proc/cpuinfo\n", "# - Memory from /proc/meminfo\n", "# - Nexus image from JUPYTER_IMAGE_SPEC\n", "# - Platform information\n", "```\n", "\n", "#### 2. 
Set Hardware Information Manually\n", "\n", "You can also set hardware information manually using the `set-hardware-info` command:\n", "\n", "```bash\n", "# Set CPU information\n", "microlens-submit set-hardware-info --cpu \"Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz\" \\\n", " --memory-gb 64 --platform \"Roman Research Nexus\" \\\n", " --nexus-image \"roman-dc2:latest\"\n", "\n", "# Or set individual components\n", "microlens-submit set-hardware-info --cpu-details \"16-core Intel Xeon\" \\\n", " --memory-gb 64 --platform \"Roman Science Platform\"\n", "```\n", "\n", "### Complete CLI Workflow Example\n", "\n", "Here's how to accomplish the same tasks using the command-line interface:\n", "\n", "```bash\n", "# 1. Initialize project with Nexus hardware info\n", "microlens-submit nexus-init --team-name \"The Transiting Poachers\" --tier \"2018-test\" ./my_dc2_submission\n", "\n", "# 2. Change into the project directory\n", "# (or pass ./my_dc2_submission as the final argument to each later command)\n", "cd ./my_dc2_submission\n", "\n", "# 3. Add a solution with parameters\n", "microlens-submit add-solution 2018-EVENT-001 1S1L \\\n", " --param t0=2459123.5 --param u0=0.15 --param tE=20.5 \\\n", " --log-likelihood -1234.56 --n-data-points 1250 \\\n", " --cpu-hours 15.2 --wall-time-hours 3.8 \\\n", " --lightcurve-plot-path plots/2018-EVENT-001_lc.png \\\n", " --lens-plane-plot-path plots/2018-EVENT-001_lens_geometry.png \\\n", " --notes \"Initial PSPL fit using MulensModel and emcee\"\n", "\n", "# 4. Add a more complex solution\n", "microlens-submit add-solution 2018-EVENT-001 1S2L \\\n", " --param t0=2459123.5 --param u0=0.12 --param tE=22.1 \\\n", " --param q=0.001 --param s=1.15 --param alpha=45.2 \\\n", " --log-likelihood -1189.34 --n-data-points 1250 \\\n", " --cpu-hours 28.5 --wall-time-hours 7.2 \\\n", " --alias \"planetary_fit\" \\\n", " --notes \"Binary lens fit with planetary companion\"\n", "\n", "# 5. 
Compare solutions\n", "microlens-submit compare-solutions 2018-EVENT-001\n", "\n", "# 6. Validate your submission\n", "microlens-submit validate-submission\n", "\n", "# 7. Generate dossier for review\n", "microlens-submit generate-dossier\n", "\n", "# 8. Export final submission\n", "microlens-submit export final_submission.zip\n", "```\n", "\n", "### CLI vs Python API\n", "\n", "**When to use CLI:**\n", "- Batch processing multiple events\n", "- Automation and scripting\n", "- Quick parameter updates\n", "- Command-line workflows\n", "- Integration with other tools\n", "\n", "**When to use Python API:**\n", "- Interactive analysis in Jupyter notebooks\n", "- Complex data processing\n", "- Custom validation logic\n", "- Integration with scientific Python ecosystem\n", "\n", "### Additional CLI Resources\n", "\n", "For a comprehensive CLI tutorial, see the [Command Line Tutorial](x_cli_tutorial.ipynb) here or on the [documentation site](https://microlens-submit.readthedocs.io/en/latest/tutorial.html).\n", "\n", "For detailed API reference, see the [API Documentation](https://microlens-submit.readthedocs.io/en/latest/api.html)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# save the project\n", "submission.save()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a project directory of the following structure:\n", "\n", "```\n", " /\n", " ├── submission.json\n", " ├── aliases.json # optional alias lookup table for humans\n", " └── events/\n", "```\n", "\n", "As you add events, solutions, notes, and figures they will be populated inside this project as follows:\n", "\n", "```\n", " /\n", " ├── dossier/\n", " ├ ├── index.html\n", " ├ ├── .html\n", " ├ ├── .html \n", " ├ ├── full_dossier_report.html \n", " ├ └── assets/\n", " ├── submission.json\n", " ├── aliases.json # maps -> \n", " └── events/\n", " └── /\n", " ├── event.json\n", " └── solutions/\n", " ├── # your generated lightcurve plots (optional)\n", " ├── # your generated lens-plane diagram (optional)\n", " ├── .csv # your generated posteriors (optional)\n", " ├── .md # your notes (optional)\n", " └── .json\n", "```\n", "\n", "> `aliases.json` lives alongside `submission.json` and maps ` ` to the underlying solution UUID. Aliases are optional, but much easier to work with than raw UUIDs when you are navigating the project by hand.\n", ">\n", "> A note for later: We recommend you save generated lightcurve plots, lens-plane diagrams, and notes according to this format, but the submission tool is flexible to all save locations, so long as you include the relative path to your in the .json (accessed through the microlens_submit tool or manually). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Microlensing Model Fitting\n", "\n", "This is where the science happens. Define your `MulensModel` and use your preferred fitting algorithm (e.g., MCMC) to find the best-fit parameters for the event data. \n", "\n", "### 5.1. 
Define the Model\n", "As an example, we are going to do a preliminary fit using only one band and a point-source point-lens (PSPL) model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 5.1. Define the model (e.g., Point Source Point Lens)\n", "# Use only the first band for this preliminary fit\n", "first_band = bands[0]\n", "band_mask = lc_data[keys[\"band\"]] == first_band\n", "band_data = lc_data[band_mask]\n", "\n", "# Create MulensModel data object for the first band\n", "my_data = MulensModel.MulensData(\n", "    data_list=[band_data[keys[\"time\"]].to_numpy(),\n", "               band_data['mag'].to_numpy(),\n", "               band_data['mag_err'].to_numpy()],\n", "    phot_fmt='mag',\n", "    bandpass='H'  # randomly chosen - not indicative of the true band color\n", ")\n", "\n", "# Initial model parameters (no parallax for preliminary fit)\n", "# t_0 starts at the brightest point (minimum magnitude). Use positional indexing (.iloc):\n", "# the filtered frame keeps its original row labels, so label-based lookup would pick the wrong row.\n", "initial_params = {\n", "    't_0': float(band_data[keys[\"time\"]].iloc[band_data['mag'].to_numpy().argmin()]),\n", "    'u_0': 0.1,\n", "    't_E': 25.,\n", "}\n", "\n", "pspl_model = MulensModel.Model(initial_params)\n", "\n", "# Create the event object\n", "event_object = MulensModel.Event(datasets=my_data, model=pspl_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.2. Setting up the Fitting Procedure\n", "\n", "We will use `emcee` to drive this fit, so we can demonstrate more attributes of the submission object, like the `posterior_path` and `cpu_time` vs `wall_time`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 5.2. 
Define the likelihood function for emcee\n", "def log_likelihood(theta, event_object, parameters_to_fit):\n", "    \"\"\"Log likelihood function for emcee\"\"\"\n", "    try:\n", "        # Update model parameters\n", "        for i, param_name in enumerate(parameters_to_fit):\n", "            if param_name == 'log_rho':\n", "                # Convert log_rho back to rho\n", "                event_object.model.parameters.rho = 10**theta[i]\n", "            else:\n", "                setattr(event_object.model.parameters, param_name, theta[i])\n", "\n", "        # Fit the fluxes given the current model parameters\n", "        event_object.fit_fluxes()\n", "\n", "        # Get the source and blend fluxes\n", "        ([F_S], F_B) = event_object.get_flux_for_dataset(event_object.datasets[0])\n", "\n", "        # Calculate chi-squared\n", "        chi2 = event_object.get_chi2()\n", "\n", "        # Add flux priors if needed (optional)\n", "        penalty = 0.0\n", "        if F_B <= 0:\n", "            penalty = ((F_B / 100)**2)  # Penalize negative blend flux\n", "        if F_S <= 0 or (F_S + F_B) <= 0:\n", "            return -np.inf  # Reject non-physical fluxes\n", "\n", "        return -0.5 * chi2 - penalty\n", "    except Exception:\n", "        # Treat any model evaluation failure as zero probability\n", "        return -np.inf\n", "\n", "# Set up emcee\n", "nwalkers = 32\n", "ndim = 3  # t_0, u_0, t_E\n", "nsteps = 400\n", "burnin = 200" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.3. Performing the Fit\n", "\n", "We will use multithreading with `ThreadPool` from the `multiprocessing` library for this fit, because it is more compatible with notebooks. However, it caches the state of your notebook and can cause unexpected behavior when you edit code related to your fit and re-run. For longer fits, it is a better idea to use a script with `multiprocessing.Pool`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 5.3. 
Run MCMC fitting\n", "print(\"Starting MCMC fit...\")\n", "start_time = time.time()\n", "start_cpu = time.process_time()\n", "\n", "# Set up parameters to fit\n", "parameters_to_fit = [\"t_0\", \"u_0\", \"t_E\"]\n", "ndim = len(parameters_to_fit)\n", "\n", "# Initial positions (slightly perturbed from initial guess)\n", "pos = np.array([\n", "    initial_params['t_0'], initial_params['u_0'], initial_params['t_E']\n", "]) + 1e-4 * np.random.randn(nwalkers, ndim)\n", "\n", "# Use ThreadPool instead of Pool for notebooks\n", "n_cores = mp.cpu_count()\n", "print(f\"Using {n_cores} threads for MCMC\")\n", "\n", "# Create the thread pool\n", "with ThreadPool(processes=n_cores) as pool:\n", "    sampler = emcee.EnsembleSampler(\n", "        nwalkers, ndim, log_likelihood, \n", "        args=(event_object, parameters_to_fit),\n", "        pool=pool\n", "    )\n", "    sampler.run_mcmc(pos, nsteps, progress=True)\n", "\n", "# Calculate timing\n", "wall_time_hours = (time.time() - start_time) / 3600\n", "cpu_time_hours = (time.process_time() - start_cpu) / 3600\n", "\n", "# Get \"best fit\" parameters (median of posterior)\n", "# Discard the burn-in steps and flatten the walkers into one (n_samples, ndim) array\n", "samples = sampler.get_chain(discard=burnin, flat=True)\n", "best_fit_params = {\n", "    \"t0\": np.median(samples[:, 0]),\n", "    \"u0\": np.median(samples[:, 1]),\n", "    \"tE\": np.median(samples[:, 2]),\n", "}\n", "\n", "# Update the model with best fit parameters and calculate fluxes\n", "# MulensModel names use an underscore (t_0); the submission keys drop it (t0)\n", "for param_name in parameters_to_fit:\n", "    setattr(event_object.model.parameters, param_name, best_fit_params[param_name.replace('_', '')])\n", "\n", "# Fit fluxes using MulensModel\n", "event_object.fit_fluxes()\n", "([F_S], F_B) = event_object.get_flux_for_dataset(event_object.datasets[0])\n", "\n", "best_fit_params[f\"F{first_band}_S\"] = F_S\n", "best_fit_params[f\"F{first_band}_B\"] = F_B\n", "\n", "# Calculate log likelihood at best fit\n", "best_log_likelihood = log_likelihood([\n", "    best_fit_params[\"t0\"], \n", "    
best_fit_params[\"u0\"], \n", "    best_fit_params[\"tE\"]\n", "], event_object, parameters_to_fit)\n", "\n", "n_data_points = len(my_data.time)\n", "\n", "print(f\"Best-fit parameters obtained: {best_fit_params}\")\n", "print(f\"Wall time: {wall_time_hours:.3f} hours\")\n", "print(f\"CPU time: {cpu_time_hours:.3f} hours\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While that's running, consider consulting the [`microlens-submit` documentation](https://microlens-submit.readthedocs.io/en/latest/submission_manual.html) for details on the submission structure. **You can add solutions as you progress with your analysis or collect like-models in CSV files and use the bulk-csv-import function.** The documentation includes more detailed instructions for building your own submission [manually](https://microlens-submit.readthedocs.io/en/latest/submission_manual.html#manual-submission-format), using the [Python API](https://microlens-submit.readthedocs.io/en/latest/api.html), using the [CLI tool](https://microlens-submit.readthedocs.io/en/latest/tutorial.html), and using the [CSV bulk import procedure](https://microlens-submit.readthedocs.io/en/latest/submission_manual.html#csv-import-format). \n", "\n", "The submission tool will check that you meet all submission requirements and that your inputs for a model, solution, event, or tier are as expected. It also creates HTML dossiers that allow you to inspect your progress.\n", "\n", "You can choose for yourself how integrated you want the tool to be in your workflow, but **submissions intended for evaluation should pass validation**. This ensures that we can partially automate the evaluation process, quantitatively assessing your submission accuracy compared with select simulation truths. \n", "\n", "If you run into issues with the tool, contact the developer through the [RMDC26 Slack Workspace](https://rmdc26.slack.com), [GitHub Issues](https://github.com/rges-pit/microlens-submit/issues), or the [RGES-PIT website help form](https://rges-pit.org/data-challenge/help/). \n", "\n", "> This tool is under active development, and feedback or feature requests are welcome. Updates such as dossier plots can be expected during the challenge, so we recommend you periodically update the package with `conda update -c conda-forge microlens-submit` or `pip install -U microlens_submit`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.4. Results\n", "Let's save and plot all the fit results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 5.4. Create the initial solution and event objects\n", "event = submission.get_event(EVENT_ID)\n", "solution = event.add_solution(\n", "    model_type=\"1S1L\",\n", "    parameters=best_fit_params,\n", "    alias=f\"Preliminary PSPL - Band {first_band}\",\n", "    bands=[f\"{first_band}\"]\n", ")\n", "\n", "# We create the solution now because the generated solution ID is needed for the file paths below" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Save posteriors\n", "posterior_dir = Path(project_path) / \"events\" / EVENT_ID / \"solutions\" / solution.solution_id\n", "posterior_dir.mkdir(parents=True, exist_ok=True)\n", "posterior_path = posterior_dir / \"posteriors.csv\"\n", "\n", "# Save posterior samples with column headers (no parallax parameters)\n", "posterior_data = np.column_stack([\n", "    samples[:, 0], samples[:, 1], samples[:, 2]\n", "])\n", "np.savetxt(posterior_path, posterior_data, \n", "    delimiter=',', \n", "    header='t0,u0,tE',\n", "    comments='')\n", "\n", "# Save lightcurve plot\n", "plt.figure(figsize=(10, 6))\n", "plt.errorbar(my_data.time, my_data.mag, yerr=my_data.err_mag, fmt='.', \n", "    color='k', alpha=0.5, 
label='Data')\n", "\n", "# Plot best fit model using the event object\n", "event_object.plot_model(color='r', label='Best Fit')\n", "plt.legend()\n", "plt.gca().invert_yaxis()\n", "plt.title(f\"{EVENT_ID} Best Fit - Band {first_band}\")\n", "plt.xlabel(\"BJD\")\n", "plt.ylabel(\"Magnitude\")\n", "\n", "lightcurve_path = posterior_dir / \"lightcurve.png\"\n", "plt.savefig(lightcurve_path, dpi=150, bbox_inches='tight')\n", "plt.show()\n", "\n", "# Save caustic diagram (for PSPL, this will be empty since no caustics)\n", "plt.figure(figsize=(8, 8))\n", "event_object.model.plot_caustics()\n", "plt.title(f\"{EVENT_ID} Caustic Diagram - Band {first_band}\")\n", "caustic_path = posterior_dir / \"caustic.png\"\n", "plt.savefig(caustic_path, dpi=150, bbox_inches='tight')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.5 Solution Management\n", "\n", "Now, we package these results into the submission object.\n", "\n", "A solution entry can store core fit information, review metadata, and optional artifacts. 
Common fields include:\n", "* `model_type`\n", "* `parameters` matching the indicated `model_type`, `higher_order_effects`, and any declared `bands`\n", "* `higher_order_effects` (if any are present in the model)\n", "* `bands` and `t_ref` when required by the fit configuration\n", "* `log_likelihood` and `n_data_points` for solution comparison and BIC-based ranking\n", "* `compute_info` containing CPU / wall time, dependencies, and git provenance\n", "* `alias`, `notes_path`, `posterior_path`, `lightcurve_plot_path`, and `lens_plane_plot_path`\n", "* `parameter_uncertainties`, `physical_parameters`, and `physical_parameter_uncertainties`\n", "* `uncertainty_method` and `confidence_level`\n", "* `used_astrometry`, `used_postage_stamps`, `limb_darkening_model`, and `limb_darkening_coeffs`\n", "* `hardware_info` if this solution was produced on different hardware from the submission default\n", "\n", "For baseline validity, the tool mainly checks that the solution is self-consistent and not duplicated:\n", "* `model_type` is recognized\n", "* `parameters` match the declared `model_type`\n", "* any declared `higher_order_effects`, `bands`, and `t_ref` requirements are satisfied\n", "* parameter and uncertainty metadata are structurally consistent\n", "* solution alias is unique (if one is provided) for a given event.\n", "\n", "`log_likelihood`, `n_data_points`, and `compute_info` are strongly recommended, but they are not required just to save a valid solution. 
They become especially useful for dossier context, solution comparison, and automatic relative-probability calculation during export.\n", "\n", "You can set `compute_info` values using the method:\n", "`solution.set_compute_info(cpu_hours=..., wall_time_hours=...)`\n", "This method also captures:\n", "* The current Python environment (pip freeze output)\n", "* Git repository info (commit, branch, dirty status)\n", "\n", "You can call `set_compute_info()` with either or both values, or leave them blank if you want only the environment info.\n", "\n", "We calculated these compute times in the previous section using `time.time()` and `time.process_time()`. `time.time()` returns the wall-clock time (in seconds since the epoch), whereas `time.process_time()` returns the total CPU time (in seconds) that the current process has used. The difference between two `time.time()` calls is therefore the wall time elapsed between them, while the difference between two `time.process_time()` calls is the CPU time consumed in between, not including time spent sleeping or waiting for I/O. For example, a process that keeps 4 cores fully busy for 1 hour of wall time accumulates about 4 hours of CPU time. Note that `time.process_time()` does not include CPU time used by child processes."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Set compute info\n", "# automatically detects the current python environment and git repository state\n", "solution.set_compute_info(cpu_hours=cpu_time_hours, wall_time_hours=wall_time_hours)\n", "\n", "# Set other metadata\n", "solution.log_likelihood = best_log_likelihood\n", "solution.n_data_points = n_data_points\n", "solution.bands = [str(first_band)]\n", "\n", "# Set file references\n", "solution.posterior_path = str(posterior_path.relative_to(project_path))\n", "solution.lightcurve_plot_path = str(lightcurve_path.relative_to(project_path))\n", "solution.lens_plane_plot_path = str(caustic_path.relative_to(project_path))\n", "\n", "# Set notes\n", "solution.set_notes(f\"\"\"\n", "# Preliminary Fit Analysis\n", "\n", "This is a preliminary PSPL fit to determine approximate parameters.\n", "\n", "## Fit Details\n", "- **Model Type**: 1S1L\n", "- **Band Used**: {first_band}\n", "- **Data Points**: {n_data_points}\n", "- **MCMC Walkers**: {nwalkers}\n", "- **MCMC Steps**: {nsteps}\n", "- **Burn-in**: {burnin} steps\n", "\n", "This solution remains active by default; you can deactivate it later if you decide not to include it in your final export.\n", "\"\"\")\n", "\n", "# Save everything\n", "submission.save()\n", "\n", "print(f\"✅ Solution '{solution.alias}' ({solution.solution_id}) created and saved!\")\n", "print(f\" - Posteriors saved to: {posterior_path}\")\n", "print(f\" - Lightcurve plot saved to: {lightcurve_path}\")\n", "print(f\" - Caustic diagram saved to: {caustic_path}\")\n", "print(f\" - Solution remains active unless you deactivate it later\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.6. 
Estimating Physical Parameters with pyLIMASS\n", "\n", "Once we have the microlensing model parameters (like $t_E$, $\\pi_E$, $\\theta_E$) and typical observables, we can use tools like `pyLIMASS` to estimate the physical properties of the lens (Mass $M_L$, Distance $D_L$).\n", "\n", "In this example, our preliminary fit yielded $t_E$. To estimate physical parameters, we typically need additional constraints such as the microlens parallax ($\\pi_E$) or the angular Einstein radius ($\\theta_E$). For demonstration purposes, let's assume we have constraints on $\\theta_E$ (for example, from finite-source effects combined with a color-magnitude diagram estimate of the angular source size) and on $\\pi_E$ from a parallax fit (or use priors).\n", "\n", "We will use `pyLIMA.pyLIMASS` to generate posterior distributions for the lens mass and distance, and then save these results to our `microlens-submit` solution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# -----------------------------------------------------------------------------\n", "# Estimate Physical Parameters\n", "# -----------------------------------------------------------------------------\n", "\n", "# In a real analysis, you would use measured values for piE, thetaE, etc.\n", "# Here we will simulate them for demonstration.\n", "\n", "# 1. Gather/Simulate Inputs\n", "tE_val = best_fit_params['tE']\n", "tE_err = tE_val * 0.1 # Assume 10% error\n", "\n", "# Simulated constraints (e.g. from parallax fit and CMD)\n", "thetaE_val = 0.5 # mas\n", "thetaE_err = 0.05 # mas\n", "piE_val = 0.15 # magnitude of the parallax vector (dimensionless)\n", "piE_err = 0.02\n", "Js = 16.5; eJs = 0.02\n", "Hs = 16.0; eHs = 0.02\n", "Ks = 15.8; eKs = 0.02\n", "\n", "# 2. 
Run pyLIMASS (or Mock)\n", "# Prepare observables dictionary for pyLIMASS\n", "# Note: This setup mimics pyLIMA_example_6.py usage\n", "\n", "# Generate samples for the observables\n", "n_samples = 1000\n", "pie_samples = np.random.normal(piE_val, piE_err, n_samples)\n", "thetae_samples = np.random.normal(thetaE_val, thetaE_err, n_samples)\n", "\n", "# Mock magnitudes\n", "mags_source = {\n", " 'Jmag': np.random.normal(Js, eJs, n_samples),\n", " 'Hmag': np.random.normal(Hs, eHs, n_samples),\n", " 'Kmag': np.random.normal(Ks, eKs, n_samples)\n", "}\n", "\n", "obs = {\n", " 'log10(pi_E)': np.log10(np.abs(pie_samples)),\n", " 'log10(theta_E)': np.log10(np.abs(thetae_samples)),\n", " 'log10(t_E)': np.log10(np.random.normal(tE_val, tE_err, n_samples)),\n", " 'mags_source': mags_source,\n", " # Add other necessary fields with dummy values if needed by constructor\n", " 'mags_baseline': {'Hmag': np.random.normal(16.0, 0.05, n_samples)}\n", "}\n", " \n", "\n", "SLP = SourceLensProbabilities(observables=obs, stellar_lens=True)\n", "\n", "\n", "# Derived values (Median of distributions)\n", "# For this demo, we assume the physical outputs from the tool:\n", "Mtot_est = 0.52 # Solar masses\n", "Mtot_err = 0.08\n", "DL_est = 4.2 # kpc\n", "DL_err = 0.5\n", "\n", "print(f\"\\nDerived Physical Parameters:\")\n", "print(f\" Mtot: {Mtot_est} +/- {Mtot_err} M_sun\")\n", "print(f\" D_L: {DL_est} +/- {DL_err} kpc\")\n", "\n", "# 3. 
Add to Solution\n", "# This is how you allow microlens-submit to track your physical results!\n", "\n", "solution.physical_parameters = {\n", " \"Mtot\": Mtot_est,\n", " \"D_L\": DL_est,\n", " \"thetaE\": thetaE_val,\n", " \"piE\": piE_val,\n", " # You can add vector components if you have them:\n", " # \"piE_N\": ..., \"piE_E\": ...\n", "}\n", "\n", "# Add uncertainties (optional but recommended)\n", "solution.physical_parameter_uncertainties = {\n", " \"Mtot\": [Mtot_est - Mtot_err, Mtot_est + Mtot_err], # Asymmetric range [low, high]\n", " \"D_L\": DL_err, # Symmetric uncertainty\n", " \"thetaE\": thetaE_err,\n", " \"piE\": piE_err\n", "}\n", "\n", "# 4. Save Updates\n", "submission.save()\n", "print(f\"✅ Physical parameters saved to solution '{solution.alias}'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.7 Solution Comparison and Relative Probability\n", "\n", "When you have multiple solutions for the same event, you can specify their relative probabilities to indicate which model you think is most likely to be correct.\n", "\n", "**Key Points:**\n", "* **Relative probabilities are optional.**\n", "* **If you set them manually for multiple active solutions, they should sum to 1.0.**\n", "* **If some or all active solutions leave them unset, export calculates the missing probability mass using the Bayesian Information Criterion (BIC) when it has enough information.**\n", "* **If BIC calculation is not possible for the unset solutions, the remaining probability mass is split equally among those unset solutions.**\n", "\n", "**Example of manual relative probability assignment:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Example: If you have multiple solutions for the same event\n", "# solution1.relative_probability = 0.7 # 70% confidence\n", "# solution2.relative_probability = 0.3 # 30% confidence\n", "# Total must equal 1.0\n", "\n", "# For this example, we'll just note that 
relative probability is optional\n", "# and will be calculated automatically if not provided\n", "print(\"Relative probability will be calculated automatically during export if not specified.\")\n", "print(\"The calculation uses BIC: BIC = k*ln(n) - 2*log_likelihood\")\n", "print(\"where k = number of parameters, n = number of data points\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**How automatic calculation works during export:**\n", "1. **If all active solutions have relative probabilities** and they sum to 1.0, those values are used as-is\n", "2. **If some active solutions have relative probabilities** but others don't, the remaining probability mass is distributed across the unset solutions using BIC when possible\n", "3. **If no active solutions have relative probabilities**, export applies the same BIC-based calculation across all active solutions when possible\n", "4. **If BIC calculation is not possible for the unset solutions** (missing log_likelihood, n_data_points, or countable model parameters), the remaining probability mass is split equally among those unset solutions\n", "\n", "This ensures that evaluators always have a complete probability distribution for your active solutions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Dossier and Solution Management\n", "\n", "Before the final validation and export (Section 7), we will generate a human-readable HTML report (a 'dossier') so you can review the state of your submission, and show how to deactivate or remove solutions you do not want to include." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 1. 
Create the HTML Dossier for your own records\n", "from microlens_submit.dossier import generate_dashboard_html\n", "\n", "dossier_path = project_path / \"dossier\"\n", "generate_dashboard_html(submission, dossier_path, open=False)\n", "print(f\"HTML Dossier report generated at: {dossier_path}/index.html\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.1. Display the Dashboard" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Show the dashboard in the notebook\n", "HTML(submission.notebook_display_dashboard())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.2. Display the Event Page" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Show the event page in the notebook\n", "HTML(submission.notebook_display_event(event.event_id))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.3. Display the Solution Page" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Show the solution page in the notebook\n", "HTML(submission.notebook_display_solution(solution.solution_id))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.4. Display Full Report\n", "\n", "This is the report that will be generated for the evaluators, with placeholders for sections that are not meant for participants. You can generate this report to view the full state of your project as you go, in the format in which it will be evaluated. This can get large for a mature submission, so the example is shown as a fenced block rather than a runnable cell.\n", "\n", "```python\n", "# Show the full dossier in a notebook\n", "HTML(submission.notebook_display_full_dossier())\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Optional: Deactivating or Removing a Fit\n", "\n", "If you want to keep a fit locally but exclude it from the dossier and export, you can deactivate it. 
Do not deactivate your only active solution unless you plan to activate another one before exporting." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "# Optional: exclude this fit from dossier/export while keeping it locally\n", "solution.deactivate()\n", "submission.save()\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also hard-delete a solution by calling `event.remove_solution(solution_id, force=True)` or by navigating to the solution in the project directory and deleting it manually. If you change your mind about a deactivated fit later, reactivate it with `solution.activate()`.\n", "\n", "If you make one of these changes, save the submission again before validating or exporting." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Save any final changes before validation\n", "submission.save()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Validation and Submitting the Project\n", "\n", "This is the final step. We will validate the submission for completeness and export the final `.zip` file to the official submission location." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# 1. Validate the submission\n", "print(\"\\nRunning validation...\")\n", "warnings = submission.run_validation()\n", "if warnings:\n", " print(\"⚠️ Validation Warnings Found:\")\n", " for w in warnings:\n", " print(f\" - {w}\")\n", "else:\n", " print(\"✅ Submission is valid and ready for export!\")\n", "\n", "# 2. 
Export the final submission file\n", "\n", "try:\n", " export_path = project_path / \"final_submission\"\n", " print(f\"\\nExporting submission to: {export_path}\")\n", " submission.export(str(export_path))\n", " print(f\"\\n🎉 Successfully exported to {export_path}\")\n", " print(\"You would now upload this file to the secure submission location.\")\n", "except ValueError as e:\n", " print(f\"❌ Export failed: {e}\")\n", "\n", "# That bit's a trick. It doesn't actually go anywhere.\n", "# We will collect the most recent submission zip in your team storage on the RRN\n", "# for you. No need to do anything. 🥳" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional Resources\n", "- [Roman Research Nexus: Data Discovery & Access](https://spacetelescope.github.io/roman-notebooks/notebooks/data_discovery_and_access/data_discovery_and_access.html)\n", "- [Session A — Nexus Setup & Help (AAS Workshop site)](https://rges-pit.org/data-challenge/aas-workshop/1-nexus/)\n", "- [Roman Microlensing Data Challenge landing page](https://rges-pit.org/data-challenge/)\n", "- [microlens-submit documentation](https://microlens-submit.readthedocs.io/en/latest/)\n", "- [Next notebook: Session B — Single Lens & Pipelines](https://github.com/rges-pit/data-challenge-notebooks/blob/main/AAS%20Workshop/Session%20B%3A%20Single%20Lens%20%26%20Pipelines/Single_Lens_Pipeline.ipynb)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Appendix: Working with Parquet Files\n", "\n", "The Roman Microlensing Data Challenge may provide data in **Parquet** format, a columnar storage format that is highly efficient for large datasets. 
This appendix covers the basics of reading, writing, and working with Parquet files in Python.\n", "\n", "### Why Parquet?\n", "\n", "- **Efficient storage**: Parquet uses columnar compression, significantly reducing file sizes compared to CSV.\n", "- **Fast read performance**: Only the columns you need are loaded into memory.\n", "- **Schema preservation**: Data types are preserved, avoiding issues with type inference.\n", "- **Compatibility**: Works seamlessly with pandas, PyArrow, and cloud storage (S3)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Installing Required Packages\n", "\n", "If you don't already have `pyarrow` or `fastparquet` installed, you can install them with:\n", "\n", "```python\n", "# Install pyarrow (recommended) or fastparquet\n", "%pip install pyarrow\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading Parquet Files\n", "\n", "#### Basic Reading with pandas\n", "\n", "```python\n", "import pandas as pd\n", "\n", "# Read a local Parquet file\n", "df = pd.read_parquet(\"path/to/file.parquet\")\n", "\n", "# Read only specific columns (more memory efficient)\n", "df = pd.read_parquet(\"path/to/file.parquet\", columns=[\"HJD\", \"mag\", \"mag_err\", \"band\"])\n", "\n", "# Example: Display the first few rows\n", "df.head()\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Reading Parquet from S3 (Cloud Storage)\n", "\n", "For data stored in S3 buckets (like the Data Challenge data), you can read Parquet files directly:\n", "\n", "```python\n", "import s3fs\n", "import pandas as pd\n", "\n", "# Connect to S3 (anonymous access for public buckets)\n", "fs = s3fs.S3FileSystem(anon=True)\n", "\n", "# Read a Parquet file directly from S3\n", "PARQUET_URI = \"s3://rmdc26-data-public/example/lightcurve.parquet\"\n", "with fs.open(PARQUET_URI, 'rb') as f:\n", " df = pd.read_parquet(f)\n", "\n", "# Alternative: Use pandas with storage_options\n", "df = pd.read_parquet(\n", " 
\"s3://rmdc26-data-public/example/lightcurve.parquet\",\n", " storage_options={\"anon\": True}\n", ")\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Writing Parquet Files\n", "\n", "Saving your processed data or results as Parquet is straightforward:\n", "\n", "```python\n", "# Create a sample DataFrame\n", "sample_data = pd.DataFrame({\n", " \"event_id\": [\"event_001\", \"event_002\", \"event_003\"],\n", " \"t0\": [2459000.5, 2459100.2, 2459200.8],\n", " \"tE\": [25.3, 42.1, 18.7],\n", " \"u0\": [0.1, 0.05, 0.2],\n", " \"classification\": [\"PSPL\", \"Binary\", \"PSPL\"]\n", "})\n", "\n", "# Save to Parquet with default compression (snappy)\n", "sample_data.to_parquet(\"results.parquet\", index=False)\n", "\n", "# Save with gzip compression (smaller file, slower read/write)\n", "sample_data.to_parquet(\"results.parquet\", index=False, compression=\"gzip\")\n", "\n", "print(\"Sample DataFrame:\")\n", "sample_data\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspecting Parquet File Metadata\n", "\n", "You can inspect a Parquet file's schema and metadata without loading the entire file:\n", "\n", "```python\n", "import pyarrow.parquet as pq\n", "\n", "# Read metadata without loading data\n", "parquet_file = pq.ParquetFile(\"path/to/file.parquet\")\n", "\n", "# Get schema (column names and types)\n", "print(\"Schema:\")\n", "print(parquet_file.schema_arrow)\n", "\n", "# Get number of rows without reading the data\n", "print(f\"\\nNumber of rows: {parquet_file.metadata.num_rows}\")\n", "\n", "# Get number of row groups\n", "print(f\"Number of row groups: {parquet_file.metadata.num_row_groups}\")\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Working with Large Parquet Datasets\n", "\n", "For datasets too large to fit in memory, you can process them in chunks:\n", "\n", "```python\n", "# Read Parquet file in batches using PyArrow\n", "parquet_file = 
pq.ParquetFile(\"large_dataset.parquet\")\n", "\n", "# Stream the file in batches of up to 10,000 rows\n", "for batch in parquet_file.iter_batches(batch_size=10000):\n", " df_chunk = batch.to_pandas()\n", " # Process each chunk\n", " print(f\"Processing {len(df_chunk)} rows...\")\n", "\n", "# Alternative: Read row groups individually\n", "for i in range(parquet_file.metadata.num_row_groups):\n", " table = parquet_file.read_row_group(i)\n", " df_chunk = table.to_pandas()\n", " # Process each row group\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Converting Between Formats\n", "\n", "#### CSV to Parquet and Back\n", "\n", "```python\n", "# Convert CSV to Parquet\n", "df = pd.read_csv(\"lightcurve.csv\")\n", "df.to_parquet(\"lightcurve.parquet\", index=False)\n", "\n", "# Convert Parquet to CSV\n", "df = pd.read_parquet(\"lightcurve.parquet\")\n", "df.to_csv(\"lightcurve.csv\", index=False)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tips for Working with Parquet in the Data Challenge\n", "\n", "1. **Use column selection**: When reading large files, only load the columns you need:\n", " ```python\n", " df = pd.read_parquet(\"file.parquet\", columns=[\"bjd\", \"mag\", \"mag_err\", \"filt\"])\n", " ```\n", "\n", "2. **Leverage filtering**: PyArrow supports predicate pushdown for efficient filtering:\n", " ```python\n", " df = pd.read_parquet(\"file.parquet\", filters=[(\"filt\", \"==\", \"F146\")])\n", " ```\n", "\n", "3. **Choose appropriate compression**:\n", " - `snappy` (default): Fast compression/decompression, moderate size\n", " - `gzip`: Smaller files, slower I/O\n", " - `zstd`: Good balance of speed and compression\n", "\n", "4. **Preserve data types**: Parquet maintains column types, avoiding CSV parsing issues with floats/integers.\n", "\n", "5. 
**Use partitioned datasets**: For very large datasets, partition by a key column (e.g., `event_id`) for faster access to specific events." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## About this Notebook\n", "**Author(s):** Amber Malpas, Meet Vyas
\n", "**Keyword(s):** Roman, Microlensing, Data Challenge, Workflow
\n", "**Last Updated:** December 2025\n", "\n", "## Citations\n", "\n", "If you use `MulensModel`, `pyLIMASS`, `microlens-submit`, `emcee`, or this notebook for published research, please cite the\n", "authors. Follow these links for more information about citing:\n", "\n", "* [Citing `MulensModel`](https://github.com/rpoleski/MulensModel/blob/master/CITATION.cff)\n", "* [Citing `pyLIMASS`](https://github.com/ebachelet/pyLIMA?tab=readme-ov-file#citations)\n", "* [Citing `microlens-submit`](https://github.com/rges-pit/microlens-submit/blob/main/CITATION.cff)\n", "* [Citing `emcee`](https://github.com/dfm/emcee?tab=readme-ov-file#attribution)\n", "* [Citing **Roman Microlensing Data Challenge 2026 Notebooks**](https://github.com/rges-pit/data-challenge-notebooks/blob/main/zenodo.txt)\n", "\n", "\n", "### Citing **microlens-submit**\n", "\n", "If you use `microlens-submit` in your research, please cite:\n", "\n", "```\n", "Malpas, A. (2025). microlens-submit. Zenodo. https://doi.org/10.5281/zenodo.17459752\n", "```\n", "\n", "**BibTeX:**\n", "```bibtex\n", "@software{malpas_2025_17468488,\n", " author = {Malpas, Amber},\n", " title = {microlens-submit},\n", " month = oct,\n", " year = 2025,\n", " publisher = {Zenodo},\n", " version = {v0.16.3},\n", " doi = {10.5281/zenodo.17468488},\n", " url = {https://doi.org/10.5281/zenodo.17468488},\n", "}\n", "```\n", "\n", "### Citing Roman Microlensing Data Challenge 2026 Notebooks\n", "\n", "If you use our notebooks in your project, please cite:\n", "\n", "```\n", "Malpas, A., Murlidhar, A., Vandorou, K., Kruszyńska, K., Crisp, A., & Vyas, M. (2026). Roman Microlensing Data Challenge 2026 Notebooks (v1.0.0). Zenodo. 
https://doi.org/10.5281/zenodo.18262183\n", "```\n", "\n", "**BibTeX:**\n", "```bibtex\n", "@software{malpas_2025_17806271,\n", " author = {Malpas, Amber and Murlidhar, Arjun and Vyas, Meet and Vandorou, Katie and Kruszyńska, Katarzyna and Crisp, Ali},\n", " title = {Roman Microlensing Data Challenge 2026 Notebooks},\n", " month = dec,\n", " year = 2026,\n", " publisher = {Zenodo},\n", " version = {v1.0.0},\n", " doi = {10.5281/zenodo.18262183},\n", " url = {https://doi.org/10.5281/zenodo.18262183}\n", "}\n", "```\n", "\n", "\n", "\n", "" ] } ], "metadata": { "kernelspec": { "display_name": "rges-pit-dc", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, "nbformat_minor": 4 }