{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "\n", "sys.path.append(\"..\")\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sn\n", "import utils\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "plt.style.use(\"ggplot\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Tiltaksovervakingen: option for quality control of analysis data\n", "## Notebook 3: Outlier detection for whole water samples\n", "\n", "Exploring distributions for **single parameters** (as in notebook 2) is a reasonable starting point for quality assurance, but a more general approach is to look for \"outliers\" at the **water sample** level, i.e. samples that are of questionable quality because *one or more* parameter values are unusual. If the suite of water quality parameters analysed is consistent (i.e. no data gaps), then each sample can be considered as a point in $n$-dimensional space, where $n$ is the number of parameters measured. Rather than looking for \"outliers\" along a single dimension (as in the distribution plots considered already), we can instead look for \"outliers\" in this higher-dimensional space. A variety of algorithms are available to do this, some of which are explored below. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Terminology: \"outlier\" versus \"novelty\" detection\n", "\n", "\"**Outlier**\" and \"**novelty**\" detection are two different kinds of **anomaly** detection, i.e. where we are interested in detecting abnormal or unusual observations. The difference between the two is important, but often overlooked:\n", "\n", " * **Outlier detection:** We have a **single dataset** that is believed to contain \"outliers\", which are observations that are \"far\" from the others (for some chosen definition of \"far\"). 
Outlier detection estimators thus try to fit the regions where the data are most concentrated, ignoring the deviant observations.\n", "\n", " * **Novelty detection:** We have access to a **reference dataset** that is *not* polluted by outliers, and we are interested in detecting whether a new observation (from a second dataset) is an outlier - a \"novelty\" - or not.\n", " \n", "In the context of this project, we are primarily interested in **novelty detection**, because we have a reference historic dataset extracted from Vannmiljø and we would like to gauge whether observations in the \"new\" dataset are sufficiently unusual/unlikely to warrant further investigation and reanalysis. However, the [summary in the previous notebook](https://nbviewer.jupyter.org/github/NIVANorge/tiltaksovervakingen/blob/master/notebooks/02_distribution_plots.ipynb#3.-Summary) identified some possible issues in the Vannmiljø data, as well as in the \"new\" data. I am therefore reluctant to use the historic Vannmiljø dataset as a \"reference\" without additional cleaning. Instead, at least to begin with, I will combine the \"new\" and \"historic\" results into a single dataset and then perform **outlier detection** (not novelty detection). This can be revised later if desired." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Read data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Choose dataset to process\n", "lab = \"Eurofins\"\n", "year = 2022\n", "qtr = 3\n", "version = 1" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "fold_path = f\"../../output/{lab.lower()}_{year}_q{qtr}_v{version}\"" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " vannmiljo_code sample_date lab period depth1 depth2 ALK_mmol/l \\\n", "0 002-105961 2022-07-06 Eurofins new 0.0 0.0 0.03 \n", "1 002-105961 2022-07-19 Eurofins new 0.0 0.0 NaN \n", "2 002-105961 2022-08-03 Eurofins new 0.0 0.0 0.06 \n", "3 002-105961 2022-08-12 Eurofins new 0.0 0.0 0.04 \n", "4 002-105961 2022-08-30 Eurofins new 0.0 0.0 NaN \n", "\n", " ANC_µekv/l CA_mg/l CL_mg/l ... N-NO3_µg/l N N-TOT_µg/l N NA_mg/l \\\n", "0 95.0 2.1 1.5 ... 5.0 260.0 0.47 \n", "1 NaN 2.5 NaN ... NaN NaN NaN \n", "2 130.0 2.7 1.4 ... 5.0 250.0 0.35 \n", "3 110.0 2.3 1.5 ... 5.0 330.0 0.48 \n", "4 NaN 2.6 NaN ... NaN NaN NaN \n", "\n", " P-TOT_µg/l P PH_ RAL_µg/l Al SIO2_µg/l Si SO4_mg/l TEMP_°C \\\n", "0 17.0 5.5 67.0 701.314913 0.40 NaN \n", "1 NaN 5.9 NaN NaN NaN 15.0 \n", "2 17.0 6.2 48.0 607.806258 0.27 15.0 \n", "3 17.0 5.6 61.0 654.560586 0.32 NaN \n", "4 NaN 5.7 NaN NaN NaN 13.0 \n", "\n", " TOC_mg/l C \n", "0 19.0 \n", "1 NaN \n", "2 16.0 \n", "3 21.0 \n", "4 NaN \n", "\n", "[5 rows x 25 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read from SQLite\n", "stn_df, df = utils.read_data_from_sqlite(lab, year, qtr, version)\n", "\n", "# # Subset data to just the quarter of interest\n", "# months_dict = {\n", "# \"q1\": [1, 2, 3],\n", "# \"q2\": [4, 5, 6],\n", "# \"q3\": [7, 8, 9],\n", "# \"q4\": [10, 11, 12],\n", "# }\n", "# months = months_dict[qtr]\n", "# df = df[df[\"sample_date\"].dt.month.isin(months)]\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Select parameters\n", "\n", "In order to perform outlier detection in a multi-dimensional space, it is necessary that all water samples have a **complete set of values for all parameters**. This is because outlier detection algorithms work by calculating distance metrics between samples, and this is not possible if a sample can't be located along one or more of the dimension axes due to missing values. 
It is therefore necessary to choose a set of parameters where the data are complete.\n", "\n", "The code below calculates the percentage of the time that each lab measures each parameter, where the percentage is calculated as:\n", "\n", " 100 * number_of_samples_for_par_X_from_lab_Y / total_number_of_samples_from_lab_Y" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ANC_µekv/l CA_mg/l CL_mg/l ILAL_µg/l Al \\\n", "lab \n", "Eurofins 29.250457 100.000000 29.250457 57.678245 \n", "Eurofins (historic) 27.477785 73.957621 26.930964 56.322625 \n", "NIVA (historic) 15.176435 97.585227 15.759569 36.326256 \n", "VestfoldLAB (historic) 17.496321 99.873142 17.821079 28.812097 \n", "\n", " KOND_mS/m K_mg/l LAL_µg/l Al MG_mg/l \\\n", "lab \n", "Eurofins 57.678245 29.250457 52.376600 29.250457 \n", "Eurofins (historic) 56.664388 34.381408 0.000000 25.563910 \n", "NIVA (historic) 97.547847 15.752093 0.052333 15.774522 \n", "VestfoldLAB (historic) 17.816004 17.816004 28.355407 17.816004 \n", "\n", " N-NO3_µg/l N N-TOT_µg/l N NA_mg/l P-TOT_µg/l P \\\n", "lab \n", "Eurofins 29.250457 29.250457 29.250457 29.250457 \n", "Eurofins (historic) 27.136022 25.085441 25.153794 25.017088 \n", "NIVA (historic) 15.737141 15.774522 15.774522 15.759569 \n", "VestfoldLAB (historic) 3.004009 17.460801 17.816004 16.207439 \n", "\n", " PH_ RAL_µg/l Al SIO2_µg/l Si SO4_mg/l \\\n", "lab \n", "Eurofins 99.908592 57.678245 29.250457 29.250457 \n", "Eurofins (historic) 93.438141 56.527683 0.000000 27.546138 \n", "NIVA (historic) 97.742225 36.326256 11.819677 15.759569 \n", "VestfoldLAB (historic) 99.873142 36.763587 0.000000 17.810930 \n", "\n", " TEMP_°C TOC_mg/l C \n", "lab \n", "Eurofins 20.566728 57.678245 \n", "Eurofins (historic) 27.614491 56.459330 \n", "NIVA (historic) 5.210825 15.774522 \n", "VestfoldLAB (historic) 18.176283 22.702593 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Percentage of total samples analysed per parameter, split by lab\n", "pct_df = df.groupby(\"lab\").agg(\"count\")\n", "tot_samps = pct_df[\"period\"].copy()\n", "\n", "for col in pct_df.columns:\n", " pct_df[col] = 100 * pct_df[col] / tot_samps\n", "\n", "pct_df = pct_df.iloc[:, 6:]\n", "\n", "pct_df" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"Text(0.5, 1.0, 'Percentage of total samples analysed per parameter, split by lab')" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot\n", "ax = pct_df.T.plot.bar(figsize=(15, 5))\n", "ax.set_ylabel(\"Percent of samples\")\n", "ax.set_title(\"Percentage of total samples analysed per parameter, split by lab\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Isolation Forests\n", "\n", "Isolation Forests are a type of random forest well suited to outlier detection. They have the advantage of making few assumptions about the underlying data distribution, which is useful in situations where - as here - most/all of the variables are strongly skewed.\n", "\n", "Isolation forests have a `contamination` parameter, which can be broadly interpreted as the \"expected proportion of outliers in the dataset\". In other words, setting `contamination=0.01` roughly translates to finding the most unusual 1% of data values. Without a strong theoretical basis for setting the `contamination` parameter, it must be found either by manual tuning or be fixed based on practical considerations (e.g. how many water samples can we realistically afford to reanalyse).\n", "\n", "### 4.1. `CA` and `PH` only\n", "\n", "The code below applies the isolation forest algorithm to `CA` and `PH`. Since these are almost always measured, this includes virtually all water samples (both new and historic) in the dataset." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/lib/python3.10/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but IsolationForest was fitted with feature names\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "The total number of samples in the dataset is: 34852.\n", "\n", "The total number of outliers detected is 349:\n", " 320 in the 'historic' period\n", " 29 in the 'new' period\n", "\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " vannmiljo_code sample_date lab period depth1 depth2 CA_mg/l \\\n", "3930 021-28450 2022-07-08 Eurofins new 0.0 0.0 7.7 \n", "3931 021-28450 2022-07-28 Eurofins new 0.0 0.0 7.6 \n", "3933 021-28450 2022-08-17 Eurofins new 0.0 0.0 12.0 \n", "3935 021-28450 2022-09-09 Eurofins new 0.0 0.0 8.6 \n", "4871 021-46373 2022-07-05 Eurofins new 0.0 0.0 6.1 \n", "4872 021-46373 2022-07-19 Eurofins new 0.0 0.0 7.9 \n", "4873 021-46373 2022-08-11 Eurofins new 0.0 0.0 7.4 \n", "5752 022-32018 2022-09-23 Eurofins new 0.0 0.0 9.6 \n", "5847 022-32019 2022-08-01 Eurofins new 0.0 0.0 5.4 \n", "5849 022-32019 2022-09-23 Eurofins new 0.0 0.0 7.4 \n", "5940 022-32020 2022-07-04 Eurofins new 0.0 0.0 7.4 \n", "5941 022-32020 2022-07-22 Eurofins new 0.0 0.0 6.0 \n", "5942 022-32020 2022-07-26 Eurofins new 0.0 0.0 6.2 \n", "5943 022-32020 2022-08-01 Eurofins new 0.0 0.0 6.5 \n", "6243 022-45769 2022-07-19 Eurofins new 0.0 0.0 6.5 \n", "6245 022-45769 2022-08-17 Eurofins new 0.0 0.0 5.9 \n", "6246 022-45769 2022-08-29 Eurofins new 0.0 0.0 5.7 \n", "6762 022-58904 2022-07-04 Eurofins new 0.0 0.0 6.0 \n", "6763 022-58904 2022-07-22 Eurofins new 0.0 0.0 5.7 \n", "6764 022-58904 2022-07-26 Eurofins new 0.0 0.0 5.4 \n", "6765 022-58904 2022-08-01 Eurofins new 0.0 0.0 5.6 \n", "6770 022-58904 2022-09-29 Eurofins new 0.0 0.0 5.8 \n", "14226 027-79278 2022-07-05 Eurofins new 0.0 0.0 4.8 \n", "14227 027-79278 2022-08-02 Eurofins new 0.0 0.0 4.9 \n", "17196 030-58838 2022-08-16 Eurofins new 0.0 0.0 7.3 \n", "17197 030-58838 2022-08-30 Eurofins new 0.0 0.0 5.6 \n", "19593 036-58751 2022-09-07 Eurofins new 0.0 0.0 11.0 \n", "31621 079-58878 2022-09-13 Eurofins new 0.0 0.0 6.6 \n", "33482 082-58870 2022-09-06 Eurofins new 0.0 0.0 10.0 \n", "\n", " PH_ pred \n", "3930 6.7 outlier \n", "3931 6.8 outlier \n", "3933 6.8 outlier \n", "3935 6.6 outlier \n", "4871 7.2 outlier \n", "4872 7.4 outlier \n", "4873 7.4 outlier \n", "5752 7.5 outlier \n", "5847 7.0 outlier \n", "5849 6.9 outlier 
\n", "5940 6.8 outlier \n", "5941 7.0 outlier \n", "5942 6.8 outlier \n", "5943 6.7 outlier \n", "6243 6.6 outlier \n", "6245 6.5 outlier \n", "6246 6.5 outlier \n", "6762 6.7 outlier \n", "6763 6.8 outlier \n", "6764 6.8 outlier \n", "6765 6.8 outlier \n", "6770 6.8 outlier \n", "14226 7.1 outlier \n", "14227 7.2 outlier \n", "17196 6.2 outlier \n", "17197 6.4 outlier \n", "19593 7.9 outlier \n", "31621 6.2 outlier \n", "33482 7.3 outlier " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Columns of interest\n", "key_cols = [\"vannmiljo_code\", \"sample_date\", \"lab\", \"period\", \"depth1\", \"depth2\"]\n", "par_cols = [\"CA_mg/l\", \"PH_\"]\n", "\n", "# Run algorithm\n", "data = df[key_cols + par_cols].dropna()\n", "data = utils.isolation_forest(data, par_cols, contamination=0.01)\n", "\n", "# Summarise results\n", "all_out = data.query(\"pred == 'outlier'\")\n", "his_out = data.query(\"(pred == 'outlier') and (period == 'historic')\")\n", "new_out = data.query(\"(pred == 'outlier') and (period == 'new')\")\n", "\n", "csv_path = os.path.join(fold_path, \"isoforest_ca_ph.csv\")\n", "new_out.to_csv(csv_path, index=False)\n", "\n", "print(f\"The total number of samples in the dataset is: {len(data)}.\\n\")\n", "print(\n", " f\"The total number of outliers detected is {len(all_out)}:\\n\"\n", " f\" {len(his_out)} in the 'historic' period\\n\"\n", " f\" {len(new_out)} in the 'new' period\\n\"\n", ")\n", "\n", "new_out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This initial approach identifies the **strangest 1% of the dataset overall**. \n", "\n", " * Most of the unusual samples (320 out of 349) are actually in the historic dataset (i.e. they are already in Vannmiljø). \n", " \n", " * The 29 samples in the 'new' dataset that have been classified as outliers are predominantly those with unusually high concentrations of `CA` (greater than around 5 mg/l; see plots below)." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot all samples\n", "fig, ax = plt.subplots(figsize=(6, 6))\n", "sn.scatterplot(\n", " data=data,\n", " x=\"PH_\",\n", " y=\"CA_mg/l\",\n", " hue=\"pred\",\n", " ax=ax,\n", " hue_order=[\"outlier\", \"inlier\"],\n", ")\n", "_ = ax.set_title(\"All data\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot just the 'new' samples\n", "data_new = data.query(\"period == 'new'\")\n", "fig, ax = plt.subplots(figsize=(6, 6))\n", "sn.scatterplot(\n", " data=data_new,\n", " x=\"PH_\",\n", " y=\"CA_mg/l\",\n", " hue=\"pred\",\n", " ax=ax,\n", " hue_order=[\"outlier\", \"inlier\"],\n", ")\n", "_ = ax.set_title(\"'New' data only\")\n", "plt.tight_layout()\n", "png_path = os.path.join(fold_path, \"isoforest_ca_ph_plot.png\")\n", "plt.savefig(png_path, dpi=200)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2. `CA`, `PH`, `ILAL` and `RAL`" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/lib/python3.10/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but IsolationForest was fitted with feature names\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "The total number of samples in the dataset is: 11439.\n", "\n", "The total number of outliers detected is 115:\n", " 100 in the 'historic' period\n", " 15 in the 'new' period\n", "\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " vannmiljo_code sample_date lab period depth1 depth2 CA_mg/l \\\n", "1620 019-101022 2022-09-06 Eurofins new 0.0 0.0 1.40 \n", "2205 019-79148 2022-09-06 Eurofins new 0.0 0.0 1.70 \n", "4073 021-29246 2022-09-30 Eurofins new 0.0 0.0 1.20 \n", "5752 022-32018 2022-09-23 Eurofins new 0.0 0.0 9.60 \n", "5846 022-32019 2022-07-22 Eurofins new 0.0 0.0 4.60 \n", "5847 022-32019 2022-08-01 Eurofins new 0.0 0.0 5.40 \n", "5849 022-32019 2022-09-23 Eurofins new 0.0 0.0 7.40 \n", "5941 022-32020 2022-07-22 Eurofins new 0.0 0.0 6.00 \n", "6763 022-58904 2022-07-22 Eurofins new 0.0 0.0 5.70 \n", "8335 024-58894 2022-09-07 Eurofins new 0.0 0.0 4.20 \n", "14226 027-79278 2022-07-05 Eurofins new 0.0 0.0 4.80 \n", "14227 027-79278 2022-08-02 Eurofins new 0.0 0.0 4.90 \n", "14228 027-79278 2022-09-06 Eurofins new 0.0 0.0 4.60 \n", "19873 036-58752 2022-08-23 Eurofins new 0.0 0.0 0.64 \n", "34645 082-58874 2022-07-05 Eurofins new 0.0 0.0 0.54 \n", "\n", " PH_ ILAL_µg/l Al RAL_µg/l Al pred \n", "1620 5.7 130.0 180.0 outlier \n", "2205 5.8 150.0 190.0 outlier \n", "4073 5.1 140.0 240.0 outlier \n", "5752 7.5 9.7 11.0 outlier \n", "5846 6.9 5.1 5.9 outlier \n", "5847 7.0 5.4 9.2 outlier \n", "5849 6.9 17.0 21.0 outlier \n", "5941 7.0 6.2 6.7 outlier \n", "6763 6.8 5.0 5.0 outlier \n", "8335 6.9 5.0 5.0 outlier \n", "14226 7.1 5.4 7.3 outlier \n", "14227 7.2 5.0 6.9 outlier \n", "14228 7.0 5.0 5.0 outlier \n", "19873 5.3 140.0 170.0 outlier \n", "34645 5.6 130.0 170.0 outlier " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Columns of interest\n", "key_cols = [\"vannmiljo_code\", \"sample_date\", \"lab\", \"period\", \"depth1\", \"depth2\"]\n", "par_cols = [\"CA_mg/l\", \"PH_\", \"ILAL_µg/l Al\", \"RAL_µg/l Al\"]\n", "\n", "# Run algorithm\n", "data = df[key_cols + par_cols].dropna()\n", "data = utils.isolation_forest(data, par_cols, contamination=0.01)\n", "\n", "# Summarise results\n", "all_out = data.query(\"pred 
== 'outlier'\")\n", "his_out = data.query(\"(pred == 'outlier') and (period == 'historic')\")\n", "new_out = data.query(\"(pred == 'outlier') and (period == 'new')\")\n", "\n", "print(f\"The total number of samples in the dataset is: {len(data)}.\\n\")\n", "print(\n", " f\"The total number of outliers detected is {len(all_out)}:\\n\"\n", " f\" {len(his_out)} in the 'historic' period\\n\"\n", " f\" {len(new_out)} in the 'new' period\\n\"\n", ")\n", "\n", "new_out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.3. All parameters *except* `LAL` and `TEMP`" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/lib/python3.10/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but IsolationForest was fitted with feature names\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "The total number of samples in the dataset is: 1759.\n", "\n", "The total number of outliers detected is 18:\n", " 5 in the 'historic' period\n", " 13 in the 'new' period\n", "\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " vannmiljo_code sample_date lab period depth1 depth2 ALK_mmol/l \\\n", "3930 021-28450 2022-07-08 Eurofins new 0.0 0.0 0.15 \n", "3932 021-28450 2022-08-09 Eurofins new 0.0 0.0 0.09 \n", "3935 021-28450 2022-09-09 Eurofins new 0.0 0.0 0.16 \n", "6879 022-97011 2022-07-06 Eurofins new 0.0 0.0 0.25 \n", "6881 022-97011 2022-08-03 Eurofins new 0.0 0.0 0.25 \n", "6884 022-97011 2022-09-07 Eurofins new 0.0 0.0 0.24 \n", "8333 024-58894 2022-07-06 Eurofins new 0.0 0.0 0.15 \n", "8334 024-58894 2022-08-02 Eurofins new 0.0 0.0 0.16 \n", "8335 024-58894 2022-09-07 Eurofins new 0.0 0.0 0.16 \n", "14226 027-79278 2022-07-05 Eurofins new 0.0 0.0 0.30 \n", "14227 027-79278 2022-08-02 Eurofins new 0.0 0.0 0.30 \n", "14228 027-79278 2022-09-06 Eurofins new 0.0 0.0 0.27 \n", "17198 030-58838 2022-09-06 Eurofins new 0.0 0.0 0.11 \n", "\n", " ANC_µekv/l CA_mg/l CL_mg/l ... N-NO3_µg/l N N-TOT_µg/l N NA_mg/l \\\n", "3930 79.0 7.7 35.0 ... 270.0 530.0 16.0 \n", "3932 69.0 5.0 26.0 ... 270.0 3600.0 13.0 \n", "3935 110.0 8.6 33.0 ... 720.0 830.0 15.0 \n", "6879 270.0 4.2 6.6 ... 150.0 290.0 4.9 \n", "6881 270.0 4.1 7.1 ... 160.0 270.0 5.5 \n", "6884 230.0 3.9 7.5 ... 140.0 300.0 4.9 \n", "8333 -0.3 3.8 13.0 ... 810.0 990.0 5.3 \n", "8334 120.0 4.1 11.0 ... 1100.0 1200.0 6.6 \n", "8335 97.0 4.2 11.0 ... 1300.0 1600.0 6.6 \n", "14226 230.0 4.8 18.0 ... 530.0 740.0 9.6 \n", "14227 230.0 4.9 18.0 ... 810.0 950.0 10.0 \n", "14228 240.0 4.6 17.0 ... 590.0 790.0 9.3 \n", "17198 100.0 3.6 4.7 ... 
1700.0 2000.0 2.6 \n", "\n", " P-TOT_µg/l P PH_<ubenevnt> RAL_µg/l Al SIO2_µg/l Si SO4_mg/l \\\n", "3930 14.0 6.7 11.0 1309.121172 4.66 \n", "3932 22.0 6.6 31.0 1215.612517 3.90 \n", "3935 8.4 6.6 9.1 1402.629827 4.82 \n", "6879 7.0 7.1 27.0 2477.979361 2.47 \n", "6881 5.7 7.0 28.0 2805.259654 3.06 \n", "6884 4.1 7.0 29.0 2805.259654 2.67 \n", "8333 12.0 6.7 7.3 701.314913 3.98 \n", "8334 6.9 7.0 8.9 1122.103862 3.93 \n", "8335 4.8 6.9 5.0 1122.103862 4.66 \n", "14226 37.0 7.1 7.3 1683.155792 3.67 \n", "14227 31.0 7.2 6.9 2197.453395 3.87 \n", "14228 33.0 7.0 5.0 1215.612517 3.17 \n", "17198 110.0 6.2 9.1 794.823569 1.94 \n", "\n", " TOC_mg/l C pred \n", "3930 4.6 outlier \n", "3932 7.0 outlier \n", "3935 4.4 outlier \n", "6879 4.9 outlier \n", "6881 4.6 outlier \n", "6884 5.1 outlier \n", "8333 3.0 outlier \n", "8334 2.5 outlier \n", "8335 2.1 outlier \n", "14226 3.6 outlier \n", "14227 3.3 outlier \n", "14228 4.1 outlier \n", "17198 2.8 outlier \n", "\n", "[13 rows x 24 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Identifier columns, parameters to exclude, and parameters used for outlier detection\n", "key_cols = [\"vannmiljo_code\", \"sample_date\", \"lab\", \"period\", \"depth1\", \"depth2\"]\n", "excl_cols = [\"TEMP_°C\", \"LAL_µg/l Al\"]\n", "par_cols = [col for col in df.columns if col not in (key_cols + excl_cols)]\n", "\n", "# Run algorithm on complete samples only (rows with any missing value are dropped)\n", "data = df[key_cols + par_cols].dropna()\n", "data = utils.isolation_forest(data, par_cols, contamination=0.01)\n", "\n", "# Summarise results, split by period\n", "all_out = data.query(\"pred == 'outlier'\")\n", "his_out = data.query(\"(pred == 'outlier') and (period == 'historic')\")\n", "new_out = data.query(\"(pred == 'outlier') and (period == 'new')\")\n", "\n", "print(f\"The total number of samples in the dataset is: {len(data)}.\\n\")\n", "print(\n", "    f\"The total number of outliers detected is {len(all_out)}:\\n\"\n", "    f\" {len(his_out)} in the 'historic' period\\n\"\n", "    f\" {len(new_out)} in the 'new' period\\n\"\n", ")\n", "\n", 
"new_out" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }