{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import sqlite3\n", "import sys\n", "\n", "sys.path.append(\"..\")\n", "\n", "import nivapy3 as nivapy\n", "import numpy as np\n", "import pandas as pd\n", "import utils" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Tiltaksovervakingen: opsjon for kvalitetskontroll av analysedata\n", "# Eurofins 2020 Q4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook 1: Initial exploration and data cleaning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Initial data check\n", "\n", "**Note:** This notebook uses an updated export from Vannmiljø reflecting corrections made to two station codes in August 2021 - see e-mail from Kjetil received 27.05.2021 at 17:15 for details.\n", "\n", "Initial screening of the Eurofins Excel file shows the following:\n", "\n", " * The data matches the VestfoldLAB submission template, with the exception that SiO2 is reported by Eurofins in ug/l, not mg/l. I have added additional columns to `../../data/parameter_unit_mapping.xlsx` in order to correctly convert the units reported by Eurofins\n", " \n", " * There are 61 records without a valid `Lokalitets ID`. These records are associated with 6 stations. 5 of these stations were newly added in 2020 and the correct IDs were only supplied by Kjetil in October 2020, so it makes sense that these should be missing. However, one is an old station (Barstadvassdraget v Liland (1); 026-30849) so it's strange that the code for this was missing. I have added the missing codes based on data in `../../data/active_stations_2020.xlsx`, but Eurofins will need to do this too\n", " \n", " * Some cells are marked `.`, others `-`, others `N/A` and others left blank. Is there any significance to these characters? For now, I have assumed they are all equivalent ways of indicating \"no data\" and have deleted these entries accordingly. 
**Check with Eurofins**\n", " \n", " * The Eurofins data includes `<` characters, whereas VestfoldLAB removed these and replaced the original value by half the LOQ. I will keep the `<` characters, and simply use the LOQ value itself (rather than half the value) in any subsequent analysis\n", " \n", " * The Excel file mixes `.` and `,` as the decimal separator. I have replaced all occurrences of `,` as a decimal separator with `.`\n", " \n", " * Inconsistent flag combinations for Al fractions. Usually LAl is simply calculated as (RAl - ILAl), which leads to negative values being reported in some cases. However, in one example, RAl is reported as `5.3` ug/l and ILAl as `< 5` ug/l. For this sample, LAl is reported as `< 0,3` ug/l, which doesn't seem correct. Valid values for LAl in this case lie in the range $0.3 < LAl \\leq 5.3$ ug/l. For now, I have converted this value to 0.3 ug/l (but we could instead choose the middle of the range). " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Choose dataset to process\n", "lab = \"Eurofins\"\n", "year = 2020\n", "qtr = 4\n", "version = 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Create SQLite database to store results\n", "\n", "Using a database will provide basic checks on data integrity and consistency. For this project, three tables will be sufficient:\n", "\n", " * Station locations and metadata\n", " * Parameters and units used by VestfoldLAB, Eurofins and Vannmiljø, and conversion factors between these\n", " * Water chemistry data\n", " \n", "The code below creates a basic database structure, which will be populated later." 
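, "\n", "Once the cell below has run, the structure can be sanity-checked (a minimal sketch, not part of the original workflow) by listing the tables from SQLite's built-in `sqlite_master` catalogue, reusing the same `eng` connection:\n", "\n", "```python\n", "tables = [row[0] for row in eng.execute(\n", "    \"SELECT name FROM sqlite_master WHERE type = 'table'\"\n", ")]\n", "print(tables)  # expected: ['stations', 'parameters_units', 'water_chemistry']\n", "```"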
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create database\n", "fold_path = f\"../../output/{lab.lower()}_{year}_q{qtr}_v{version}\"\n", "if not os.path.exists(fold_path):\n", " os.makedirs(fold_path)\n", "\n", "db_path = os.path.join(fold_path, \"kalk_data.db\")\n", "if os.path.exists(db_path):\n", " os.remove(db_path)\n", "eng = sqlite3.connect(db_path, detect_types=sqlite3.PARSE_DECLTYPES)\n", "\n", "# Turn off journal mode for performance\n", "eng.execute(\"PRAGMA synchronous = OFF\")\n", "eng.execute(\"PRAGMA journal_mode = OFF\")\n", "eng.execute(\"PRAGMA foreign_keys = ON\")\n", "\n", "# Create stations table\n", "sql = (\n", " \"CREATE TABLE stations \"\n", " \"( \"\n", " \" fylke text NOT NULL, \"\n", " \" vassdrag text NOT NULL, \"\n", " \" station_name text NOT NULL, \"\n", " \" station_number text, \"\n", " \" vannmiljo_code text NOT NULL, \"\n", " \" vannmiljo_name text, \"\n", " \" utm_east real NOT NULL, \"\n", " \" utm_north real NOT NULL, \"\n", " \" utm_zone integer NOT NULL, \"\n", " \" lon real NOT NULL, \"\n", " \" lat real NOT NULL, \"\n", " \" liming_status text NOT NULL, \"\n", " \" comment text, \"\n", " \" PRIMARY KEY (vannmiljo_code) \"\n", " \")\"\n", ")\n", "eng.execute(sql)\n", "\n", "# Create parameters table\n", "sql = (\n", " \"CREATE TABLE parameters_units \"\n", " \"( \"\n", " \" vannmiljo_name text NOT NULL UNIQUE, \"\n", " \" vannmiljo_id text NOT NULL UNIQUE, \"\n", " \" vannmiljo_unit text NOT NULL, \"\n", " \" vestfoldlab_name text NOT NULL UNIQUE, \"\n", " \" vestfoldlab_unit text NOT NULL, \"\n", " \" vestfoldlab_to_vm_conv_fac real NOT NULL, \"\n", " \" eurofins_name text NOT NULL UNIQUE, \"\n", " \" eurofins_unit text NOT NULL, \"\n", " \" eurofins_to_vm_conv_fac real NOT NULL, \"\n", " \" min real NOT NULL, \"\n", " \" max real NOT NULL, \"\n", " \" PRIMARY KEY 
(vannmiljo_id) \"\n", " \")\"\n", ")\n", "eng.execute(sql)\n", "\n", "# Create chemistry table\n", "sql = (\n", " \"CREATE TABLE water_chemistry \"\n", " \"( \"\n", " \" vannmiljo_code text NOT NULL, \"\n", " \" sample_date datetime NOT NULL, \"\n", " \" lab text NOT NULL, \"\n", " \" period text NOT NULL, \"\n", " \" depth1 real, \"\n", " \" depth2 real, \"\n", " \" parameter text NOT NULL, \"\n", " \" flag text, \"\n", " \" value real NOT NULL, \"\n", " \" unit text NOT NULL, \"\n", " \" PRIMARY KEY (vannmiljo_code, sample_date, depth1, depth2, parameter), \"\n", " \" CONSTRAINT vannmiljo_code_fkey FOREIGN KEY (vannmiljo_code) \"\n", " \" REFERENCES stations (vannmiljo_code) \"\n", " \" ON UPDATE NO ACTION ON DELETE NO ACTION, \"\n", " \" CONSTRAINT parameter_fkey FOREIGN KEY (parameter) \"\n", " \" REFERENCES parameters_units (vannmiljo_id) \"\n", " \" ON UPDATE NO ACTION ON DELETE NO ACTION \"\n", " \")\"\n", ")\n", "eng.execute(sql)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Explore station data\n", "\n", "Station details are stored in `../../data/active_stations_2020.xlsx`, which is a tidied version of Øyvind's original file here:\n", "\n", " K:\\Prosjekter\\langtransporterte forurensninger\\Kalk Tiltaksovervåking\\12 KS vannkjemi\\Vannlokaliteter koordinater_kun aktive stasj 2020.xlsx\n", " \n", "Note that corrections (e.g. adjusted station co-ordinates) have been made to the tidied file, but not the original on `K:`. **The version in this repository should therefore been used as the \"master\" copy**." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The following stations are missing spatial co-ordinates:\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fylkevassdragstation_namestation_numbervannmiljo_codevannmiljo_nameutm_eastutm_northutm_zoneliming_statuscommentlatlon
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [fylke, vassdrag, station_name, station_number, vannmiljo_code, vannmiljo_name, utm_east, utm_north, utm_zone, liming_status, comment, lat, lon]\n", "Index: []" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read station data\n", "stn_df = pd.read_excel(r\"../../data/active_stations_2020.xlsx\", sheet_name=\"data\")\n", "stn_df = nivapy.spatial.utm_to_wgs84_dd(stn_df)\n", "\n", "print(\"The following stations are missing spatial co-ordinates:\")\n", "stn_df.query(\"lat != lat\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The following stations do not have a code in Vannmiljø:\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fylkevassdragstation_namestation_numbervannmiljo_codevannmiljo_nameutm_eastutm_northutm_zoneliming_statuscommentlatlon
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [fylke, vassdrag, station_name, station_number, vannmiljo_code, vannmiljo_name, utm_east, utm_north, utm_zone, liming_status, comment, lat, lon]\n", "Index: []" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(\"The following stations do not have a code in Vannmiljø:\")\n", "stn_df.query(\"vannmiljo_code != vannmiljo_code\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Make this Notebook Trusted to load map: File -> Trust Notebook
" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Map\n", "stn_map = nivapy.spatial.quickmap(\n", " stn_df.dropna(subset=[\"lat\"]),\n", " lat_col=\"lat\",\n", " lon_col=\"lon\",\n", " popup=\"station_name\",\n", " cluster=True,\n", " kartverket=True,\n", " aerial_imagery=True,\n", ")\n", "\n", "stn_map.save(\"../../pages/stn_map.html\")\n", "\n", "stn_map" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "218" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Add to database\n", "stn_df.dropna(subset=[\"vannmiljo_code\", \"lat\"], inplace=True)\n", "stn_df.to_sql(name=\"stations\", con=eng, if_exists=\"append\", index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Parameters and units of interest\n", "\n", "The file `../../data/parameter_unit_mapping.xlsx` provides a lookup between parameter names & units used by the labs and those in Vannmiljø. It also contains plausible ranges (using Vannmiljø units) for each parameter. These ranges have been chosen by using the values already in Vannmiljø as a reference. However, it looks as though some of the data in Vannmiljø might also be spurious, so it would be good to refine these ranges based on domain knowledge, if possible.\n", "\n", "**Note:** Concentrations reported as *exactly* zero are likely to be errors, because most (all?) lab methods should report an LOQ instead." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", "
vannmiljo_namevannmiljo_idvannmiljo_unitvestfoldlab_namevestfoldlab_unitvestfoldlab_to_vm_conv_faceurofins_nameeurofins_uniteurofins_to_vm_conv_facminmax
0TemperaturTEMP°CTemp°C1.000000Temp°C1.000000-1030
1pHPH<ubenevnt>pHenh1.000000pHenh1.000000110
2KonduktivitetKONDmS/mKondmS/m1.000000Kondms/m1.0000000100
3Total alkalitetALKmmol/lAlkmmol/l1.000000Alkmmol/l1.00000002
4TotalfosforP-TOTµg/l PTot-Pµg/l1.000000Tot-Pµg/l1.0000000500
5TotalnitrogenN-TOTµg/l NTot-Nµg/l1.000000Tot-Nµg/l1.00000004000
6NitratN-NO3µg/l NNO3µg/l1.000000NO3µg/l1.00000002000
7Totalt organisk karbon (TOC)TOCmg/l CTOCmg/l1.000000TOCmg/l1.0000000100
8Reaktivt aluminiumRALµg/l AlRAlµg/l1.000000RAlµg/l1.0000000500
9Ikke-labilt aluminiumILALµg/l AlILAlµg/l1.000000ILAlµg/l1.0000000500
10Labilt aluminiumLALµg/l AlLAlµg/l1.000000LAlµg/l1.0000000500
11KloridCLmg/lClmg/l1.000000Clmg/l1.0000000100
12SulfatSO4mg/lSO4mg/l1.000000SO4mg/l1.000000020
13KalsiumCAmg/lCamg/l1.000000Camg/l1.0000000500
14KaliumKmg/lKmg/l1.000000Kmg/l1.000000010
15MagnesiumMGmg/lMgmg/l1.000000Mgmg/l1.0000000100
16NatriumNAmg/lNamg/l1.000000Namg/l1.000000050
17Totalt silikatSIO2µg/l SiSIO2mg/l467.543276SIO2µg/l0.46754307000
18Syrenøytraliserende kapasitet (ANC)ANCµekv/lANCµekv/l1.000000ANCµekv/l1.000000-10006000
\n", "
" ], "text/plain": [ " vannmiljo_name vannmiljo_id vannmiljo_unit \\\n", "0 Temperatur TEMP °C \n", "1 pH PH \n", "2 Konduktivitet KOND mS/m \n", "3 Total alkalitet ALK mmol/l \n", "4 Totalfosfor P-TOT µg/l P \n", "5 Totalnitrogen N-TOT µg/l N \n", "6 Nitrat N-NO3 µg/l N \n", "7 Totalt organisk karbon (TOC) TOC mg/l C \n", "8 Reaktivt aluminium RAL µg/l Al \n", "9 Ikke-labilt aluminium ILAL µg/l Al \n", "10 Labilt aluminium LAL µg/l Al \n", "11 Klorid CL mg/l \n", "12 Sulfat SO4 mg/l \n", "13 Kalsium CA mg/l \n", "14 Kalium K mg/l \n", "15 Magnesium MG mg/l \n", "16 Natrium NA mg/l \n", "17 Totalt silikat SIO2 µg/l Si \n", "18 Syrenøytraliserende kapasitet (ANC) ANC µekv/l \n", "\n", " vestfoldlab_name vestfoldlab_unit vestfoldlab_to_vm_conv_fac \\\n", "0 Temp °C 1.000000 \n", "1 pH enh 1.000000 \n", "2 Kond mS/m 1.000000 \n", "3 Alk mmol/l 1.000000 \n", "4 Tot-P µg/l 1.000000 \n", "5 Tot-N µg/l 1.000000 \n", "6 NO3 µg/l 1.000000 \n", "7 TOC mg/l 1.000000 \n", "8 RAl µg/l 1.000000 \n", "9 ILAl µg/l 1.000000 \n", "10 LAl µg/l 1.000000 \n", "11 Cl mg/l 1.000000 \n", "12 SO4 mg/l 1.000000 \n", "13 Ca mg/l 1.000000 \n", "14 K mg/l 1.000000 \n", "15 Mg mg/l 1.000000 \n", "16 Na mg/l 1.000000 \n", "17 SIO2 mg/l 467.543276 \n", "18 ANC µekv/l 1.000000 \n", "\n", " eurofins_name eurofins_unit eurofins_to_vm_conv_fac min max \n", "0 Temp °C 1.000000 -10 30 \n", "1 pH enh 1.000000 1 10 \n", "2 Kond ms/m 1.000000 0 100 \n", "3 Alk mmol/l 1.000000 0 2 \n", "4 Tot-P µg/l 1.000000 0 500 \n", "5 Tot-N µg/l 1.000000 0 4000 \n", "6 NO3 µg/l 1.000000 0 2000 \n", "7 TOC mg/l 1.000000 0 100 \n", "8 RAl µg/l 1.000000 0 500 \n", "9 ILAl µg/l 1.000000 0 500 \n", "10 LAl µg/l 1.000000 0 500 \n", "11 Cl mg/l 1.000000 0 100 \n", "12 SO4 mg/l 1.000000 0 20 \n", "13 Ca mg/l 1.000000 0 500 \n", "14 K mg/l 1.000000 0 10 \n", "15 Mg mg/l 1.000000 0 100 \n", "16 Na mg/l 1.000000 0 50 \n", "17 SIO2 µg/l 0.467543 0 7000 \n", "18 ANC µekv/l 1.000000 -1000 6000 " ] }, "execution_count": 8, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read parameter mappings\n", "par_df = utils.get_par_unit_mappings()\n", "\n", "# Add to database\n", "par_df.to_sql(name=\"parameters_units\", con=eng, if_exists=\"append\", index=False)\n", "\n", "par_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Historic data from Vannmiljø\n", "\n", "The Vannmiljø dataset is large and reading from Excel is slow; the code below takes a couple of minutes to run.\n", "\n", "Note from the output below that **there are more than 1600 \"duplicated\" samples in the Vannmiljø dataset** i.e. where the station code, sample date, sample depth, lab and parameter name are all the same, but a different value is reported. It would be helpful to know why these duplicates were collected e.g. are these reanalysis values, where only one of the duplicates should be used, or are they genuine (in which case should they be averaged or kept separate?). **For the moment, I will ignore these values**." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The number of unique stations with data is: 211.\n", "\n", "\n", "There are 1620 duplicated records (same station_code-date-depth-parameter, but different value).\n", "These will be dropped.\n", "\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
vannmiljo_codesample_datelabdepth1depth2par_unitflagvalueperiod
0027-284352012-01-02NIVA (historic)0.00.0ALK_mmol/l=0.06historic
1027-284352012-01-02NIVA (historic)0.00.0CA_mg/l=1.23historic
2027-284352012-01-02NIVA (historic)0.00.0ILAL_µg/l Al=18.00historic
3027-284352012-01-02NIVA (historic)0.00.0KOND_mS/m=3.80historic
4027-284352012-01-02NIVA (historic)0.00.0PH_<ubenevnt>=6.24historic
\n", "
" ], "text/plain": [ " vannmiljo_code sample_date lab depth1 depth2 par_unit \\\n", "0 027-28435 2012-01-02 NIVA (historic) 0.0 0.0 ALK_mmol/l \n", "1 027-28435 2012-01-02 NIVA (historic) 0.0 0.0 CA_mg/l \n", "2 027-28435 2012-01-02 NIVA (historic) 0.0 0.0 ILAL_µg/l Al \n", "3 027-28435 2012-01-02 NIVA (historic) 0.0 0.0 KOND_mS/m \n", "4 027-28435 2012-01-02 NIVA (historic) 0.0 0.0 PH_ \n", "\n", " flag value period \n", "0 = 0.06 historic \n", "1 = 1.23 historic \n", "2 = 18.00 historic \n", "3 = 3.80 historic \n", "4 = 6.24 historic " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read historic data from Vannmiljø\n", "his_df = utils.read_historic_data(\n", " r\"../../data/vannmiljo_export_2012-19_2021-08-16.xlsx\"\n", ")\n", "\n", "# Tidy lab names for clarity\n", "his_df[\"lab\"].replace(\n", " {\"NIVA\": \"NIVA (historic)\", \"VestfoldLAB AS\": \"VestfoldLAB (historic)\"},\n", " inplace=True,\n", ")\n", "\n", "# Add label for data period\n", "his_df[\"period\"] = \"historic\"\n", "\n", "# Print summary\n", "n_stns = len(his_df[\"vannmiljo_code\"].unique())\n", "print(f\"The number of unique stations with data is: {n_stns}.\\n\")\n", "\n", "# Handle duplicates\n", "his_dup_csv = r\"../../output/vannmiljo_historic/vannmiljo_duplicates.csv\"\n", "his_df = utils.handle_duplicates(his_df, his_dup_csv, action=\"drop\")\n", "\n", "his_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. New data from lab\n", "\n", "The code below reads the Excel template provided by Eurofins and reformats it to the same structure (parameter names, units etc.) as the data in Vannmiljø." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The following location IDs in the new data are not in the definitive station list.\n", "{'062-58824', '062-58827', '062-58828', '062-58826', '021-46388 ', '030-58776', '027-47427', '062-58825'}\n", "\n", "The following location IDs have inconsistent names within this template:\n", " 025-45779 ['Litlåna oppstrøms doserer (5)'] ==> ['Logen nedstrøms doserer ' 'Litlåna oppstrøms doserer ']\n", " 025-81555 ['Stakkeland bru (Rv. 42) (9)'] ==> ['Stakkeland bru (Rv. 42)' 'Stallemobekken v/Underåsen']\n", " 022-54615 ['Logna v/Kyrkjebygda'] ==> ['Logna v/Kyrkjebygda' 'Logna v/Kyrkjebygda ']\n", " 022-90731 ['Entredalsbekken'] ==> ['Entredalsbekken' 'Entredalsbekken ']\n", " 022-97011 ['Stemkjerrbekken ved bru'] ==> ['Stemkjerrbekken ved bru' 'Stemkjerrbekken ved bru ']\n", " 019-58794 ['Fyresvatn 4 dyp (5)'] ==> ['Fyresvatn ' 'Fyresvatn']\n", " 021-101031 ['Otra oppstrøms doserer Iveland (Dalanekilen inntak)'] ==> ['Otra nedstrøms doserer Iveland' 'Otra oppstrøms doserer Iveland']\n", " 036-58752 ['Mosåna oppstrøms doserer (23.1)'] ==> ['Monebekken oppstrøms doserer' 'Monebekken oppstrøms doserer '\n", " 'Mosåna oppstrøms doserer ']\n", " 027-58846 ['Eikeland nedstrøms doserer (51)'] ==> ['Eikeland oppstrøms doserer ' 'Eikeland nedstrøms doserer ']\n", " 062-58820 ['Raundalselva (1)'] ==> ['Evanger kraftstasjon nedstrøms' 'Raundalselva ']\n", " 062-58821 ['Strandaelva (2)'] ==> ['Evanger kraftstasjon nedstrøms' 'Strandaelva ']\n", " 062-58822 ['Teigdalselva (6)'] ==> ['Evanger kraftstasjon nedstrøms' 'Teigdalselva ']\n", " 062-58823 ['Vossedalselva (18)'] ==> ['Evanger kraftstasjon nedstrøms' 'Vossedalselva ']\n", "\n", "The following location names have multiple IDs within this template:\n", " Logen nedstrøms doserer [] ==> ['025-45779' '067-58826']\n", " Stallemobekken v/Underåsen ['022-32019'] ==> ['025-81555' '022-32019']\n", " 
Eikeland oppstrøms doserer [] ==> ['027-58846' '027-58847']\n", " Samnanger [] ==> ['055-58811' '055-58812' '055-58813']\n", " Evanger kraftstasjon nedstrøms [] ==> ['062-58819' '062-58820' '062-58821' '062-58822' '062-58823' '062-58824'\n", " '062-58825' '062-58826' '062-58827' '062-58828']\n", "\n", "WARNING: File contains samples from several year quarters.\n", "\n", "The following samples have nitrate greater than total nitrogen:\n", " vannmiljo_code sample_date depth1 depth2 NO3_µg/l Tot-N_µg/l\n", "513 027-38543 2020-12-08 12:50:27 0 0 380.0 360\n", "574 030-58776 2020-12-08 07:20:51 0 0 290.0 280\n", "634 030-58838 2020-10-13 07:20:17 0 0 730.0 680\n", "650 027-79278 2020-12-08 07:20:36 0 0 760.0 750\n", "809 038-58854 2020-12-08 12:50:42 0 0 290.0 280\n", "892 036-58748 2020-12-08 07:20:42 0 0 260.0 250\n", "1016 038-58868 2020-12-08 12:50:42 0 0 300.0 290\n", "1349 064-82800 2020-12-08 07:20:42 0 0 150.0 140\n", "1424 045-58816 2020-12-08 12:50:45 0 0 410.0 380\n", "1429 045-58817 2020-12-08 12:50:45 0 0 240.0 220\n", "\n", "The following samples have LAl != RAl - ILAl:\n", " vannmiljo_code sample_date depth1 depth2 RAl_µg/l ILAl_µg/l \\\n", "633 030-58838 2020-10-06 07:20:52 0 0 68.0 5.0 \n", "646 027-79278 2020-09-03 07:10:51 0 0 5.2 5.0 \n", "889 036-58748 2020-10-06 11:45:47 0 0 6.6 5.0 \n", "920 036-58752 2020-10-06 11:45:47 0 0 6.6 5.0 \n", "1221 045-58809 2020-09-08 07:15:11 0 0 5.9 5.0 \n", "1326 064-28998 2020-09-09 07:00:55 0 0 8.8 5.0 \n", "1400 045-58814 2020-10-06 07:20:23 0 0 10.0 5.0 \n", "1412 045-58815 2020-10-06 07:20:23 0 0 11.0 5.0 \n", "1446 062-58819 2020-09-09 07:00:55 0 0 5.5 5.0 \n", "1452 062-58825 2020-11-27 07:15:33 0 0 5.2 5.0 \n", "1467 062-58822 2020-10-07 07:20:55 0 0 6.4 5.0 \n", "\n", " LAl_µg/l LAl_Calc_µg/l \n", "633 0.0 63.0 \n", "646 0.0 0.2 \n", "889 0.0 1.6 \n", "920 0.0 1.6 \n", "1221 0.0 0.9 \n", "1326 0.0 3.8 \n", "1400 0.0 5.0 \n", "1412 0.0 6.0 \n", "1446 0.0 0.5 \n", "1452 0.0 0.2 \n", "1467 0.0 1.4 \n", "\n", 
"There are 499 duplicated records (same station_code-date-depth-parameter, but different value).\n", "These will be dropped.\n", "\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
vannmiljo_codesample_datelabdepth1depth2par_unitflagvalueperiod
0025-457792020-09-10 00:00:00Eurofins0.00.0TEMP_°Cnan14.0new
1019-444982020-09-04 07:00:20Eurofins0.00.0TEMP_°Cnan17.3new
2019-444982020-09-10 07:20:15Eurofins0.00.0TEMP_°Cnan16.0new
3019-444982020-09-15 07:00:19Eurofins0.00.0TEMP_°Cnan15.0new
4019-444982020-10-07 07:20:07Eurofins0.00.0TEMP_°Cnan13.0new
\n", "
" ], "text/plain": [ " vannmiljo_code sample_date lab depth1 depth2 par_unit flag \\\n", "0 025-45779 2020-09-10 00:00:00 Eurofins 0.0 0.0 TEMP_°C nan \n", "1 019-44498 2020-09-04 07:00:20 Eurofins 0.0 0.0 TEMP_°C nan \n", "2 019-44498 2020-09-10 07:20:15 Eurofins 0.0 0.0 TEMP_°C nan \n", "3 019-44498 2020-09-15 07:00:19 Eurofins 0.0 0.0 TEMP_°C nan \n", "4 019-44498 2020-10-07 07:20:07 Eurofins 0.0 0.0 TEMP_°C nan \n", "\n", " value period \n", "0 14.0 new \n", "1 17.3 new \n", "2 16.0 new \n", "3 15.0 new \n", "4 13.0 new " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read new data\n", "new_df = utils.read_data_template_to_wide(\n", " f\"../../data/{lab.lower()}_data_{year}_q{qtr}_v{version}.xlsx\",\n", " sheet_name=\"results\",\n", " lab=lab,\n", ")\n", "utils.perform_basic_checks(new_df)\n", "new_df = utils.wide_to_long(new_df, lab)\n", "\n", "# Add label for data period\n", "new_df[\"period\"] = \"new\"\n", "\n", "# Handle duplicates\n", "dup_csv = os.path.join(\n", " fold_path, f\"{lab.lower()}_{year}_q{qtr}_v{version}_duplicates.csv\"\n", ")\n", "new_df = utils.handle_duplicates(new_df, dup_csv, action=\"drop\")\n", "\n", "new_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7 . Combine\n", "\n", "Combine the `historic` and `new` datasets into a single dataframe in \"long\" format." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
vannmiljo_codesample_datelabdepth1depth2flagvalueperiodparameterunit
0027-284352012-01-02NIVA (historic)0.00.0=0.06historicALKmmol/l
1027-284352012-01-02NIVA (historic)0.00.0=1.23historicCAmg/l
2027-284352012-01-02NIVA (historic)0.00.0=18.00historicILALµg/l Al
3027-284352012-01-02NIVA (historic)0.00.0=3.80historicKONDmS/m
4027-284352012-01-02NIVA (historic)0.00.0=6.24historicPH<ubenevnt>
\n", "
" ], "text/plain": [ " vannmiljo_code sample_date lab depth1 depth2 flag value \\\n", "0 027-28435 2012-01-02 NIVA (historic) 0.0 0.0 = 0.06 \n", "1 027-28435 2012-01-02 NIVA (historic) 0.0 0.0 = 1.23 \n", "2 027-28435 2012-01-02 NIVA (historic) 0.0 0.0 = 18.00 \n", "3 027-28435 2012-01-02 NIVA (historic) 0.0 0.0 = 3.80 \n", "4 027-28435 2012-01-02 NIVA (historic) 0.0 0.0 = 6.24 \n", "\n", " period parameter unit \n", "0 historic ALK mmol/l \n", "1 historic CA mg/l \n", "2 historic ILAL µg/l Al \n", "3 historic KOND mS/m \n", "4 historic PH " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Combine\n", "df = pd.concat([his_df, new_df], axis=\"rows\")\n", "\n", "# Separate par and unit\n", "df[[\"parameter\", \"unit\"]] = df[\"par_unit\"].str.split(\"_\", n=1, expand=True)\n", "del df[\"par_unit\"]\n", "\n", "df.reset_index(drop=True, inplace=True)\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Apply correction to historic SIO2\n", "df[\"value\"] = np.where(\n", " (df[\"lab\"] == \"VestfoldLAB (historic)\") & (df[\"parameter\"] == \"SIO2\"),\n", " df[\"value\"] * 467.5432,\n", " df[\"value\"],\n", ")\n", "\n", "# Reclassify (nitrate + nitrite) to nitrate\n", "df[\"parameter\"].replace({\"N-SNOX\": \"N-NO3\"}, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Check data ranges\n", "\n", "A simple method for preliminary quality control is to check whether parameter values are within sensible ranges (as defined in the `parameters_units` table; see Section 4 above). I believe this screening should be implemented differently for the `historic` (i.e. Vannmiljø) and `new` datsets, as follows:\n", "\n", " * For the `historic` data in Vannmiljø, values outside the plausible ranges should be **removed from the dataset entirely**. 
This is because we intend to use the Vannmiljø data as a reference against which new values will be compared, so it is important that the dataset does not contain anything too strange. Ideally, the reference dataset should be carefully manually curated to ensure it is as good as possible, but I'm not sure we have the resources in this project to thoroughly quality assess the data *already* in Vannmiljø. Dealing with any obvious issues is a good start, though\n", " \n", " * For the `new` data, values outside the plausible ranges should be highlighted and checked with the reporting lab\n", " \n", "**Note:** At present, my code will remove any concentration values of exactly zero from the historic dataset. **Check with Øyvind whether this is too strict**." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Checking data ranges for the 'historic' period.\n", " KOND: Maximum value of 309.00 is greater than or equal to upper limit (100.00).\n", " LAL: Minimum value of 0.00 is less than or equal to lower limit (0.00).\n", " SO4: Minimum value of 0.00 is less than or equal to lower limit (0.00).\n", " CA: Minimum value of 0.00 is less than or equal to lower limit (0.00).\n", " SIO2: Maximum value of 51897.30 is greater than or equal to upper limit (7000.00).\n", "\n", "Checking data ranges for the 'new' period.\n", " TEMP: Maximum value of 30.00 is greater than or equal to upper limit (30.00).\n", " N-NO3: Maximum value of 2700.00 is greater than or equal to upper limit (2000.00).\n", " LAL: Minimum value of -1.00 is less than or equal to lower limit (0.00).\n", " SO4: Maximum value of 58.00 is greater than or equal to upper limit (20.00).\n", " ANC: Minimum value of -1000.00 is less than or equal to lower limit (-1000.00).\n", "\n", "Dropping problem rows from historic data.\n", " Dropping rows for KOND.\n", " Dropping rows for LAL.\n", " Dropping rows for SO4.\n", " Dropping rows for 
CA.\n", " Dropping rows for SIO2.\n" ] } ], "source": [ "# Check ranges\n", "df = utils.check_data_ranges(df)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "ename": "IntegrityError", "evalue": "FOREIGN KEY constraint failed", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mIntegrityError\u001b[0m Traceback (most recent call last)", "Input \u001b[0;32mIn [14]\u001b[0m, in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# Add to database\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m \u001b[43mdf\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mto_sql\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 3\u001b[0m \u001b[43m \u001b[49m\u001b[43mname\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mwater_chemistry\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 4\u001b[0m \u001b[43m \u001b[49m\u001b[43mcon\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43meng\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 5\u001b[0m \u001b[43m \u001b[49m\u001b[43mif_exists\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mappend\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 6\u001b[0m \u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[1;32m 7\u001b[0m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mmulti\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 8\u001b[0m \u001b[43m \u001b[49m\u001b[43mchunksize\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m1000\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 9\u001b[0m \u001b[43m)\u001b[49m\n", 
"File \u001b[0;32m/opt/conda/lib/python3.9/site-packages/pandas/core/generic.py:2963\u001b[0m, in \u001b[0;36mNDFrame.to_sql\u001b[0;34m(self, name, con, schema, if_exists, index, index_label, chunksize, dtype, method)\u001b[0m\n\u001b[1;32m 2806\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 2807\u001b[0m \u001b[38;5;124;03mWrite records stored in a DataFrame to a SQL database.\u001b[39;00m\n\u001b[1;32m 2808\u001b[0m \n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 2959\u001b[0m \u001b[38;5;124;03m[(1,), (None,), (2,)]\u001b[39;00m\n\u001b[1;32m 2960\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m \u001b[38;5;66;03m# noqa:E501\u001b[39;00m\n\u001b[1;32m 2961\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mpandas\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mio\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m sql\n\u001b[0;32m-> 2963\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43msql\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mto_sql\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 2964\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2965\u001b[0m \u001b[43m \u001b[49m\u001b[43mname\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2966\u001b[0m \u001b[43m \u001b[49m\u001b[43mcon\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2967\u001b[0m \u001b[43m \u001b[49m\u001b[43mschema\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mschema\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2968\u001b[0m \u001b[43m \u001b[49m\u001b[43mif_exists\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mif_exists\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2969\u001b[0m \u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2970\u001b[0m \u001b[43m \u001b[49m\u001b[43mindex_label\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mindex_label\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 
2971\u001b[0m \u001b[43m \u001b[49m\u001b[43mchunksize\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mchunksize\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2972\u001b[0m \u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2973\u001b[0m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2974\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m/opt/conda/lib/python3.9/site-packages/pandas/io/sql.py:697\u001b[0m, in \u001b[0;36mto_sql\u001b[0;34m(frame, name, con, schema, if_exists, index, index_label, chunksize, dtype, method, engine, **engine_kwargs)\u001b[0m\n\u001b[1;32m 692\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(frame, DataFrame):\n\u001b[1;32m 693\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mNotImplementedError\u001b[39;00m(\n\u001b[1;32m 694\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mframe\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m argument should be either a Series or a DataFrame\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 695\u001b[0m )\n\u001b[0;32m--> 697\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mpandas_sql\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mto_sql\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 698\u001b[0m \u001b[43m \u001b[49m\u001b[43mframe\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 699\u001b[0m \u001b[43m \u001b[49m\u001b[43mname\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 700\u001b[0m \u001b[43m \u001b[49m\u001b[43mif_exists\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mif_exists\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 701\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mindex\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 702\u001b[0m \u001b[43m \u001b[49m\u001b[43mindex_label\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mindex_label\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 703\u001b[0m \u001b[43m \u001b[49m\u001b[43mschema\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mschema\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 704\u001b[0m \u001b[43m \u001b[49m\u001b[43mchunksize\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mchunksize\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 705\u001b[0m \u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 706\u001b[0m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 707\u001b[0m \u001b[43m \u001b[49m\u001b[43mengine\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mengine\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 708\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mengine_kwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 709\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m/opt/conda/lib/python3.9/site-packages/pandas/io/sql.py:2190\u001b[0m, in \u001b[0;36mSQLiteDatabase.to_sql\u001b[0;34m(self, frame, name, if_exists, index, index_label, schema, chunksize, dtype, method, **kwargs)\u001b[0m\n\u001b[1;32m 2180\u001b[0m table \u001b[38;5;241m=\u001b[39m SQLiteTable(\n\u001b[1;32m 2181\u001b[0m name,\n\u001b[1;32m 2182\u001b[0m \u001b[38;5;28mself\u001b[39m,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 2187\u001b[0m dtype\u001b[38;5;241m=\u001b[39mdtype,\n\u001b[1;32m 2188\u001b[0m )\n\u001b[1;32m 2189\u001b[0m table\u001b[38;5;241m.\u001b[39mcreate()\n\u001b[0;32m-> 2190\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m 
\u001b[43mtable\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43minsert\u001b[49m\u001b[43m(\u001b[49m\u001b[43mchunksize\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m/opt/conda/lib/python3.9/site-packages/pandas/io/sql.py:950\u001b[0m, in \u001b[0;36mSQLTable.insert\u001b[0;34m(self, chunksize, method)\u001b[0m\n\u001b[1;32m 947\u001b[0m \u001b[38;5;28;01mbreak\u001b[39;00m\n\u001b[1;32m 949\u001b[0m chunk_iter \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mzip\u001b[39m(\u001b[38;5;241m*\u001b[39m(arr[start_i:end_i] \u001b[38;5;28;01mfor\u001b[39;00m arr \u001b[38;5;129;01min\u001b[39;00m data_list))\n\u001b[0;32m--> 950\u001b[0m num_inserted \u001b[38;5;241m=\u001b[39m \u001b[43mexec_insert\u001b[49m\u001b[43m(\u001b[49m\u001b[43mconn\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mkeys\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mchunk_iter\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 951\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m num_inserted \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 952\u001b[0m total_inserted \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n", "File \u001b[0;32m/opt/conda/lib/python3.9/site-packages/pandas/io/sql.py:1902\u001b[0m, in \u001b[0;36mSQLiteTable._execute_insert_multi\u001b[0;34m(self, conn, keys, data_iter)\u001b[0m\n\u001b[1;32m 1900\u001b[0m data_list \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mlist\u001b[39m(data_iter)\n\u001b[1;32m 1901\u001b[0m flattened_data \u001b[38;5;241m=\u001b[39m [x \u001b[38;5;28;01mfor\u001b[39;00m row \u001b[38;5;129;01min\u001b[39;00m data_list \u001b[38;5;28;01mfor\u001b[39;00m x \u001b[38;5;129;01min\u001b[39;00m row]\n\u001b[0;32m-> 1902\u001b[0m 
\u001b[43mconn\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mexecute\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43minsert_statement\u001b[49m\u001b[43m(\u001b[49m\u001b[43mnum_rows\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mlen\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mdata_list\u001b[49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mflattened_data\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1903\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m conn\u001b[38;5;241m.\u001b[39mrowcount\n", "\u001b[0;31mIntegrityError\u001b[0m: FOREIGN KEY constraint failed" ] } ], "source": [ "# Add to database\n", "df.to_sql(\n", " name=\"water_chemistry\",\n", " con=eng,\n", " if_exists=\"append\",\n", " index=False,\n", " method=\"multi\",\n", " chunksize=1000,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "eng.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9. Version 1 summary\n", "\n", "Points to note from the initial data exploration:\n", "\n", "### 9.1. Data formatting and missing data\n", "\n", " * The data submitted matches the VestfoldLAB submission template, with the exception that SiO2 is reported by Eurofins in ug/l, not mg/l. This is fine (I have added additional columns to `../../data/parameter_unit_mapping.xlsx` in order to correctly convert the units reported by Eurofins)\n", " \n", " * There are 61 records in the Eurofins dataset without a valid `Lokalitets ID`. These records are associated with 6 stations. 5 of these were newly added in 2020 and the correct IDs were only supplied by Kjetil in October 2020, so perhaps it makes sense that these should be missing. However, one is an old station (Barstadvassdraget v Liland (1); 026-30849) so it's strange that the code for this was missing. 
I have added the missing codes based on data in `../../data/active_stations_2020.xlsx`, but **Eurofins will need to do this too**\n", " \n", " * Some cells are marked `.`, others `-`, others `N/A` and others left blank. Is there any significance to these characters? For now, I have assumed they are all equivalent ways of indicating \"no data\" and have deleted these entries accordingly. **Check with Eurofins**\n", " \n", " * The Eurofins data includes `<` characters, whereas VestfoldLAB removed these and replaced the original value with half the LOQ. This is fine; I will keep the `<` characters, and simply use the LOQ value itself (rather than half the value) in subsequent analyses\n", " \n", " * The Excel file from Eurofins mixes `.` and `,` as the decimal separator. I have replaced all occurrences of `,` as a decimal separator with `.`. **Eurofins will probably need to do the same before submitting to Vannmiljø**\n", " \n", " * Inconsistent flag combinations for Al fractions. LAl is simply calculated as (RAl - ILAl), which leads to negative values being reported in some cases (**is this OK?**). However, for one sample, RAl is reported as `5.3` ug/l and ILAl as `< 5` ug/l, but LAl is reported as `< 0,3` ug/l, which doesn't seem correct. Valid values for LAl in this case lie in the range $0.3 < LAl \\leq 5.3$ ug/l (i.e. LAl is *greater than* 0.3 ug/l). For now, I have converted this value to 0.3 ug/l, but we could choose the middle of the valid range instead\n", " \n", "### 9.2. Duplicates\n", "\n", " * There are **499 duplicated records** in the Eurofins data submission, where the same station_code-date-depth-parameter combination has been reported more than once (see [here](https://github.com/NIVANorge/tiltaksovervakingen/blob/master/output/eurofins_2020_q4_duplicates.csv) for a list). It looks as though these are genuine duplicates (i.e. these samples have been analysed more than once), because they have different `Labreferanse` codes in the spreadsheet. 
Are these reanalysis values (in which case only the best/most recent value should probably be included), or do they have another significance? **Check with Eurofins**\n", " \n", "### 9.3. Stations\n", "\n", "The following location IDs have inconsistent names within this template:\n", "\n", "    025-45779 ['Litlåna oppstrøms doserer (5)'] ==> ['Logen nedstrøms doserer ' 'Litlåna oppstrøms doserer ']\n", "    025-81555 ['Stakkeland bru (Rv. 42) (9)'] ==> ['Stakkeland bru (Rv. 42)' 'Stallemobekken v/Underåsen']\n", "    022-54615 ['Logna v/Kyrkjebygda'] ==> ['Logna v/Kyrkjebygda' 'Logna v/Kyrkjebygda ']\n", "    022-90731 ['Entredalsbekken'] ==> ['Entredalsbekken' 'Entredalsbekken ']\n", "    022-97011 ['Stemkjerrbekken ved bru'] ==> ['Stemkjerrbekken ved bru' 'Stemkjerrbekken ved bru ']\n", "    019-58794 ['Fyresvatn 4 dyp (5)'] ==> ['Fyresvatn ' 'Fyresvatn']\n", "    021-101031 ['Otra oppstrøms doserer Iveland (Dalanekilen inntak)'] ==> ['Otra nedstrøms doserer Iveland' 'Otra oppstrøms doserer Iveland']\n", "    036-58752 ['Mosåna oppstrøms doserer (23.1)'] ==> ['Monebekken oppstrøms doserer' 'Monebekken oppstrøms doserer ' 'Mosåna oppstrøms doserer '] \n", "    027-58846 ['Eikeland nedstrøms doserer (51)'] ==> ['Eikeland oppstrøms doserer ' 'Eikeland nedstrøms doserer ']\n", "    062-58820 ['Raundalselva (1)'] ==> ['Evanger kraftstasjon nedstrøms' 'Raundalselva ']\n", "    062-58821 ['Strandaelva (2)'] ==> ['Evanger kraftstasjon nedstrøms' 'Strandaelva ']\n", "    062-58822 ['Teigdalselva (6)'] ==> ['Evanger kraftstasjon nedstrøms' 'Teigdalselva ']\n", "    062-58823 ['Vossedalselva (18)'] ==> ['Evanger kraftstasjon nedstrøms' 'Vossedalselva ']\n", "\n", "The following location names have multiple IDs within this template:\n", "\n", "    Logen nedstrøms doserer  [] ==> ['025-45779' '067-58826']\n", "    Stallemobekken v/Underåsen ['022-32019'] ==> ['025-81555' '022-32019']\n", "    Eikeland oppstrøms doserer  [] ==> ['027-58846' '027-58847']\n", "    Samnanger [] ==> ['055-58811' '055-58812' '055-58813']\n", "    
Evanger kraftstasjon nedstrøms [] ==> ['062-58819' '062-58820' '062-58821' '062-58822' '062-58823' '062-58824' '062-58825' '062-58826' '062-58827' '062-58828']\n", " \n", "### 9.4. Range checks\n", "\n", "**Note:** Some of the points identified below are explored more thoroughly in the time series analysis.\n", "\n", " * Calculating LAl as (RAl - ILAl) gives negative values in some cases. Should these be reported \"as is\", or should these values be set to zero? **Check with Kjetil and Eurofins**\n", " \n", " * Values for nitrate > 2000 ug/l seem very high?\n", " \n", " * An SO4 concentration of 58 mg/l seems very high?\n", " \n", " * ANC of -1000 uekv/l seems implausibly low?\n", " \n", "The following samples have nitrate greater than total nitrogen:\n", "\n", " vannmiljo_code sample_date depth1 depth2 NO3_µg/l Tot-N_µg/l\n", " 513 027-38543 2020-12-08 12:50:27 0 0 380.0 360\n", " 574 030-58776 2020-12-08 07:20:51 0 0 290.0 280\n", " 634 030-58838 2020-10-13 07:20:17 0 0 730.0 680\n", " 650 027-79278 2020-12-08 07:20:36 0 0 760.0 750\n", " 809 038-58854 2020-12-08 12:50:42 0 0 290.0 280\n", " 892 036-58748 2020-12-08 07:20:42 0 0 260.0 250\n", " 1016 038-58868 2020-12-08 12:50:42 0 0 300.0 290\n", " 1349 064-82800 2020-12-08 07:20:42 0 0 150.0 140\n", " 1424 045-58816 2020-12-08 12:50:45 0 0 410.0 380\n", " 1429 045-58817 2020-12-08 12:50:45 0 0 240.0 220" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.10" } }, "nbformat": 4, "nbformat_minor": 4 }