{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Approximate intermediate results of the nowcasting from the RKI situation report\n", "\n", "Thomas Viehmann, \n", "\n", "Here we extract the numbers of known, imputed, and nowcast cases from the situation reports.\n", "\n", "Please observe the data license for the RKI data." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "import collections\n", "import bs4\n", "import pandas\n", "%matplotlib inline\n", "from matplotlib import pyplot\n", "import math\n", "import numpy\n", "import re\n", "import datetime" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The starting point is the page with the nowcasting chart from the situation report, converted to SVG:\n", "\n", "https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Situationsberichte/Gesamt.html\n", "\n", "You can obtain it with e.g. `pdf2svg 2020-05-16-de.pdf 2020-05-16-de.svg 7` or with Inkscape.\n", "I have tried this on the reports of a few days, but it may not be robust, because it does not cover the \"endless possibilities\" of the SVG format.\n", "\n", "We use the nowcasting numbers published by the RKI to compute the scaling (it would of course be more elegant, because more self-contained, to look for a horizontal axis line (e.g. 4000) in the chart).\n", "\n", "https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Projekte_RKI/Nowcasting.html\n", "\n", "We also use that time series as a sanity check.\n", "\n", "Here are the file names:" ] },
{ "cell_type": "code", "execution_count": 226, "metadata": {}, "outputs": [], "source": [ "fn_nowcasting_excel = 'Nowcasting_Zahlen.2020-05-17.xlsx'\n", "fn_svg_from_report = '/tmp/2020-05-17-de.svg'\n", "fn_output = './nowcast_imputed_known_from_graph_2020-05-17.csv'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, read the table:" ] }, { "cell_type": "code", "execution_count": 311, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5330" ] }, "execution_count": 311, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rki_numbers = pandas.read_excel(fn_nowcasting_excel, sheet_name='Nowcast_R', index_col=0)\n", "nowcasting_excel = rki_numbers.iloc[:, 3]\n", "assert nowcasting_excel.name.startswith('Punkts')\n", "max_from_excel = nowcasting_excel.max()\n", "max_from_excel # should be around 5331, but does vary (presumably due to stochastic imputation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extracting the numbers is very ad hoc. We find the `path` elements filled with the matching colors, run an improvised parser over their drawing commands, and record the minimum and maximum x and y coordinates of \"the bars\"."
 ] }, { "cell_type": "code", "execution_count": 325, "metadata": {}, "outputs": [], "source": [ "with open(fn_svg_from_report) as f:\n", "    root = bs4.BeautifulSoup(f, 'html.parser')\n", "fill_re = re.compile('fill:rgb|fill:#')\n", "# fill colors of the three bar series, as rgb percentages (pdf2svg) or hex strings\n", "lookfor = {(21, 48, 64) : 'known', (56, 72, 85): 'nowcast', (70, 70, 70) : 'imputed',\n", "           'fill:#367ba4': 'known', 'fill:#b3b3b3': 'imputed', 'fill:#90bad9': 'nowcast'}\n", "paths_with_fill_style = [p for p in root.find_all('path') if fill_re.search(p.get('style',''))]\n", "\n", "values = collections.defaultdict(list)\n", "for p in paths_with_fill_style:\n", "    # strip before matching so that '; fill:...' with a leading space is found, too\n", "    fill_style = [s.strip() for s in p['style'].split(';') if fill_re.match(s.strip()) is not None][0]\n", "    if 'rgb' in fill_style:\n", "        rgb = tuple(int(float(s)) for s in re.match(r\"fill:rgb\\(([0-9\\.]+)%,([0-9\\.]+)%,([0-9\\.]+)%\\)\", fill_style).groups())\n", "    else:\n", "        rgb = fill_style.strip()\n", "    key = lookfor.get(rgb)\n", "    if key is not None:\n", "        movements = re.split('[, ]+', p['d'].strip())\n", "        i = 0\n", "        x = y = 0.0  # current point; a leading relative 'm' then acts as absolute\n", "        start = (x, y)  # start of the current subpath\n", "        ymin = math.inf\n", "        ymax = -math.inf\n", "        xmax = -math.inf\n", "        xmin = math.inf\n", "        while i < len(movements):\n", "            v = movements[i]\n", "            if v in {'M', 'L'}:\n", "                x = float(movements[i + 1])\n", "                y = float(movements[i + 2])\n", "                i += 3\n", "            elif v in {'m', 'l'}:\n", "                # relative commands are offsets from the current point\n", "                x += float(movements[i + 1])\n", "                y += float(movements[i + 2])\n", "                i += 3\n", "            elif v in {'Z', 'z'}:\n", "                x, y = start  # closing returns to the start of the subpath\n", "                i += 1\n", "            elif v == 'h':\n", "                x += float(movements[i+1])\n", "                i += 2\n", "            elif v == 'H':\n", "                x = float(movements[i+1])\n", "                i += 2\n", "            elif v == 'v':\n", "                y += float(movements[i+1])\n", "                i += 2\n", "            elif v == 'V':\n", "                y = float(movements[i+1])\n", "                i += 2\n", "            else:\n", "                raise Exception(f\"need to parse more path: {i}, {v}\")\n", "            if v in {'M', 'm'}:\n", "                start = (x, y)\n", "            xmin = min(xmin, x)\n", "            ymin = min(ymin, y)\n", "            ymax = max(ymax, y)\n", "            xmax = max(xmax, x)\n", "        values[key].append((xmin, ymin, ymax))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now we have too much: at the very least, the legend gives us \"strays\". We determine a \"baseline\" that \"almost all\" paths share as their minimum or maximum y extent. I am not familiar with SVG, but I have the impression that the coordinate system can have its small y values either at the bottom or at the top, so the script checks both. Then we keep only the paths that sit on the baseline." ] }, { "cell_type": "code", "execution_count": 326, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "nowcast 86 85\n", "imputed 86 85\n", "known 85 84\n" ] } ], "source": [ "values = dict(values)\n", "for k, v in values.items():\n", "    vals = numpy.array(sorted(v))\n", "    # drop legend\n", "    p25 = numpy.percentile(vals, 25, axis=0)\n", "    p75 = numpy.percentile(vals, 75, axis=0)\n", "    iqr = p75-p25\n", "    if iqr[1] < 1:\n", "        # baseline is the shared ymin; the bar height is ymax - ymin\n", "        values[k] = numpy.array([(x, ymax - ymin) for x, ymin, ymax in sorted(v) if numpy.abs(ymin-p75[1]) < 1])\n", "    elif iqr[2] < 1:\n", "        # baseline is the shared ymax\n", "        values[k] = numpy.array([(x, ymax - ymin) for x, ymin, ymax in sorted(v) if numpy.abs(ymax-p75[2]) < 1])\n", "    else:\n", "        raise Exception('Need ymin or ymax to vary little')\n", "    print(k, len(vals), len(values[k]))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems possible that values are missing (at the beginning?). So we build indices from the x position and fill the gaps with zeros."
 ] }, { "cell_type": "code", "execution_count": 327, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "85" ] }, "execution_count": 327, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# collect all bar positions (x in tenths of a unit) across the three series\n", "xes = set()\n", "for v in values.values():\n", "    xes |= {int(10*x) for x in v[:, 0]}\n", "xdict = {v: i for i,v in enumerate(sorted(xes))}\n", "\n", "for k, v in values.items():\n", "    new_v = numpy.zeros(len(xdict))\n", "    for x, h in v:\n", "        new_v[xdict[int(10*x)]] = h\n", "    values[k] = new_v\n", "len(xdict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we scale everything using the maximum, and then we can plot." ] }, { "cell_type": "code", "execution_count": 338, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "mult = max_from_excel / values['nowcast'].max()\n", "\n", "for k, v in values.items():\n", "    v *= mult  # scales in place, so run this cell only once\n", "    pyplot.plot(values[k], label=k)\n", "\n", "pyplot.legend()\n", "pyplot.savefig(fn_output+'.svg')\n", "pyplot.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we build a DataFrame. We assume that the chart and the nowcasting table end on the same day." ] }, { "cell_type": "code", "execution_count": 333, "metadata": {}, "outputs": [], "source": [ "end_date = nowcasting_excel.index.max()\n", "idx = pandas.date_range(end_date - pandas.Timedelta(len(values['nowcast'])-1, unit='D'), end_date, name='Refdatum')\n", "df = pandas.DataFrame(values, index=idx)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us also compare with the Excel numbers. I get deviations of 0 to 4." ] }, { "cell_type": "code", "execution_count": 334, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 334, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pyplot.plot((nowcasting_excel - df.nowcast).dropna().abs())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If everything looks good, we are done. We write the results to a CSV file." ] }, { "cell_type": "code", "execution_count": 336, "metadata": {}, "outputs": [], "source": [ "df.to_csv(fn_output)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3rc1" } }, "nbformat": 4, "nbformat_minor": 2 }