{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Demo 2 - Objekterkennung (NER = Named Entity Recognition)\n",
    "\n",
    "Jupyter Notebooks enthalten Text- oder Codeblöcke. Codeblöcke werden durch SHIFT-ENTER ausgeführt (Dreiecksymbol oben rechts auf Mobil).\n",
    "\n",
    "Solange ein Block ausgeführt wird, wird nicht mehr die Nr [3] des Blocks, sondern [*] angezeigt.\n",
    "\n",
    "**<a href=\"./Demo1-Distanz.ipynb\" target=\"_blank\">Link zu Demo 1</a>**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Initialisierung"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# NER: https://spacy.io/usage/linguistic-features#named-entities\n",
    "# Compatible with: spaCy v2.0.0+\n",
    "\n",
    "# Standard libs\n",
    "from __future__ import unicode_literals, print_function\n",
    "import random\n",
    "from pathlib import Path\n",
    "\n",
    "# spacy als Standard für Textverarbeitung\n",
    "import spacy\n",
    "from spacy.util import minibatch, compounding\n",
    "from spacy import displacy\n",
    "\n",
    "# Check, welche Version genutzt wird. Ändert sich häufig mit starken Anpassungen.\n",
    "print(spacy.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Objekterkennung Klickibunti"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Laden des deutschen Sprachmodells\n",
    "# https://spacy.io/models/de#de_core_news_sm\n",
    "nlp = spacy.load('de_core_news_sm')\n",
    "\n",
    "# Einlesen eines Satzes als Unicode-String\n",
    "doc = nlp(u'Die Bundesregierung wechselte ihren Sitz von Bonn nach Berlin.')\n",
    "\n",
    "# Ausgabe der erkannten Objekte (Start, Ende) und Art\n",
    "for ent in doc.ents:\n",
    "    print(ent.text, ent.start_char, ent.end_char, ent.label_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Und in Farbe :-)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualisierung der erkannten Textbestandteile\n",
    "displacy.render(doc, style='ent', jupyter=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualisierung der Satzstruktur (Dependency Model)\n",
    "displacy.render(doc, style='dep', jupyter=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "doc = nlp(u'Theresa Mays Plan B könnte erneut Plan A sein, mit dem die Premierministerin in der vergangenen Woche so krachend im Unterhaus gescheitert ist: das mit Brüssel ausgehandelte Austrittsabkommen. Sie könnte jedoch eine wichtige Änderung vorschlagen: May will, nach Berichten britischer Medien, den Backstopp, die Auffanglösung für die Grenze auf der irischen Insel, noch einmal mit der EU nachverhandeln.')\n",
    "\n",
    "displacy.render(doc, style='ent', jupyter=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Hinzufügen von Objekten\n",
    "\n",
    "Das deutsche Sprachmodell von spacy ist von 2017 und basiert auf Nachrichtenartikeln mit hoher Häufigkeit. \n",
    "\n",
    "Wenn bestimmte Fachbegriffe hinzugfügt werden sollen, muss das Modell nachtrainiert werden."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Hinzufügen der Objekte Bundesregierung (ORG) und London/München (LOC)\n",
    "TRAIN_DATA = [\n",
    "    (\"Die Bundesregierung wechselte ihren Sitz von Bonn nach Berlin.\", {\"entities\": [(4, 19, \"ORG\"), \n",
    "                                                                                     (45, 49, \"LOC\"),\n",
    "                                                                                     (55, 61, \"LOC\")]}),\n",
    "    (\"London, Muenchen\", {\"entities\": [(1, 7, \"LOC\"), (9, 17, \"LOC\")]}),\n",
    "]\n",
    "\n",
    "# Typische Anzahl für Lerniterationen n = 100, hier \"nur\" 10 fürs schnelle Durchlaufen\n",
    "def main(model=None, output_dir=\"./spacy_modell/\", n_iter=10):\n",
    "    # NER Pipeline laden    \n",
    "    ner = nlp.get_pipe(\"ner\")\n",
    "    # Datenvorbereitung\n",
    "    for _, annotations in TRAIN_DATA:\n",
    "        for ent in annotations.get(\"entities\"):\n",
    "            ner.add_label(ent[2])\n",
    "    # Andere Pipelines fürs Lernen abschalten\n",
    "    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != \"ner\"]\n",
    "    with nlp.disable_pipes(*other_pipes):  # only train NER\n",
    "        # Gewichte des neuen Modells setzen\n",
    "        if model is None:\n",
    "            nlp.begin_training()\n",
    "        for itn in range(n_iter):\n",
    "            random.shuffle(TRAIN_DATA)\n",
    "            losses = {}\n",
    "            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))\n",
    "            for batch in batches:\n",
    "                texts, annotations = zip(*batch)\n",
    "                nlp.update(\n",
    "                    texts,  \n",
    "                    annotations,  \n",
    "                    drop=0.5,  \n",
    "                    losses=losses,\n",
    "                )\n",
    "            print(\"Losses\", losses)\n",
    "\n",
    "    # Test der trainierten Daten\n",
    "    for text, _ in TRAIN_DATA:\n",
    "        doc = nlp(text)\n",
    "        print(\"Entities\", [(ent.text, ent.label_) for ent in doc.ents])\n",
    "        print(\"Tokens\", [(t.text, t.ent_type_, t.ent_iob) for t in doc])\n",
    "\n",
    "    # Modell speichern\n",
    "    nlp.to_disk(Path(output_dir))\n",
    "\n",
    "# Methode aufrufen\n",
    "main()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nlp = spacy.load(\"./spacy_modell/\")\n",
    "doc = nlp(u'Hannover ist die schönste Stadt in Deutschland.')\n",
    "\n",
    "for ent in doc.ents:\n",
    "    print(ent.text, ent.start_char, ent.end_char, ent.label_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "displacy.render(doc, style='ent', jupyter=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "doc = nlp(u'Die Bundesregierung wechselte ihren Sitz von Bonn nach Berlin.')\n",
    "displacy.render(doc, style='ent', jupyter=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Probleme beim Re-Training\n",
    "\n",
    "Idealerweise wird das komplette Netz mit den neuen Informationen komplett neu trainiert, ansonsten kann das Netz einige Informationen \"vergessen\".\n",
    "\n",
    "Mehr dazu: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting\n",
    "\n",
    "Volles Trainig deutlich mehr Zeit!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "options = {'compact': True, 'color': 'blue'}\n",
    "displacy.render(doc, style='dep', jupyter=True, options=options)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}