{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#
Creating a Rules-Based Pipeline for Holocaust Documents
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Dr. W.J.B. Mattingly
\n", "\n", "
Smithsonian Data Science Lab and United States Holocaust Memorial Museum
\n", "\n", "
January 2021
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Key Concepts in this Notebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1) How to add pipes to a spaCy model
\n", "2) How to Consider Rules
\n", "3) How to Implement those Rules
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we will walk through some of the heuristic pipes I developed or am developing for my Holocaust NER spaCy pipeline. The purpose of these heuristic pipes is not to catch all potential entities, but to return with a high degree of confidence only true positives. We accept that the heuristics won't catch everything because the final item in this long pipeline is a machine learning NER model that will generalize, or make predictions, on the unseen data. By structuring many heuristics in the pipeline, we can radically reduce the chances of our ML model making a wrong prediction because the all known true positives will have already been annotated.\n", "\n", "I should also note that the code in my pipes is repetitious by design. As this is a textbook, it would be difficult for the reader to consistently look at the top of the notebook to identify the variable set 20 cells earlier. For this reason, I opt to recreate the variables later in the notebook. While this is bad practice, it allows the reader to understand better what is happening at any given moment in the notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a Blank spaCy Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first thing we need to do is import all of the different components from spaCy and other libraries that we will need. I will explain these as we go forward." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Enabling eager execution\n", "INFO:tensorflow:Enabling v2 tensorshape\n", "INFO:tensorflow:Enabling resource variables\n", "INFO:tensorflow:Enabling tensor equality\n", "INFO:tensorflow:Enabling control flow v2\n" ] } ], "source": [ "import spacy\n", "from spacy.util import filter_spans\n", "from spacy.tokens import Span\n", "from spacy.language import Language\n", "import re\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will be using Pandas in this notebook to run data checks in a CSV file originally produced by the Holocaust Geographies Collaborative, headed by Anne Knowles, Tim Cole, Alberto Giordano, Paul Jaskot, and Anika Walke. We will be importing RegEx because a lot of our heuristics will rely on capturing multi-word tokens.\n", "\n", "Now that we have imported everything, let's create a blank English pipeline in spaCy. As we work through this notebook, we will add pipes to it." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages\\spacy\\util.py:730: UserWarning: [W095] Model 'en_core_web_trf' (3.0.0) was trained with spaCy v3.0 and may not be 100% compatible with the current version (3.1.1). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. 
For more details and available updates, run: python -m spacy validate\n", " warnings.warn(warn_msg)\n" ] } ], "source": [ "nlp = spacy.load(\"en_core_web_trf\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Add Pipe for Finding Streets" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "streets_pattern = r\"([A-Z][a-z]*(strasse|straße|straat)\\b|([A-Z][a-z]* (Street|St|Boulevard|Blvd|Avenue|Ave|Road|Rd|Lane|Ln|Place|Pl)(\\.)*))\"\n", "@Language.component(\"find_streets\")\n", "def find_streets(doc):\n", " text = doc.text\n", " camp_ents = []\n", " original_ents = list(doc.ents)\n", " for match in re.finditer(streets_pattern, doc.text):\n", " start, end = match.span()\n", " span = doc.char_span(start, end)\n", " if span is not None:\n", " camp_ents.append((span.start, span.end, span.text))\n", " for ent in camp_ents:\n", " start, end, name = ent\n", " per_ent = Span(doc, start, end, label=\"STREET\")\n", " original_ents.append(per_ent)\n", " filtered = filter_spans(original_ents)\n", " doc.ents = filtered\n", " return (doc)\n", "nlp.add_pipe(\"find_streets\", before=\"ner\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a Pipe for Finding Ships" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The very first pipe I want to create is a pipe that can find ships, specifically transport ships. In order to achieve this objective, this pipe does several things at once. First, it leverages RegEx to find known patterns that are always ships, e.g. multi-word tokens that may or may not begin with \"S.S.\", \"SS\", or \"The\", followed by a list of known ships. I have opted to not use a spaCy EntityRuler here primarily because I want to expand this pipe in the future and it allows me to find matches that are more varied. Were I to implement this in an EntityRuler pipe, I would need to have many patterns sit in its knowledge-base.\n", "\n", "But finding one of these patterns isn't enough. I want to ensure that the thing referenced is in fact a ship. Many of these terms could easily be toponyms, or entities that share the same spelling but mean different things in different contexts, e.g. the Ile de France, could easily be a GPE that refers to the area around Paris. General Hosey could easily be a PERSON. To ensure toponym disambiguation, I set up several contextual clues, e.g. the list of nautical terms. If any of these words appear in area around the hit, then the heuristics assign that token the label of SHIP. If not, it ignores it and allows later pipes or the machine learning model to annotate it." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ships_pattern = r\"((S.S. |SS |The )*(Lieutenant Colonel James Barker|General Hosey|Pan Crescent|Marilyn Marlene|Winnipeg|Ile de France|Scythia|Aquitania|Empress of Britain|General A. W. Greely|General J. H. McRae|Empress of Scotland|General T. H. Bliss|New Amsterdam|Niagara|Henry Gibbs|Serpa Pinto|Mauretania|Cabo de Hornos|Julius Caesar|Ben Hecht|Sțrumah|Strumah|General Harry Taylor|General W.P. Richardson|Marine Jumper|Simon Bolivar|Pan York|Mauretania|Orduña|Wilhelm Gustloff|Orduna|General W.H. Gordon|Rakuyō Maru|Rakuyo Maru|Mouzinho|Saturnia|St. 
Louis|Saint Louis|Nyassa|Simon Bolivar|Queen Elizabeth|Exodus 1947|Dunera|Cap Arcona|Ernie Pyle|Hayim Arlozorov|Patria))\"\n", "@Language.component(\"find_ships\")\n", "def find_ships(doc):\n", " text = doc.text\n", " new_ents = []\n", " original_ents = list(doc.ents)\n", " nautical = [\"ship\", \"boat\", \"sail\", \"captain\", \"sea\", \"harbor\", \"aboard\", \"admiral\", \"liner\"]\n", " for match in re.finditer(ships_pattern, doc.text):\n", " start, end = match.span()\n", " span = doc.char_span(start, end)\n", " context = text[start-100:end+100]\n", " if any(term in context.lower() for term in nautical):\n", " if span is not None:\n", " new_ents.append((span.start, span.end, span.text))\n", " else:\n", " span = doc.char_span(start, end-1)\n", " if span is not None:\n", " new_ents.append((span.start, span.end-1, span.text))\n", " for ent in new_ents:\n", " start, end, name = ent\n", " per_ent = Span(doc, start, end, label=\"SHIP\")\n", " original_ents.append(per_ent)\n", " filtered = filter_spans(original_ents)\n", " doc.ents = filtered\n", " return (doc)\n", "nlp.add_pipe(\"find_ships\", before=\"ner\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look and load in some data to see how this pipe functions. Remember, our goal is not to capture all ships, just ensure the ones we captured are true positives." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# import glob\n", "# hits = []\n", "# files = glob.glob(\"data/new_ocr/*trs_en.txt\")\n", "# all_data = {}\n", "# for file in files[:30]:\n", "# all_hits = []\n", "# with open (file, \"r\", encoding=\"utf-8\") as f:\n", "# print (file)\n", "# text = f.read()\n", "# doc = nlp(text)\n", "# for ent in doc.ents:\n", "# if ent.label_ == \"SHIP\":\n", "# print (ent.text, ent.label_)\n", "# print (text[ent.start_char-100:ent.end_char+100])\n", "# print ()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Success! File, data/new_ocr\\RG-50.030.0006_trs_en.txt, referenced two ships, the Cap Arcona and the St. Louis. Now that we know we can capture ships in this manner, let's try out the same principle on ghettos." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a Pipe for Finding Ghettos" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As of the time of pushing this notebook, August 13, 2021, I have yet to receive a comprehensive dataset of ghettos. In the near future, this pipe will be vastly improved, similar to the concentration camp pipe below. For now, this pipe functions precisely the same way our earlier ship pipe function. Here, we're looking for the use of the word ghetto around one of these known cities that had ghettos. This is absolutely necessary because any of these cities could be a GPE in general. The use of the word ghetto within a small contextual window is a good heuristic for assigning this city the label of GHETTO over GPE." 
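] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before running the full pipe below, it may help to see the core mechanic in isolation. The next cell is a minimal, self-contained sketch (a toy sentence, not part of the pipeline) of how a regex hit is mapped onto a token span with doc.char_span and how a small window of surrounding characters decides between the GHETTO and GPE labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Toy illustration of the context-window heuristic used in find_ghettos below.\n", "toy_nlp = spacy.blank(\"en\")\n", "toy_doc = toy_nlp(\"They were forced into the Warsaw ghetto in 1940.\")\n", "toy_text = toy_doc.text\n", "for match in re.finditer(r\"Warsaw\", toy_text):\n", "    start, end = match.span()\n", "    # map the character offsets from the regex onto spaCy tokens\n", "    span = toy_doc.char_span(start, end)\n", "    # check a small window of raw text around the hit for the word ghetto\n", "    context = toy_text[max(0, start-25):end+25]\n", "    label = \"GHETTO\" if \"ghetto\" in context.lower() else \"GPE\"\n", "    print(span.text, label)"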
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ghetto_pattern = r\"(Anykščiai|Anyksciai|Arad|Ashmiany|Babruĭsk|Babruisk|Balassagyarmat|Baranavichy|Barysaŭ|Barysau|Będzin|Bedzin|Bełżyce|Belzyce|Berdychiv|Berehove|Berestechko|Berezdiv|Berezhany|Berezne|Bershad'|Biała Podlaska|Biala Podlaska|Birkenau|Biała Rawska|Białystok|Bialystok|Biaroza|Bibrka|Bielsko-Biała|Biržai|Bitola|Blazhiv|Bobowa|Bochnia|Bolekhiv|Borshchuv|Boryslav|Boskovice|Brańsk|Bratslav|Brody|Brzesko|Buczacz|Budapest|Bus'k|Bychawa|Chashniki|Chrzanów|Chrzanow|Ciechanów|Ciechanow|Cieszanów|Cristuru Secuiesc|Czernowitz|Częstochowa|Czortków|Dąbrowa Górnicza|Dąbrowa Tarnowska|Damashėvichy|Daugavpils|Dokshytsy|Dombóvár|Dombrowa|Drohobycz|Drzewica|Dubrovytsia|Dzialoszyce|Dziarechyn|Dziatlava|Glebokie|Gol'shany|Góra Kalwaria|Gorodnaia|Gostynin|Gyöngyös|Hajdúszoboszló|Halushchyntsi|Halych|Hantsavichy|Haradnaia|Hatvan|Hlusk|Hlyniany|Homel'|Horodenka|Horokhiv|Hradzianka|Hrodna|Hvizdets'|Iaktoriv|Izbica Lubelska|Józefów|Kalisz|Kałuszyn|Kam'iane Pole|Kamin'-Kashyrs'kyĭ|Katowice|Kecskemét|Kelme|Kharkiv|Khmel'nyts'ka oblast'|Khmel'nyts'kyĭ|Khust|Kielce|Kisvárda|Kletsk|Kobryn|Kolbuszowa|Kolozsvár|Komarów-Osada|Kopychyntsi|Korets'|Košice|Kőszeg|Kovel'|Kozienice|Kraków|Kraśnik|Kretinga|Krośniewice|Krymne|Kryzhopil'|Kul'chyny|Kunhegyes|Kutno|Kysylyn|Ladyzhyn|Lakhva|Lask|Lęczyca|Lesko|Lida|Liepāja|Lipinki|Lithakia|Litin|Litzmannstadt|Liubavichi|Łomża|Lubaczów|Lubartów|Lublin|Łuck|Lwów|Lyubcha|Mahiliou|Maków Mazowiecki|Marcinkonys|Matejovce nad Hornádom|Mátészalka|Miechów|Międzyrzec Podlaski|Minsk|Mir|Miskolc|Modliborzyce|Mogilev|Monastyrok|Monor|Munkács|Nadvirna|Nagyvárad|Navahrudak|Novomyrhorod|Nowy Sącz|Nyíregyháza|Odessa|Oleyëvo-Korolëvka|Opatów|Opoczno|Opole|Opole Lubelskie|Orla|Orsha|Ostroh|Ostrowiec Świętokrzyski|Otwock|Ozarintsy|Ozorków|Pabianice|Papul|Parichi|Pechera|Pinsk|Piotrków Trybunalski|Płaszów|Płock|Plońsk|Praszka|Prienai|Prużana|Pruzhany|Przemyśl|Pułtusk|Radom|Radomyśl Wielki|Radun'|Rava-Rus'ka|Rawa Mazowiecka|Reghin|Ribnița|Riga|Rohatyn|Romanove Selo|Rozhyshche|Rudky|Rudnik nad Sanem|Rzeszów|Saharna|Šahy|Salgótarján|Sarny|Sátoraljaújhely|Schwientochlowitz|Senkevychivka|Sernyky|Sharhorod|Shchyrets'|Shepetivka|Shpola|Shumilino|Šiauliai|Siedlce|Siedliszcze|Sieradz|Sighetu Marmației|Skalat|Slobodka|Slonim|Slutsk|Smolensk|Sokołów Podlaski|Sokyrnytsia|Solotvyno|Soroca|Sosnowiec|Stalovichy|Stanislav|Stara Mohylʹnytsia|Starachowice|Starokostiantyniv|Stary Sącz|Stepan'|Stoczek Lukowski|Stolbëisy|Stolin|Sucha|Suchowola|Surazh|Švenčionys|Szarvas|Szczebrzeszyn|Szeged|Szolnok|Tarnogród|Tarnów|Telšiai|Terebovlia|Ternopol|Theresienstadt|Thessalonike|Timkovichi|Tlumach|Tolna|Tomaszów Mazowiecki|Torchyn|Trakai|Trebíč|Trnava|Tul'chyn|Tuliszków|Tyvriv|Uzda|Uzhhorod|Vác|Valozhyn|Velizh|Velykyĭ Bereznyĭ|Vilna|Vinnytsia|Vlonia|Volodymyr-Volyns'kyi|Vysokovskiy Rayon|Warka|Warsaw|Wisznice|Wrocław|Žagarė|Zamość|Zarichne|Zboriv|Zduńska Wola|Zhmerinka|Zhytomyr|Žiežmariai|Anyksciai|Arad|Ashmiany|Babruisk|Balassagyarmat|Baranavichy|Barysau|Bedzin|Bełzyce|Berdychiv|Berehove|Berestechko|Berezdiv|Berezhany|Berezne|Bershad'|Biała Podlaska|Biała Rawska|Białystok|Biaroza|Bibrka|Bielsko-Biała|Birzai|Bitola|Blazhiv|Bobowa|Bochnia|Bolekhiv|Borshchuv|Boryslav|Boskovice|Bransk|Bratslav|Brody|Brzesko|Buczacz|Budapest|Bus'k|Bychawa|Chashniki|Chrzanow|Ciechanow|Cieszanow|Cristuru 
Secuiesc|Czernowitz|Czestochowa|Czortkow|Dabrowa Gornicza|Dabrowa Tarnowska|Damashevichy|Daugavpils|Dokshytsy|Dombovar|Dombrowa|Drohobycz|Drzewica|Dubrovytsia|Dzialoszyce|Dziarechyn|Dziatlava|Glebokie|Gol'shany|Gora Kalwaria|Gorodnaia|Gostynin|Gyongyos|Hajduszoboszlo|Halushchyntsi|Halych|Hantsavichy|Haradnaia|Hatvan|Hlusk|Hlyniany|Homel'|Horodenka|Horokhiv|Hradzianka|Hrodna|Hvizdets'|Iaktoriv|Izbica Lubelska|Jozefow|Kalisz|Kałuszyn|Kam'iane Pole|Kamin'-Kashyrs'kyi|Katowice|Kecskemet|Kelme|Kharkiv|Khmel'nyts'ka oblast'|Khmel'nyts'kyi|Khust|Kielce|Kisvarda|Kletsk|Kobryn|Kolbuszowa|Kolozsvar|Komarow-Osada|Kopychyntsi|Korets'|Kosice|Koszeg|Kovel'|Kozienice|Krakow|Krasnik|Kretinga|Krosniewice|Krymne|Kryzhopil'|Kul'chyny|Kunhegyes|Kutno|Kysylyn|Ladyzhyn|Lakhva|Lask|Leczyca|Lesko|Lida|Liepaja|Lipinki|Lithakia|Litin|Litzmannstadt|Liubavichi|Łomza|Lubaczow|Lubartow|Lublin|Łuck|Lwow|Lyubcha|Mahiliou|Makow Mazowiecki|Marcinkonys|Matejovce nad Hornadom|Mateszalka|Miechow|Miedzyrzec Podlaski|Minsk|Mir|Miskolc|Modliborzyce|Mogilev|Monastyrok|Monor|Munkacs|Nadvirna|Nagyvarad|Navahrudak|Novomyrhorod|Nowy Sacz|Nyiregyhaza|Odessa|Oleyevo-Korolevka|Opatow|Opoczno|Opole|Opole Lubelskie|Orla|Orsha|Ostroh|Ostrowiec Swietokrzyski|Otwock|Ozarintsy|Ozorkow|Pabianice|Papul|Parichi|Pechera|Pinsk|Piotrkow Trybunalski|Płaszow|Płock|Plonsk|Praszka|Prienai|Pruzana|Pruzhany|Przemysl|Pułtusk|Radom|Radomysl Wielki|Radun'|Rava-Rus'ka|Rawa Mazowiecka|Reghin|Ribnita|Riga|Rohatyn|Romanove Selo|Rozhyshche|Rudky|Rudnik nad Sanem|Rzeszow|Saharna|Sahy|Salgotarjan|Sarny|Satoraljaujhely|Senkevychivka|Sernyky|Sharhorod|Shchyrets'|Shepetivka|Shpola|Shumilino|Siauliai|Siedlce|Siedliszcze|Sieradz|Sighetu Marmatiei|Skalat|Slobodka|Slonim|Slutsk|Smolensk|Sokołow Podlaski|Sokyrnytsia|Solotvyno|Soroca|Sosnowiec|Stalovichy|Stanislav|Stara Mohylʹnytsia|Starachowice|Starokostiantyniv|Stary Sacz|Stepan'|Stoczek Lukowski|Stolbeisy|Stolin|Sucha|Suchowola|Surazh|Svencionys|Szarvas|Szczebrzeszyn|Szeged|Szolnok|Tarnogrod|Tarnow|Telsiai|Terebovlia|Ternopol|Theresienstadt|Thessalonike|Timkovichi|Tlumach|Tolna|Tomaszow Mazowiecki|Torchyn|Trakai|Trebic|Trnava|Tul'chyn|Tuliszkow|Tyvriv|Uzda|Uzhhorod|Vac|Valozhyn|Velizh|Velykyi Bereznyi|Vilna|Vinnytsia|Vlonia|Volodymyr-Volyns'kyi|Vysokovskiy Rayon|Warka|Warsaw|Wisznice|Wrocław|Zagare|Zamosc|Zarichne|Zboriv|Zdunska Wola|Zhmerinka|Zhytomyr|Ziezmariai)\"\n", "@Language.component(\"find_ghettos\")\n", "def find_ghettos(doc):\n", " text = doc.text\n", " ghetto_ents = []\n", " gpe_ents = []\n", " original_ents = list(doc.ents)\n", " for match in re.finditer(ghetto_pattern, doc.text):\n", " start, end = match.span()\n", " span = doc.char_span(start, end)\n", " context = text[start-25:end+25]\n", " if \"ghetto\" in context.lower():\n", " if span is not None:\n", " ghetto_ents.append((span.start, span.end, span.text))\n", " \n", " else:\n", " if span is not None:\n", " gpe_ents.append((span.start, span.end, span.text))\n", " for ent in ghetto_ents:\n", " start, end, name = ent\n", " per_ent = Span(doc, start, end, label=\"GHETTO\")\n", " original_ents.append(per_ent)\n", " for ent in gpe_ents:\n", " start, end, name = ent\n", " per_ent = Span(doc, start, end, label=\"GPE\")\n", " original_ents.append(per_ent)\n", " filtered = filter_spans(original_ents)\n", " doc.ents = filtered\n", " return (doc)\n", "nlp.add_pipe(\"find_ghettos\", before=\"ner\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "second_ghettos_pattern = r\"[A-Z]\\w+((-| )*[A-Z]\\w+)* (g|G)hetto\"\n", "@Language.component(\"find_ghettos2\")\n", "def find_ghettos2(doc):\n", " fps = [\"That\", \"The\"]\n", " text = doc.text\n", " camp_ents = []\n", " original_ents = list(doc.ents)\n", " for match in re.finditer(second_ghettos_pattern, doc.text):\n", " start, end = match.span()\n", " span = doc.char_span(start, end-7)\n", " if span is not None and span.text not in fps:\n", " if \"The \" in span.text:\n", " camp_ents.append((span.start+1, span.end, span.text))\n", " else:\n", " camp_ents.append((span.start, span.end, span.text))\n", " for ent in camp_ents:\n", " start, end, name = ent\n", " per_ent = Span(doc, start, end, label=\"GHETTO\")\n", " original_ents.append(per_ent)\n", " filtered = filter_spans(original_ents)\n", " doc.ents = filtered\n", " return (doc)\n", "nlp.add_pipe(\"find_ghettos2\", before=\"ner\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# import glob\n", "# hits = []\n", "# files = glob.glob(\"data/new_ocr/*trs_en.txt\")\n", "# all_data = {}\n", "# for file in files[:5]:\n", "# all_hits = []\n", "# with open (file, \"r\", encoding=\"utf-8\") as f:\n", "# print (file)\n", "# text = f.read()\n", "# doc = nlp(text)\n", "# for ent in doc.ents:\n", "# if ent.label_ == \"GHETTO\":\n", "# print (ent.text, ent.label_)\n", "# print (text[ent.start_char-100:ent.end_char+100])\n", "# print ()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the pipe is working. We've grabbed three instances of Warsaw in data/new_ocr\\RG-50.030.0001_trs_en.txt." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Pipe for Identifying a Person" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For PERSON, we will be leveraging spaCy's small model, but we can add some heuristics that will greatly improve the results. The heuristics here is any known salutation capitalized followed by a series of proper nouns. This RegEx \"(?:[A-Z]\\w+[ -]?)+)\" allows us to grab all continuous capital words and then break when it encounters a non capital letter followed by a space. In English, these will always be PERSON entities." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "people_pattern = r\"((((Mr|Mrs|Miss|Dr|Col|Adm|Lt|Cap|Cpt|Fr|Cl|Cln|Sgt)\\.)|(Frau|Herr|President|Rabbi|Queen|Prince|Princess|Pope|Father|Bishop|King|Cardinal|General|Liutenant|Colonel|Lieutenant Colonel|Private|Admiral|Captain|Sergeant|Sergeant First Class|Staff Sergeant|Sergeant Major|Corp Sergeant Major|Field Sergeant|Technical Sergeant|Corporal|Lance Corporal|Ensign|2nd Lieutenant|1st Lieutenant|Major|Hauptmann|Staff Captain|Oberst|Oberstlieutenant)) (?:[A-Z]\\w+[ -]?)+)(the [A-Z]\\w*|I\\w*|X\\w*|v\\w*)*\"\n", "@Language.component(\"find_people\")\n", "def find_people(doc):\n", " text = doc.text\n", " match_ents = []\n", " original_ents = list(doc.ents)\n", " for match in re.finditer(people_pattern, doc.text):\n", " start, end = match.span()\n", " span = doc.char_span(start, end)\n", " if span is not None:\n", " match_ents.append((span.start, span.end, span.text))\n", " \n", " else:\n", " span = doc.char_span(start, end-1)\n", " if span is not None:\n", " match_ents.append((span.start, span.end, span.text))\n", "\n", " for ent in match_ents:\n", " start, end, name = ent\n", " per_ent = Span(doc, start, end, label=\"PERSON\")\n", " original_ents.append(per_ent)\n", " filtered = filter_spans(original_ents)\n", " doc.ents = filtered\n", " return (doc)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp.add_pipe(\"find_people\", before=\"ner\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Pipe for Identifying Spouses" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often times in historical documents the identity of people are referenced collectively. In some instances, such as those of spouses, this results in the name of the woman being attached to the name of her husband. The purpose of this SPOUSAL entity is to identify such constructs so that users can manipulate the output and reconstruct each individual singularly." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "spousal_pattern = r\"((Mr|Mrs|Miss|Dr)(\\.)* and (Mr|Mrs|Miss|Dr)(\\.)* (?:[A-Z]\\w+[ -]?)+)\"\n", "@Language.component(\"find_spousal\")\n", "def find_spousal(doc):\n", " text = doc.text\n", " new_ents = []\n", " original_ents = list(doc.ents)\n", " for match in re.finditer(spousal_pattern, doc.text):\n", " start, end = match.span()\n", " span = doc.char_span(start, end)\n", " if span is not None:\n", " new_ents.append((span.start, span.end, span.text))\n", " else:\n", " span = doc.char_span(start, end-1)\n", " if span is not None:\n", " new_ents.append((span.start, span.end-1, span.text))\n", " for ent in new_ents:\n", " start, end, name = ent\n", " per_ent = Span(doc, start, end, label=\"SPOUSAL\")\n", " original_ents.append(per_ent)\n", " filtered = filter_spans(original_ents)\n", " doc.ents = filtered\n", " return (doc)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp.add_pipe(\"find_spousal\", before=\"ner\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Concentration Camp Pipe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The correct identification of camp is one of the most important pipes in this pipeline. There are two concentration camp pipes. The first pipe looks at all known camps and subcamps and then looks for surrounding words to identify the context. The second pipe is less strict. It looks for all known main concentration camps without context. The reason for this is because sometimes the subcamps have the same names as frequently cited cities, e.g. Berlin or Neustadt. This is particularly true of the subcamps. The main camps, however, are frequently referenced to the camp itself. Both pipes are activated by default, but a user can deactivate one or other.\n", "\n", "Throughout this pipe, you will see many functions that contain the name \"getter\". These are custom functions that allow us to add special attributes to our entity spans. If you scroll down to the bottom of this section, you will see that we can use the HGC dataset for conentration camps to retrieve other salient information, such as the subcamp's main camp, its longitude and latitude, opening date, closing date, etc." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def subcamp_getter(hit):\n", " hit = hit.text\n", " df = pd.read_csv(\"data/hgc_data.csv\")\n", " subcamps = df.Main.tolist()\n", " camps = df.SubcampMattingly.tolist()\n", " i=0\n", " potential = []\n", " for c in camps:\n", " \n", " try:\n", " all_c = c.split(\"^\")\n", " for c in all_c:\n", " c = c.replace(\"\\(\", \"(\").replace(\"\\)\", \")\")\n", "# if c == \"Buna-Monowitz (Auschwitz III)\":\n", "# print (c)\n", " if hit.strip() == c.strip():\n", "# print (hit, c)\n", " if subcamps[i] not in potential:\n", " potential.append(subcamps[i])\n", " except:\n", " AttributeError\n", " i=i+1\n", " return (potential)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def date_open_getter(hit):\n", " hit = hit.text\n", " df = pd.read_csv(\"data/hgc_data.csv\")\n", " dates = df.Date_Open.tolist()\n", " camps = df.SubcampMattingly.tolist()\n", " i=0\n", " potential = []\n", " for c in camps:\n", " \n", " try:\n", " all_c = c.split(\"^\")\n", " for c in all_c:\n", " if hit == c:\n", " if dates[i] not in potential:\n", " potential.append(dates[i])\n", " except:\n", " AttributeError\n", " i=i+1\n", " return (potential)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def date_closed_getter(hit):\n", " hit = hit.text\n", " df = pd.read_csv(\"data/hgc_data.csv\")\n", " dates = df.Date_Close.tolist()\n", " camps = df.SubcampMattingly.tolist()\n", " i=0\n", " potential = []\n", " for c in camps:\n", " try:\n", " all_c = c.split(\"^\")\n", " for c in all_c:\n", " if hit == c:\n", " if dates[i] not in potential:\n", " potential.append(dates[i])\n", " except:\n", " AttributeError\n", " i=i+1\n", " return (potential)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def latlong_getter(hit):\n", " hit = hit.text\n", " df = pd.read_csv(\"data/hgc_data.csv\")\n", " lats = df.LAT.tolist()\n", " longs = df.LONG.tolist()\n", " camps = df.SubcampMattingly.tolist()\n", " i=0\n", " potential = []\n", " for c in camps:\n", " \n", " try:\n", " all_c = c.split(\"^\")\n", " for c in all_c:\n", " if hit == c:\n", " if lats[i] not in potential:\n", " potential.append((lats[i], longs[i]))\n", " except:\n", " AttributeError\n", " i=i+1\n", " return (potential)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def hgc_id_getter(hit):\n", " hit = hit.text\n", " df = pd.read_csv(\"data/hgc_data.csv\")\n", " ids = df.HGC_ID.tolist()\n", " camps = df.SubcampMattingly.tolist()\n", " i=0\n", " potential = []\n", " for c in camps:\n", " \n", " try:\n", " all_c = c.split(\"^\")\n", " for c in all_c:\n", " if hit == c:\n", " if ids[i] not in potential:\n", " potential.append(ids[i])\n", " except:\n", " AttributeError\n", " i=i+1\n", " return (potential)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def camp_type_getter(hit):\n", " hit = hit.text" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"data/hgc_data.csv\")\n", "camps = df.SubcampMattingly.tolist()\n", "subcamps = df.Main.tolist()\n", "i=0\n", "final_camps = []\n", "for c in camps:\n", " if c != \"nan\" and c != \"FALSE\":\n", " if subcamps[i] != \"nan\" and subcamps[i] != \"FALSE\":\n", " 
try:\n", " if c.split()[0] != \"\":\n", " c=c.replace(\"*\", \"\")\n", " for item in c.split(\"^\"):\n", " final_camps.append(item.replace(\"(\", \"\\(\").replace(\")\", \"\\)\").strip())\n", " except:\n", " AttributeError\n", " i=i+1\n", " \n", "final_camps.sort(key=len, reverse=True)\n", "final_list = \"|\".join(final_camps)\n", "strict_camps_pattern = r\"(\"+final_list+\")\"\n", "# print (strict_camps_pattern)\n", "@Language.component(\"find_camps_strict\")\n", "def find_camps_strict(doc):\n", " text = doc.text\n", " camp_ents = []\n", " original_ents = list(doc.ents)\n", " context_terms = [\"camp\", \"concentration\", \"labor\", \"forced\", \"gas\", \"chamber\"]\n", " for match in re.finditer(strict_camps_pattern, doc.text):\n", "# print (match)\n", " start, end = match.span()\n", " span = doc.char_span(start, end)\n", " context = text[start-100:end+100]\n", " if any(term in context.lower() for term in context_terms):\n", " if span is not None:\n", "# print (span)\n", " camp_ents.append((span.start, span.end, span.text))\n", " for ent in camp_ents:\n", "# print (ent)\n", " start, end, name = ent\n", " per_ent = Span(doc, start, end, label=\"CAMP\")\n", " per_ent.set_extension(\"subcamp\", getter=subcamp_getter, force=True)\n", " per_ent.set_extension(\"date_open\", getter=date_open_getter, force=True)\n", " per_ent.set_extension(\"date_closed\", getter=date_closed_getter, force=True)\n", " per_ent.set_extension(\"latlong\", getter=latlong_getter, force=True)\n", " per_ent.set_extension(\"hgc_id\", getter=hgc_id_getter, force=True)\n", " \n", " original_ents.append(per_ent)\n", " filtered = filter_spans(original_ents)\n", " doc.ents = filtered\n", " return (doc)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "camps_pattern = r\"(Alderney|Amersfoort|Auschwitz|Banjica|Belzec|Bergen-Belsen|Bernburg|Bogdanovka|Bolzano|Bor|Breendonk|Breitenau|Buchenwald|Chelmno|Dachau|Drancy|Falstad|Flossenburg|Fort VII|Fossoli|Grini|Gross-Rosen|Herzogenbusch|Hinzert|Janowska|Jasenovac|Kaiserwald|Kaunas|Kemna|Klooga|Le Vernet|Majdanek|Malchow|Maly Trostenets|Mechelen|Mittelbau-Dora|Natzweiler-Struthof|Neuengamme|Niederhagen|Oberer Kuhberg|Oranienburg|Osthofen|Plaszow|Ravensbruck|Risiera di San Sabba|Sachsenhausen|Sajmište|Salaspils|Sobibor|Soldau|Stutthof|Theresienstadt|Trawniki|Treblinka|Vaivara)(-[A-Z]\\S+)*\"\n", "@Language.component(\"find_camps\")\n", "def find_camps(doc):\n", " text = doc.text\n", " camp_ents = []\n", " original_ents = list(doc.ents)\n", " for match in re.finditer(camps_pattern, doc.text):\n", " start, end = match.span()\n", " span = doc.char_span(start, end)\n", " if span is not None:\n", " camp_ents.append((span.start, span.end, span.text))\n", " for ent in camp_ents:\n", " start, end, name = ent\n", " per_ent = Span(doc, start, end, label=\"CAMP\")\n", " per_ent.set_extension(\"subcamp\", getter=subcamp_getter, force=True)\n", " per_ent.set_extension(\"date_open\", getter=date_open_getter, force=True)\n", " per_ent.set_extension(\"date_closed\", getter=date_closed_getter, force=True)\n", " per_ent.set_extension(\"latlong\", getter=latlong_getter, force=True)\n", " per_ent.set_extension(\"hgc_id\", getter=hgc_id_getter, force=True)\n", " original_ents.append(per_ent)\n", " filtered = filter_spans(original_ents)\n", " doc.ents = filtered\n", " return (doc)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "second_camps_pattern = r\"[A-Z]\\w+((-| )*[A-Z]\\w+)* (c|C)oncentration (c|C)amp\"\n", 
"@Language.component(\"find_camps2\")\n", "def find_camps2(doc):\n", " text = doc.text\n", " camp_ents = []\n", " original_ents = list(doc.ents)\n", " for match in re.finditer(second_camps_pattern, doc.text):\n", " start, end = match.span()\n", " span = doc.char_span(start, end-19)\n", " if span is not None:\n", " camp_ents.append((span.start, span.end, span.text))\n", " for ent in camp_ents:\n", " start, end, name = ent\n", " per_ent = Span(doc, start, end, label=\"CAMP\")\n", " original_ents.append(per_ent)\n", " filtered = filter_spans(original_ents)\n", " doc.ents = filtered\n", " return (doc)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp.add_pipe(\"find_camps_strict\", before=\"ner\")\n", "nlp.add_pipe(\"find_camps\", before=\"ner\")\n", "nlp.add_pipe(\"find_camps2\", before=\"ner\")" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# import glob\n", "# hits = []\n", "# files = glob.glob(\"data/new_ocr/*trs_en.txt\")\n", "# all_data = {}\n", "# for file in files[1:2]:\n", "# all_hits = []\n", "# with open (file, \"r\", encoding=\"utf-8\") as f:\n", "# print (file)\n", "# text = f.read()\n", "# doc = nlp(text)\n", "# for ent in doc.ents:\n", "# if ent.label_ == \"CAMP\":\n", "# print ((ent.text, ent.label_, ent._.subcamp, ent._.date_open, ent._.date_closed, ent._.latlong, ent._.hgc_id))\n", "# print (text[ent.start_char-100:ent.end_char+100])\n", "# print ()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Revolutionary Groups Pipe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The purpose of this pipe is to find known Revolutionary Groups. Again, this pipe is not an EntityRuler because I intend to do a few extra things with it in the future beyond the limitations of the EntityRuler." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "groups_pattern = r\"(Ethnikon Apeleutherotikon Metopon|Weisse Rose|Rote Kapelle|Affiche rouge|Edelweisspiraten|White Rose|Bielski|Nekamah|Voroshilov|OEuvre de secours aux enfants|Union des juifs pour la résistance et l'entraide|Zorin Unit|Komsomolski|Fareynikte|Korzh|Zhukov|Budenny|Parkhomenko|Sixième)((-)*[A-Z]\\S+)*( (Brigade|brothers|group))*\"\n", "@Language.component(\"find_groups\")\n", "def find_groups(doc):\n", " text = doc.text\n", " camp_ents = []\n", " original_ents = list(doc.ents)\n", " for match in re.finditer(groups_pattern, doc.text):\n", " start, end = match.span()\n", " span = doc.char_span(start, end)\n", " if span is not None:\n", " camp_ents.append((span.start, span.end, span.text))\n", " for ent in camp_ents:\n", " start, end, name = ent\n", " per_ent = Span(doc, start, end, label=\"GROUP\")\n", " original_ents.append(per_ent)\n", " filtered = filter_spans(original_ents)\n", " doc.ents = filtered\n", " return (doc)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp.add_pipe(\"find_groups\", before=\"ner\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a Places Pipe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ML model will capture place fairly well. Nevertheless, if you can write a simple rule, write a simple rule." 
] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "city_pattern = r\"(?:[A-Z]\\w+[ -]?)+, (Germany|Poland|England|Russia|Italy|USA|U.S.A.|United States|United States of America|America|United Kingdom|France|Spain|Ukraine|Romania|Netherlands|Belgium|Greece|Portugal|Sweden|Hungary|Austria|Belarus|Serbia|Switzerland|Bulgaria|Denmark|Finland|Slovakia|Norway|Ireland|Croatia|Moldova|Bosnia|Albania|Estonia|Malta|Iceland|Andorra|Luxembourg|Montenegro|Macedonia|San Marino|Lichtenstein|Monaco)\"\n", "country_pattern = r\"(Germany|Poland|England|Russia|Italy|USA|U.S.A.|United States|United States of America|America|United Kingdom|France|Spain|Ukraine|Romania|Netherlands|Belgium|Greece|Portugal|Sweden|Hungary|Austria|Belarus|Serbia|Switzerland|Bulgaria|Denmark|Finland|Slovakia|Norway|Ireland|Croatia|Moldova|Bosnia|Albania|Estonia|Malta|Iceland|Andorra|Luxembourg|Montenegro|Macedonia|San Marino|Lichtenstein|Monaco)\"\n", "@Language.component(\"find_places\")\n", "def find_places(doc):\n", " text = doc.text\n", " new_ents = []\n", " original_ents = list(doc.ents)\n", " for match in re.finditer(city_pattern, doc.text):\n", " start, end = match.span()\n", " span = doc.char_span(start, end)\n", " if span is not None:\n", " new_ents.append((span.start, span.end, span.text))\n", " else:\n", " span = doc.char_span(start, end-1)\n", " if span is not None:\n", " new_ents.append((span.start, span.end-1, span.text))\n", " for ent in new_ents:\n", " start, end, name = ent\n", " per_ent = Span(doc, start, end, label=\"GPE\")\n", " if per_ent.text.split(\",\")[0] not in city_pattern:\n", " original_ents.append(per_ent)\n", " \n", " for match in re.finditer(country_pattern, doc.text):\n", " start, end = match.span()\n", " span = doc.char_span(start, end)\n", " if span is not None:\n", " new_ents.append((span.start, span.end, span.text))\n", " else:\n", " span = doc.char_span(start, end-1)\n", " if span is not None:\n", " new_ents.append((span.start, span.end-1, span.text))\n", " for ent in new_ents:\n", " start, end, name = ent\n", " per_ent = Span(doc, start, end, label=\"GPE\")\n", " original_ents.append(per_ent)\n", " filtered = filter_spans(original_ents)\n", " doc.ents = filtered\n", " return (doc)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp.add_pipe(\"find_places\", before=\"ner\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a Geography Pipe" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "general_pattern = r\"([A-Z]\\w+) (River|Mountain|Mountains|Forest|Forests|Sea|Ocean)*\"\n", "river_pattern = \"(the|The) (Rhone|Volga|Danube|Ural|Dnieper|Don|Pechora|Kama|Oka|Belaya|Dniester|Rhine|Desna|Elbe|Donets|Vistula|Tagus|Daugava|Loire|Tisza|Ebro|Prut|Neman|Sava|Meuse|Kuban River|Douro|Mezen|Oder|Guadiana|Rhône|Kuma|Warta|Seine|Mureș|Northern Dvina|Vychegda|Drava|Po|Guadalquivir|Bolshoy Uzen|Siret|Maly Uzen|Terek|Olt|Vashka|Glomma|Garonne|Usa|Kemijoki|Great Morava|Moselle|Main 525|Torne|Dalälven|Inn|Maritsa|Marne|Neris|Júcar|Dordogne|Saône|Ume|Mur|Ångerman|Klarälven|Lule|Gauja|Weser|Kalix|Vindel 
River|Ljusnan|Indalsälven|Vltava|Ponoy|Ialomița|Onega|Somes|Struma|Adige|Skellefte|Tiber|Vah|Pite|Faxälven|Vardar|Shannon|Charente|Iskar|Tundzha|Ems|Tana|Scheldt|Timiș|Genil|Severn|Morava|Luga|Argeș|Ljungan|Minho|Venta|Thames|Drina|Jiu|Drin|Segura|Torne|Osam|Arda|Yantra|Kamchiya|Mesta)\"\n", "@Language.component(\"find_geography\")\n", "def find_geography(doc):\n", " text = doc.text\n", " river_ents = []\n", " general_ents = []\n", " original_ents = list(doc.ents)\n", " for match in re.finditer(river_pattern, doc.text):\n", " start, end = match.span()\n", " span = doc.char_span(start, end)\n", " if span is not None:\n", " river_ents.append((span.start, span.end, span.text))\n", " for match in re.finditer(general_pattern, doc.text):\n", " start, end = match.span()\n", " span = doc.char_span(start, end)\n", " if span is not None:\n", " general_ents.append((span.start, span.end, span.text)) \n", " \n", "# all_ents = river_ents+general_ents \n", " for ent in river_ents:\n", " start, end, name = ent\n", " per_ent = Span(doc, start, end, label=\"RIVER\")\n", " original_ents.append(per_ent)\n", " \n", " for ent in general_ents:\n", " start, end, name = ent\n", " if \"River\" in name:\n", " per_ent = Span(doc, start, end, label=\"RIVER\")\n", " elif \"Mountain\" in name:\n", " per_ent = Span(doc, start, end, label=\"MOUNTAIN\")\n", " elif \"Sea\" in name:\n", " per_ent = Span(doc, start, end, label=\"SEA-OCEAN\")\n", " elif \"Forest\" in name:\n", " per_ent = Span(doc, start, end, label=\"FOREST\")\n", " original_ents.append(per_ent)\n", " filtered = filter_spans(original_ents)\n", " doc.ents = filtered\n", " return (doc)\n", "nlp.add_pipe(\"find_geography\", before=\"ner\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adding Streets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Seeing the Pipes at Work" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have assembled all these pipes into a pipeline, let's see how they perform on a real testimony." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
This is \n", "\n", " Berlinstrasse\n", " STREET\n", "\n", ", which is also known as \n", "\n", " Berlin Street\n", " STREET\n", "\n", " or \n", "\n", " Berlin St.\n", " STREET\n", "\n", " That Ghetto and \n", "\n", " The Ghetto\n", " LOC\n", "\n", ". The \n", "\n", " Warsaw\n", " GHETTO\n", "\n", " Ghetto
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# import glob\n", "# from spacy import displacy\n", "# hits = []\n", "# files = glob.glob(\"data/new_ocr/*trs_en.txt\")\n", "# all_data = {}\n", "# for file in files[5:6]:\n", "# all_hits = []\n", "# with open (file, \"r\", encoding=\"utf-8\") as f:\n", "# print (file)\n", "# text = f.read().replace(\"(ph)\", \"\")\n", "# while \" \" in text:\n", "# text = text.replace(\" \", \" \")\n", "# doc = nlp(text)\n", "# colors = {\"CAMP\": \"#FF5733\", \"GHETTO\": \"#1F9D12\", \"SHIP\": \"#557DB4\", \"SPOUSAL\": \"#55B489\", \"GPE\":\"#17B4C2\", \"RIVER\": \"#9017C2\", \"MOUNTAIN\": \"#878787\", \"SEA-OCEAN\": \"#0A6DF5\", \"FOREST\": \"#1F541D\"}\n", "# options = {\"ents\": [\"CAMP\", \"GHETTO\", \"SHIP\", \"SPOUSAL\", \"GPE\", \"RIVER\", \"MOUNTAIN\", \"SEA-OCEAN\", \"FOREST\"], \"colors\":colors}\n", "# displacy.render(doc, style=\"ent\", jupyter=True)\n", "doc = nlp(\"This is Berlinstrasse, which is also known as Berlin Street or Berlin St. That Ghetto and The Ghetto. The Warsaw Ghetto\")\n", "from spacy import displacy\n", "displacy.render(doc, style=\"ent\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This result is not perfect, but again, that's not the point here. Here we are less interested in catching all entities as much as not catching any false positives. At a quick glance, we have achieved that. Now that were are tentatively happy, wan save our pipeline to disk, but first, let's add some metadata." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Saving a spaCy Model" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "nlp.meta[\"name\"] = \"ushmm\"\n", "nlp.meta[\"version\"] = '0.0.16'\n", "nlp.meta[\"author\"] = \"W.J.B. Mattingly\"\n", "nlp.meta[\"author_email\"] = \"wjbmattingly@gmail.com\"\n", "nlp.meta[\"description\"] = \"This is a pipeline of heuristics to help identify essential entities for research into the Holocaust.\"\n", "nlp.meta['url'] = \"www.wjbmattingly.com\"\n", "nlp.to_disk(\"models/rules_pipeline\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we are particularly happy with our results, we can even save the file to disk. It is important to note that we need to copy and paste all of our code into a python script that we can inject into the model. I'll cover this in greater detail in the next notebook as we start working on an ML model." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "running sdist\n", "running egg_info\n", "creating en_ushmm.egg-info\n", "writing en_ushmm.egg-info\\PKG-INFO\n", "writing dependency_links to en_ushmm.egg-info\\dependency_links.txt\n", "writing entry points to en_ushmm.egg-info\\entry_points.txt\n", "writing requirements to en_ushmm.egg-info\\requires.txt\n", "writing top-level names to en_ushmm.egg-info\\top_level.txt\n", "writing manifest file 'en_ushmm.egg-info\\SOURCES.txt'\n", "reading manifest file 'en_ushmm.egg-info\\SOURCES.txt'\n", "reading manifest template 'MANIFEST.in'\n", "writing manifest file 'en_ushmm.egg-info\\SOURCES.txt'\n", "running check\n", "creating en_ushmm-0.0.16\n", "creating en_ushmm-0.0.16\\en_ushmm\n", "creating en_ushmm-0.0.16\\en_ushmm.egg-info\n", "creating en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\n", "creating en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\.ipynb_checkpoints\n", "creating en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\attribute_ruler\n", "creating en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\lemmatizer\n", "creating en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\lemmatizer\\lookups\n", "creating en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\ner\n", "creating en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\parser\n", "creating en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\tagger\n", "creating en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\transformer\n", "creating en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\transformer\\model\n", "creating en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\vocab\n", "copying files to en_ushmm-0.0.16...\n", "copying MANIFEST.in -> en_ushmm-0.0.16\n", "copying README.md -> en_ushmm-0.0.16\n", "copying meta.json -> en_ushmm-0.0.16\n", "copying setup.py -> en_ushmm-0.0.16\n", "copying en_ushmm\\__init__.py -> en_ushmm-0.0.16\\en_ushmm\n", "copying en_ushmm\\components.py -> en_ushmm-0.0.16\\en_ushmm\n", "copying en_ushmm\\meta.json -> en_ushmm-0.0.16\\en_ushmm\n", "copying en_ushmm.egg-info\\PKG-INFO -> en_ushmm-0.0.16\\en_ushmm.egg-info\n", "copying en_ushmm.egg-info\\SOURCES.txt -> en_ushmm-0.0.16\\en_ushmm.egg-info\n", "copying en_ushmm.egg-info\\dependency_links.txt -> en_ushmm-0.0.16\\en_ushmm.egg-info\n", "copying en_ushmm.egg-info\\entry_points.txt -> en_ushmm-0.0.16\\en_ushmm.egg-info\n", "copying en_ushmm.egg-info\\not-zip-safe -> en_ushmm-0.0.16\\en_ushmm.egg-info\n", "copying en_ushmm.egg-info\\requires.txt -> en_ushmm-0.0.16\\en_ushmm.egg-info\n", "copying en_ushmm.egg-info\\top_level.txt -> en_ushmm-0.0.16\\en_ushmm.egg-info\n", "copying en_ushmm\\en_ushmm-0.0.16\\README.md -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\n", "copying en_ushmm\\en_ushmm-0.0.16\\config.cfg -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\n", "copying en_ushmm\\en_ushmm-0.0.16\\meta.json -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\n", "copying en_ushmm\\en_ushmm-0.0.16\\tokenizer -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\n", "copying en_ushmm\\en_ushmm-0.0.16\\.ipynb_checkpoints\\meta-checkpoint.json -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\.ipynb_checkpoints\n", "copying en_ushmm\\en_ushmm-0.0.16\\attribute_ruler\\patterns -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\attribute_ruler\n", "copying en_ushmm\\en_ushmm-0.0.16\\lemmatizer\\lookups\\lookups.bin -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\lemmatizer\\lookups\n", "copying en_ushmm\\en_ushmm-0.0.16\\ner\\cfg -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\ner\n", "copying 
en_ushmm\\en_ushmm-0.0.16\\ner\\model -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\ner\n", "copying en_ushmm\\en_ushmm-0.0.16\\ner\\moves -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\ner\n", "copying en_ushmm\\en_ushmm-0.0.16\\parser\\cfg -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\parser\n", "copying en_ushmm\\en_ushmm-0.0.16\\parser\\model -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\parser\n", "copying en_ushmm\\en_ushmm-0.0.16\\parser\\moves -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\parser\n", "copying en_ushmm\\en_ushmm-0.0.16\\tagger\\cfg -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\tagger\n", "copying en_ushmm\\en_ushmm-0.0.16\\tagger\\model -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\tagger\n", "copying en_ushmm\\en_ushmm-0.0.16\\transformer\\cfg -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\transformer\n", "copying en_ushmm\\en_ushmm-0.0.16\\transformer\\model\\config.json -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\transformer\\model\n", "copying en_ushmm\\en_ushmm-0.0.16\\transformer\\model\\merges.txt -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\transformer\\model\n", "copying en_ushmm\\en_ushmm-0.0.16\\transformer\\model\\pytorch_model.bin -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\transformer\\model\n", "copying en_ushmm\\en_ushmm-0.0.16\\transformer\\model\\special_tokens_map.json -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\transformer\\model\n", "copying en_ushmm\\en_ushmm-0.0.16\\transformer\\model\\tokenizer.json -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\transformer\\model\n", "copying en_ushmm\\en_ushmm-0.0.16\\transformer\\model\\tokenizer_config.json -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\transformer\\model\n", "copying en_ushmm\\en_ushmm-0.0.16\\transformer\\model\\vocab.json -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\transformer\\model\n", "copying en_ushmm\\en_ushmm-0.0.16\\vocab\\key2row -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\vocab\n", "copying en_ushmm\\en_ushmm-0.0.16\\vocab\\lookups.bin -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\vocab\n", "copying en_ushmm\\en_ushmm-0.0.16\\vocab\\strings.json -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\vocab\n", "copying en_ushmm\\en_ushmm-0.0.16\\vocab\\vectors -> en_ushmm-0.0.16\\en_ushmm\\en_ushmm-0.0.16\\vocab\n", "Writing en_ushmm-0.0.16\\setup.cfg\n", "creating dist\n", "Creating tar archive\n", "removing 'en_ushmm-0.0.16' (and everything under it)\n", "[i] Building package artifacts: sdist\n", "[+] Including 1 Python module(s) with custom code\n", "[+] Loaded meta.json from file\n", "models\\rules_pipeline\\meta.json\n", "[+] Generated README.md from meta.json\n", "[+] Successfully created package 'en_ushmm-0.0.16'\n", "models\\en_ushmm-0.0.16\n", "[+] Successfully created zipped Python package\n", "models\\en_ushmm-0.0.16\\dist\\en_ushmm-0.0.16.tar.gz\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2021-09-22 13:26:21.465308: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll\n", "c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages\\spacy\\util.py:730: UserWarning: [W095] Model 'en_ushmm' (0.0.16) was trained with spaCy v3.0 and may not be 100% compatible with the current version (3.1.1). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. 
For more details and available updates, run: python -m spacy validate\n", " warnings.warn(warn_msg)\n", "warning: no files found matching 'LICENSE'\n", "warning: no files found matching 'LICENSES_SOURCES'\n" ] } ], "source": [ "!spacy package models/rules_pipeline models --code data/components.py --force" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 4 }