{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Processing of Archival Data Sources\n", "### Using SpaCy NLP and Regular Expressions\n", "* Contributors: Emily Higgs, Gregory Jansen, Richard Marciano, & Dev Pradhan\n", "* Source Available: https://github.com/cases-umd/bdarchives-nlp\n", "* License: [Creative Commons - Attribute 4.0 Intl](https://creativecommons.org/licenses/by/4.0/)\n", "* [Lesson Plan for Instructors](./lesson-plan.ipynb) (TBA)\n", "\n", "## Introduction\n", "These notebooks showcase a workflow for performing text processing of a \"data sheet\" from\n", "archival source material.\n", "\n", "\"Describe\n", "\n", "### Further Information\n", "Learn more about this project at the [website](http://researchwebsite.example.org)\n", "\n", "## Objectives\n", "These notebooks will familiarize the reader with standard practices of NLP for extraction of topics and\n", "other data from text records.\n", "\n", "## Learning Goals\n", "* Computational Practices:\n", " * [Collecting Data](#collecting_data)\n", " * [Modeling Data](#modeling_Data)\n", "* Archival Practices:\n", " * ??" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Software and Tools\n", "These notebooks are written in Python 3 and use an NLP module called SpaCy to process text.\n", "\n", "* [SpaCy](https://spacy.io/): Industrial Strength Natural Language Processing (in Python)\n", "\n", "In order to use SpaCy in this Jupyter environment, you must first load the language model.\n", "This is a model of the language that you are processing. We use a small English model for \n", "these examples. All you have to do is execute the code block below and wait for the hourglass, shown\n", "in your browser tab, to disappear when the task is done." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting en_core_web_sm==2.2.5\n", " Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)\n", "\u001b[K |████████████████████████████████| 12.0 MB 2.0 MB/s eta 0:00:01\n", "\u001b[?25hRequirement already satisfied: spacy>=2.2.2 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from en_core_web_sm==2.2.5) (2.2.4)\n", "Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (4.44.1)\n", "Requirement already satisfied: blis<0.5.0,>=0.4.0 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (0.4.1)\n", "Requirement already satisfied: numpy>=1.15.0 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.18.2)\n", "Requirement already satisfied: plac<1.2.0,>=0.9.6 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.1.3)\n", "Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.0.0)\n", "Requirement already satisfied: requests<3.0.0,>=2.13.0 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (2.23.0)\n", "Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (3.0.2)\n", "Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (2.0.3)\n", "Requirement already satisfied: setuptools in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (46.1.3)\n", "Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (0.6.0)\n", "Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.0.2)\n", "Requirement already satisfied: thinc==7.4.0 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (7.4.0)\n", "Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.0.2)\n", "Requirement already satisfied: importlib-metadata>=0.20; python_version < \"3.8\" in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->en_core_web_sm==2.2.5) (1.6.0)\n", "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (1.25.8)\n", "Requirement already satisfied: chardet<4,>=3.0.2 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (3.0.4)\n", "Requirement already satisfied: idna<3,>=2.5 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (2.9)\n", "Requirement already satisfied: certifi>=2017.4.17 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (2019.11.28)\n", "Requirement already satisfied: zipp>=0.5 in /home/jansen/.local/share/virtualenvs/bdarchives-nlp-worlW0cl/lib/python3.6/site-packages (from importlib-metadata>=0.20; python_version < \"3.8\"->catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->en_core_web_sm==2.2.5) (3.1.0)\n", "Building wheels for collected packages: en-core-web-sm\n", " Building wheel for en-core-web-sm (setup.py) ... \u001b[?25ldone\n", "\u001b[?25h Created wheel for en-core-web-sm: filename=en_core_web_sm-2.2.5-py3-none-any.whl size=12011738 sha256=71b0462de38b85f8316cda10a461572d1db5ea4889a6e3e37429e3b73fa0ff6e\n", " Stored in directory: /tmp/pip-ephem-wheel-cache-l_vwk4ot/wheels/b5/94/56/596daa677d7e91038cbddfcf32b591d0c915a1b3a3e3d3c79d\n", "Successfully built en-core-web-sm\n", "Installing collected packages: en-core-web-sm\n", "Successfully installed en-core-web-sm-2.2.5\n", "\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n", "You can now load the model via spacy.load('en_core_web_sm')\n" ] } ], "source": [ "import sys\n", "!{sys.executable} -m spacy download en_core_web_sm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Acquiring Data\n", "These example notebooks are configured to process plain text files in the `data` directory. However, you are free\n", "to supply your own interesting data. You may upload your files to a new folder and then point the notebooks are the new folder instead. You may wish to \n", "run the example code on the example data before taking this step.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Notebooks\n", "1. [TODO Data Overview and Exploration](data_overview.ipynb)\n", "1. [Spacy Text Processing for Named Entities](nlp_ner.ipynb)\n", "1. [Text Parsing with Regular Expressions](regex.ipynb)\n", "1. [TODO Visualization and Conclusion](notebook-4.ipynb)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }