{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Neural NER with spacy\n", "In this hands-on, we use https://spacy.io, a framework for all basic NLP processing steps. It supports several languages out-of-the-box:\n", "\n", " - Tokenization/Word segmentation\n", " - Sentence splitting\n", " - Part-of-Speech tagging\n", " - Lemmatization\n", " - NER\n", " - Dependency Parsing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hands-on\n", "Downloading a small model for English trained on modern Web data. (Medium models end with `md`, large models ends with `lg`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python -m spacy download en_core_web_sm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Setting up an NLP pipeline for English:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import spacy\n", "nlp = spacy.load(\"en_core_web_sm\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading and linguistic processing of a single sentence. \n", "\n", "More information on the meaning of the results can be found under https://spacy.io/usage/linguistic-features .\n", "Note that the human readable properties of word tokens always end with underscore." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')\n", "for token in doc:\n", " print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,\n", " token.shape_, token.is_alpha, token.is_stop)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The spacy default NLP pipeline for English includes NER. See https://spacy.io/usage/linguistic-features#named-entities for more information.\n", "\n", "Let's look at the text of all found named entities: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for ent in doc.ents:\n", " print(ent.text, ent.start_char, ent.end_char, ent.label_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accessing IOB information at the token level:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print([doc[0].text, doc[0].ent_iob_, doc[0].ent_type_])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print([doc[9].text, doc[9].ent_iob_, doc[9].ent_type_])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More information on accessing NER information: https://spacy.io/usage/linguistic-features#accessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Working with different language\n", "If you want to work on French data, use the following commands for setting up an NLP pipeline." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python -m spacy download fr_core_news_sm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# without this line spacy is not able to find the downloaded model\n", "\n", "! python -m spacy link --force fr_core_news_sm fr_core_news_sm" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "nlp_fr = spacy.load(\"fr_core_news_sm\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See https://spacy.io/usage/models for more languages." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Testing the robustness\n", " - Try sentences from other domains than contemporary news.\n", " - Add noise (OCR errors, typos) to the text.\n", "\n", "How much do the results suffer?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "doc_fr = nlp_fr(u'''Apple envisage d'acheter une startup britannique pour 1 milliard de dollars''')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for ent in doc_fr.ents:\n", " print(ent.text, ent.start_char, ent.end_char, ent.label_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next step: Combining rule-based and statistical NER\n", "A tutorial how to use rule-based pattern matchers in spacy can be found here: https://spacy.io/usage/rule-based-matching#entityruler\n", "\n", "A nice example is the rule-based addition of person titles to named entities that have been recognized them by statistical NER: https://spacy.io/usage/rule-based-matching#models-rules-ner" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next step: Online training of existing models\n", "All spacy NER models can be updated easily by further training them on new labeled examples.\n", "The relevant documentation and sample code of spacy can be found here: https://spacy.io/usage/training#ner\n", "\n", "There is a step-by-step tutorial that shows how an existing model can be adapted to your own data:\n", "https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }