{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<small><small><i>\n", "All the IPython Notebooks in **[Python Natural Language Processing](https://github.com/milaan9/Python_Python_Natural_Language_Processing)** lecture series by **[Dr. Milaan Parmar](https://www.linkedin.com/in/milaanparmar/)** are available @ **[GitHub](https://github.com/milaan9)**\n", "</i></small></small>" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "<a href=\"https://colab.research.google.com/github/milaan9/Python_Python_Natural_Language_Processing/blob/main/07_Sentence_Segmentation.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" ] }, { "cell_type": "markdown", "metadata": { "id": "72BScxzsCA86" }, "source": [ "# 07 Sentence Segmentation\n", "\n", "Sentence segmentation is the process of dividing a text into sentences: longer processing units consisting of one or more words. The task involves identifying the sentence boundaries that separate the words of one sentence from those of the next.\n", "\n", "In **spaCy Basics** we saw briefly how Doc objects are divided into sentences. In this section we'll learn how sentence segmentation works, and how to set our own segmentation rules." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "K5WFCfx7CA87" }, "outputs": [], "source": [ "# Perform standard imports\n", "import spacy\n", "nlp = spacy.load('en_core_web_sm')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "HN_2-YB2CA9B", "outputId": "8c2a1fc6-3355-4c2b-e5b1-0e05c09698c3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is the first sentence.\n", "This is another sentence.\n", "This is the last sentence.\n" ] } ], "source": [ "# From Spacy Basics:\n", "doc = nlp(u'This is the first sentence. This is another sentence. 
This is the last sentence.')\n", "\n", "for sent in doc.sents:\n", " print(sent)" ] }, { "cell_type": "markdown", "metadata": { "id": "Mru6ngfWCA9G" }, "source": [ "### `Doc.sents` is a generator\n", "It is important to note that `doc.sents` is a *generator* that yields sentence spans one at a time. This means that, while you can print the second Doc token with `print(doc[1])`, you can't index the \"second Doc sentence\" with `print(doc.sents[1])`:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "_ODLi_omCA9G", "outputId": "e245d23a-4f45-476c-8876-d009347d8197" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "is\n" ] } ], "source": [ "print(doc[1])" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 163 }, "id": "OuxdqmH_CA9K", "outputId": "257aadeb-d728-466d-f442-6113b2985dc8" }, "outputs": [ { "ename": "TypeError", "evalue": "ignored", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m<ipython-input-4-2bc012eee1da>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdoc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msents\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mTypeError\u001b[0m: 'generator' object is not subscriptable" ] } ], "source": [ "print(doc.sents[1])" ] }, { "cell_type": "markdown", "metadata": { "id": "S_Dp1yuFCA9O" }, "source": [ "However, you *can* build a sentence collection by running `doc.sents` and saving the result to a list:" ] }, { "cell_type": "code", "execution_count": 5, 
"metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "bbWZJKLSCA9P", "outputId": "c907d46b-d0b5-4781-f810-730f73347b52" }, "outputs": [ { "data": { "text/plain": [ "[This is the first sentence.,\n", " This is another sentence.,\n", " This is the last sentence.]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "doc_sents = [sent for sent in doc.sents]\n", "doc_sents" ] }, { "cell_type": "markdown", "metadata": { "id": "ep0zsLHyCA9T" }, "source": [ "<font color=green>**NOTE**: `list(doc.sents)` also works. We show a list comprehension because it lets you add conditionals to filter the sentences.</font>" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "JTTju4hiCA9T", "outputId": "34a1073d-a790-49bf-90ef-2fa786139c4c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is another sentence.\n" ] } ], "source": [ "# Now you can access individual sentences:\n", "print(doc_sents[1])" ] }, { "cell_type": "markdown", "metadata": { "id": "_8s4hUYHCA9Y" }, "source": [ "### `sents` are Spans\n", "At first glance it looks like each `sent` contains text from the original Doc object. In fact they're just Spans with start and end token pointers." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Pf7CtGexCA9Y", "outputId": "7d81f3e3-310a-4e7b-f8d4-532b8186544c" }, "outputs": [ { "data": { "text/plain": [ "spacy.tokens.span.Span" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(doc_sents[1])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "K7bvy3HJCA9d", "outputId": "3f1168a0-8db4-48d2-c55a-624abb1c18e0" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6 11\n" ] } ], "source": [ "print(doc_sents[1].start, doc_sents[1].end)" ] }, { "cell_type": "markdown", "metadata": { "id": "NKE_3h9kCA9g" }, "source": [ "## Adding Rules\n", "spaCy's built-in sentence segmentation relies on the dependency parse and end-of-sentence punctuation to determine sentence boundaries. We can add rules of our own, but they have to be added *before* the Doc object is created, because sentence-start tokens are assigned during the nlp pipeline:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "tJPRiEwbCA9h", "outputId": "4dd9edfe-80ae-4eb6-e33f-44f5d3ae1e50" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True This\n", "None is\n", "None a\n", "None sentence\n", "None .\n", "True This\n", "None is\n", "None a\n", "None sentence\n", "None .\n", "True This\n", "None is\n", "None a\n", "None sentence\n", "None .\n" ] } ], "source": [ "# Parsing the segmentation start tokens happens during the nlp pipeline\n", "doc2 = nlp(u'This is a sentence. This is a sentence. 
This is a sentence.')\n", "\n", "for token in doc2:\n", " print(token.is_sent_start, ' '+token.text)" ] }, { "cell_type": "markdown", "metadata": { "id": "6ZNJtaY8CA9p" }, "source": [ "<font color=green>Notice we haven't run `doc2.sents`, and yet `token.is_sent_start` was set to True on the three sentence-start tokens in the Doc.</font>" ] }, { "cell_type": "markdown", "metadata": { "id": "qyo3h8YxCA9p" }, "source": [ "Let's add a semicolon to our existing segmentation rules. That is, whenever the sentencizer encounters a semicolon, the next token should start a new segment." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "mEqnYvN3CA9q", "outputId": "96dac83c-733d-4f4c-8add-48e2a5f07625" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\"Management is doing things right; leadership is doing the right things.\"\n", "-Peter\n", "Drucker\n" ] } ], "source": [ "# SPACY'S DEFAULT BEHAVIOR\n", "doc3 = nlp(u'\"Management is doing things right; leadership is doing the right things.\" -Peter Drucker')\n", "\n", "for sent in doc3.sents:\n", " print(sent)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "mM0uFppqCA9t", "outputId": "4c48100f-cf00-40bd-dd28-074fe27f714a" }, "outputs": [ { "data": { "text/plain": [ "['tagger', 'set_custom_boundaries', 'parser', 'ner']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# ADD A NEW RULE TO THE PIPELINE\n", "def set_custom_boundaries(doc):\n", " for token in doc[:-1]:\n", " if token.text == ';':\n", " doc[token.i+1].is_sent_start = True\n", " return doc\n", "\n", "nlp.add_pipe(set_custom_boundaries, before='parser')\n", "\n", "nlp.pipe_names" ] }, { "cell_type": "markdown", "metadata": { "id": "zUSJaV-JCA9w" }, "source": [ "<font color=green>The new rule has to run before the document is parsed. 
Here we can either pass the argument `before='parser'` or `first=True`.</font>" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cDaDbRpaCA9w", "outputId": "ebdbf8de-5071-489e-87e0-dbde607158a8" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\"Management is doing things right;\n", "leadership is doing the right things.\"\n", "-Peter\n", "Drucker\n" ] } ], "source": [ "# Re-run the Doc object creation:\n", "doc4 = nlp(u'\"Management is doing things right; leadership is doing the right things.\" -Peter Drucker')\n", "\n", "for sent in doc4.sents:\n", " print(sent)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "zzSLf3MPCA9z", "outputId": "63411db1-c2dc-4985-c9cd-84551b4081ae" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\"Management is doing things right; leadership is doing the right things.\"\n", "-Peter\n", "Drucker\n" ] } ], "source": [ "# And yet the new rule doesn't apply to the older Doc object:\n", "for sent in doc3.sents:\n", " print(sent)" ] }, { "cell_type": "markdown", "metadata": { "id": "WStY2VAUCA92" }, "source": [ "### Why not change the token directly?\n", "Why not simply set the `.is_sent_start` value to True on existing tokens?" 
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "UOZK5mx1CA93", "outputId": "eb56b3a3-85b9-4609-f61b-749eed8d7a86" }, "outputs": [ { "data": { "text/plain": [ "leadership" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find the token we want to change:\n", "doc3[7]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 231 }, "id": "KmU9zBADCA96", "outputId": "d5d54b23-202f-4eb5-acb2-054ffbfcb129" }, "outputs": [ { "ename": "ValueError", "evalue": "ignored", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m<ipython-input-15-bcec3fe6a9a2>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Try to change the .is_sent_start attribute:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mdoc3\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m7\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_sent_start\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32mtoken.pyx\u001b[0m in \u001b[0;36mspacy.tokens.token.Token.is_sent_start.__set__\u001b[0;34m()\u001b[0m\n", "\u001b[0;31mValueError\u001b[0m: [E043] Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state." 
] } ], "source": [ "# Try to change the .is_sent_start attribute:\n", "doc3[7].is_sent_start = True" ] }, { "cell_type": "markdown", "metadata": { "id": "oYoKtWkYCA99" }, "source": [ "<font color=green>spaCy refuses to change the tag after the document is parsed to prevent inconsistencies in the data.</font>" ] }, { "cell_type": "markdown", "metadata": { "id": "clDt_qfcCA99" }, "source": [ "## Changing the Rules\n", "In some cases we want to *replace* spaCy's default sentencizer with our own set of rules. In this section we'll see how the default sentencizer breaks on periods. We'll then replace this behavior with a sentencizer that breaks on linebreaks." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "5pdvHYbrCA9-", "outputId": "4cf44438-8b17-44f0-aee8-62ff9579ebbc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['This', 'is', 'a', 'sentence', '.']\n", "['This', 'is', 'another', '.', '\\n\\n']\n", "['This', 'is', 'a', '\\n', 'third', 'sentence', '.']\n" ] } ], "source": [ "nlp = spacy.load('en_core_web_sm') # reset to the original\n", "\n", "mystring = u\"This is a sentence. 
This is another.\\n\\nThis is a \\nthird sentence.\"\n", "\n", "# SPACY DEFAULT BEHAVIOR:\n", "doc = nlp(mystring)\n", "\n", "for sent in doc.sents:\n", " print([token.text for token in sent])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "id": "HnOtIrNxCA-B" }, "outputs": [], "source": [ "# CHANGING THE RULES\n", "from spacy.pipeline import SentenceSegmenter\n", "\n", "def split_on_newlines(doc):\n", " start = 0\n", " seen_newline = False\n", " for word in doc:\n", " if seen_newline:\n", " yield doc[start:word.i]\n", " start = word.i\n", " seen_newline = False\n", " elif word.text.startswith('\\n'): # handles multiple occurrences\n", " seen_newline = True\n", " yield doc[start:] # handles the last group of tokens\n", "\n", "\n", "sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)\n", "nlp.add_pipe(sbd)" ] }, { "cell_type": "markdown", "metadata": { "id": "hM_IV7i_CA-D" }, "source": [ "<font color=green>While the function `split_on_newlines` can be named anything we want, the SentenceSegmenter component registers itself in the pipeline under the name `sbd`.</font>" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "l5FncSQMCA-E", "outputId": "4ae99939-65cd-4f15-eac9-367f5a7168ac" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['This', 'is', 'a', 'sentence', '.', 'This', 'is', 'another', '.', '\\n\\n']\n", "['This', 'is', 'a', '\\n']\n", "['third', 'sentence', '.']\n" ] } ], "source": [ "doc = nlp(mystring)\n", "for sent in doc.sents:\n", " print([token.text for token in sent])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DUWmJ9ErSggm" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "7_Sentence_Segmentation.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", 
"version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 1 }