{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Welcome\n", "\n",
"| Erik Arakelyan | Nadav Borenstein | Ruixiang Cui | Karolina Stańczak |\n",
"|---|---|---|---|\n",
"\n",
"* `docker ps -q`\n",
"* `docker exec -it _container-id_ _command_`\n",
"* `docker exec -it 8c16b8de4771 python --version`\n",
"\n",
"\n",
"### Managing your changes\n",
"\n",
"There are several ways to keep your own changes to the official repo organised. Two of them are:\n",
"* Create your own [fork](https://help.github.com/en/articles/fork-a-repo) of the repo. The fork can be [synced](https://help.github.com/en/articles/syncing-a-fork?query=f) with the official course repo when new changes are available, while you maintain your own changes in the fork.\n",
"* Keep your changes in a local branch on your computer (`git checkout -b _your-branch-name_`). Each time the course repo changes, you can pull the repo and merge the changes into your branch (`git merge origin/master`)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----\n",
"\n",
"## Tokenisation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tokenisation is an important pre-processing step for NLP models. \n",
"\n",
"You can tokenise text at different levels: split it into sentences, tokens, subwords, etc. \n",
"\n",
"There are many corner cases, as well as language-specific and domain-specific cases, which have to be handled in different ways.\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['The office is open between 10 a',\n",
" '.',\n",
" 'm',\n",
" '.',\n",
" ' and 1 p',\n",
" '.',\n",
" 'm',\n",
" '.',\n",
" ' every day',\n",
" '.',\n",
" '',\n",
" '.',\n",
" '',\n",
" '.',\n",
" ' Please, be respective of the hours',\n",
" '.',\n",
" '']"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import re\n",
"\n",
"text_sentences = \"The office is open between 10 a.m. and 1 p.m. every day... Please, be respective of the hours.\"\n",
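"# Naive sentence splitting on '.', '!' and '?'; the capturing group keeps the delimiters in the result\n",
"re.split(r'(\\.|!|\\?)', text_sentences)"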
]
},
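{
"cell_type": "markdown",
"metadata": {},
"source": [
"A slightly less naive sketch, added here for illustration, splits only on whitespace that follows '.', '!' or '?' by using a lookbehind. The punctuation stays attached to each piece, but the text is still cut after abbreviations such as \"a.m.\" and \"p.m.\", which suggests that a handful of rules is not enough. The variable name `naive_sentences` is an assumption made for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"text_sentences = \"The office is open between 10 a.m. and 1 p.m. every day... Please, be respective of the hours.\"\n",
"\n",
"# Split on whitespace that is preceded by sentence-final punctuation;\n",
"# the lookbehind does not consume the punctuation, so it stays in each piece\n",
"naive_sentences = re.split(r'(?<=[.!?])\\s+', text_sentences)\n",
"for sentence in naive_sentences:\n",
"    print(repr(sentence))"
]
},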
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Luckily, there are libraries providing tokenisation functionality that handles most of these cases. Let's look at two of the most common libraries for tokenisation:\n",
"\n",
"### spaCy"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"✔ Download and installation successful\n",
"You can now load the package via spacy.load('en_core_web_sm')\n",
"✔ Download and installation successful\n",
"You can now load the package via spacy.load('fr_core_news_sm')\n"
]
}
],
"source": [
"# download the language models, this can be done for other languages as well\n",
"!python -m spacy download en_core_web_sm # You might have to restart the notebook if the file cannot be found\n",
"!python -m spacy download fr_core_news_sm"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[The office is open between 10 a.m. and 1 p.m. every day...,\n",
" Please, be respective of the hours.]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import spacy\n",
"\n",
"nlp = spacy.load(\"en_core_web_sm\")\n",
"doc = nlp(text_sentences)\n",
"list(doc.sents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### NLTK"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['The office is open between 10 a.m. and 1 p.m. every day...',\n",
" 'Please, be respective of the hours.']"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import nltk\n",
"\n",
"nltk.tokenize.sent_tokenize(text_sentences)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Word-level tokenisation"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Common English tokenisation\n",
"['Mr.', \"O'Neill\", 'thinks', 'that', 'the', 'boys', \"'\", 'stories', 'about', 'Chile', \"'s\", 'capital', 'are', \"n't\", 'amusing', '...', 'Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '!', '!', '!', 'Thanks..']\n",
"['Mr.', \"O'Neill\", 'thinks', 'that', 'the', 'boys', \"'\", 'stories', 'about', 'Chile', \"'s\", 'capital', 'are', \"n't\", 'amusing', '...', 'Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '!', '!', '!', 'Thanks', '..']\n",
"\n",
"Tweet tokenisation\n",
"['https', ':', '//t.co/9z2J3P33Uc', 'Hey', '@', 'NLPer', '!', 'This', 'is', 'a', '#', 'NLProc', 'tweet', ':', '-D']\n",
"['https://t.co/9z2J3P33Uc', 'Hey', '@NLPer', '!', 'This', 'is', 'a', '#', 'NLProc', 'tweet', ':-D']\n",
"\n",
"Tokenisation of a noisy tweet\n",
"['UserAnonym123', 'What', \"'s\", 'your', 'timezone_', '!', '@', '#', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_+', '0123456']\n",
"['UserAnonym123', 'What', \"'s\", 'your', 'timezone_!@', '#', '!', '@#$%^&*()_+', '0123456']\n"
]
}
],
"source": [
"text = \"Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing... Good muffins cost $3.88 in New York. Please buy me two of them!!! Thanks..\"\n",
"text_tweet = \"https://t.co/9z2J3P33Uc Hey @NLPer! This is a #NLProc tweet :-D\"\n",
"noisy_tweet = \"UserAnonym123 What's your timezone_!@# !@#$%^&*()_+ 0123456\"\n",
"\n",
"print('Common English tokenisation')\n",
"print(nltk.word_tokenize(text))\n",
"print([token.text for token in nlp(text)])\n",
"\n",
"print('\\nTweet tokenisation')\n",
"print(nltk.word_tokenize(text_tweet))\n",
"print([token.text for token in nlp(text_tweet)])\n",
"\n",
"print('\\nTokenisation of a noisy tweet')\n",
"print(nltk.word_tokenize(noisy_tweet))\n",
"print([token.text for token in nlp(noisy_tweet)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both libraries perform almost identically when tokenising common English text, so the choice depends on which library you will use for other features. \n",
"\n",
"When it comes to tweets, NLTK's default tokeniser performs poorly, but NLTK also provides the TweetTokenizer, which is designed for tweet tokenisation."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['https://t.co/9z2J3P33Uc', 'Hey', '@NLPer', '!', 'This', 'is', 'a', '#NLProc', 'tweet', ':-D']\n",
"['UserAnonym', '123', \"What's\", 'your', 'timezone', '_', '!', '@', '#', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '+', '0123456']\n"
]
}
],
"source": [
"tweet_tokenizer = nltk.tokenize.TweetTokenizer()\n",
"print(tweet_tokenizer.tokenize(text_tweet))\n",
"print(tweet_tokenizer.tokenize(noisy_tweet))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you saw, the default NLTK and spaCy tokenisers split negation contractions such as \"aren't\" into \"are\" and \"n't\", which follows the Penn Treebank guidelines. Such tokenisation can be useful when building sentiment classification or information extraction systems. \n",
"\n",
"Questions:\n",
"- How should we split \"12-ft\" in \"I bought a 12-ft boat!\"? Into 1, 2, or 3 tokens?\n",
"- How should we tokenise \"It is a 2850m distance flight.\", \"The maximum speed on the autobahn is 130km/h.\"? \n",
"\n",
"Here, too, there is a rule: units are split from numerical values. Let's test how the tokenisers handle this:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Spacy tokeniser\n",
"['I', 'bought', 'a', '12', '-', 'ft', 'boat', '!']\n",
"['It', 'is', 'a', '2850', 'm', 'distance', 'flight', '.']\n",
"['The', 'maximum', 'speed', 'on', 'the', 'autobahn', 'is', '130', 'km/h', '.']\n",
"\n",
"NLTK simple tokeniser\n",
"['I', 'bought', 'a', '12-ft', 'boat', '!']\n",
"['It', 'is', 'a', '2850m', 'distance', 'flight', '.']\n",
"['The', 'maximum', 'speed', 'on', 'the', 'autobahn', 'is', '130km/h', '.']\n"
]
}
],
"source": [
"print('Spacy tokeniser')\n",
"print([token.text for token in nlp(\"I bought a 12-ft boat!\")])\n",
"print([token.text for token in nlp(\"It is a 2850m distance flight.\")])\n",
"print([token.text for token in nlp(\"The maximum speed on the autobahn is 130km/h.\")])\n",
"\n",
"print('\\nNLTK simple tokeniser')\n",
"print(nltk.tokenize.word_tokenize(\"I bought a 12-ft boat!\"))\n",
"print(nltk.tokenize.word_tokenize(\"It is a 2850m distance flight.\"))\n",
"print(nltk.tokenize.word_tokenize(\"The maximum speed on the autobahn is 130km/h.\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Language dependent tokenisation\n",
"\n",
"While some languages have similar rules for tokenisation, other languages are quite different.\n",
"In French, some words that were originally composed of more than one lexical unit nowadays form a single lexical unit and should thus be recognised as a single token. As a result, an apostrophe marks a token boundary in some cases, but not in all. \n",
"\n",
"The following sentence \"On nous dit qu’aujourd’hui c’est le cas, encore faudra-t-il l’évaluer.\", which means \"We are told that this is the case today, it still needs to be assessed.\" has the following correct tokenisation:\n",
"\n",
"'On', 'nous', 'dit', 'qu’', 'aujourd’hui', 'c’', 'est', 'le', 'cas', ',', 'encore', 'faudra', '-t-il', 'l’', 'évaluer', '.'\n",
"\n",
"Explanation:\n",
"- aujourd’hui (today) was originally composed of more than one lexical unit but nowadays forms a single lexical unit, so it should be recognised as a single token\n",
"- qu’aujourd’hui (that today) - que (that) is in contracted form (qu’) and has to be separated from the rest of the word\n",
"- c’est (this is) is ce, contracted to c’, combined with est and has to be split into two tokens\n",
"- l’évaluer (evaluate it) is two words, where the first one is in contracted form (l’) and has to be separated\n",
"- faudra-t-il (will it take) - consists of the verb (faudra) and the pronoun (il); the -t- is only inserted to prevent two vowels from clashing and should not become a token of its own"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['On', 'nous', 'dit', 'qu', '’', 'aujourd', '’', 'hui', 'c', '’', 'est', 'le', 'cas', ',', 'encore', 'faudra-t-il', 'l', '’', 'évaluer', '.']\n",
"['On', 'nous', 'dit', 'qu’aujourd’hui', 'c’est', 'le', 'cas', ',', 'encore', 'faudra', '-', 't', '-', 'il', 'l’évaluer', '.']\n"
]
}
],
"source": [
"# NLTK with its default (English) settings\n",
"print(nltk.tokenize.word_tokenize(\"On nous dit qu’aujourd’hui c’est le cas, encore faudra-t-il l’évaluer.\"))\n",
"# spaCy with the English model loaded earlier\n",
"print([token.text for token in nlp(\"On nous dit qu’aujourd’hui c’est le cas, encore faudra-t-il l’évaluer.\")])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use the language-specific tokenisation:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['On', 'nous', 'dit', 'qu’', 'aujourd’hui', 'c’', 'est', 'le', 'cas', ',', 'encore', 'faudra', '-t', '-il', 'l’', 'évaluer', '.']\n"
]
},
{
"data": {
"text/plain": [
"['On',\n",
" 'nous',\n",
" 'dit',\n",
" 'qu',\n",
" '’',\n",
" 'aujourd',\n",
" '’',\n",
" 'hui',\n",
" 'c',\n",
" '’',\n",
" 'est',\n",
" 'le',\n",
" 'cas',\n",
" ',',\n",
" 'encore',\n",
" 'faudra-t-il',\n",
" 'l',\n",
" '’',\n",
" 'évaluer',\n",
" '.']"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nlp_fr = spacy.load(\"fr_core_news_sm\")\n",
"print([token.text for token in nlp_fr(\"On nous dit qu’aujourd’hui c’est le cas, encore faudra-t-il l’évaluer.\")])\n",
"nltk.tokenize.word_tokenize(\"On nous dit qu’aujourd’hui c’est le cas, encore faudra-t-il l’évaluer.\", language='french')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Transformers\n",
"\n",
"[HuggingFace's](https://huggingface.co/docs/transformers/index) \"transformers\" is a Python package for training, using and deploying Transformer-based models (more on that in future lectures). Each transformer model (e.g. BERT, RoBERTa) has its own tokeniser, which must be used together with the model. That is, to use the transformer model \"BERT\", one must tokenise its inputs with the BERT tokeniser."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"from transformers import AutoTokenizer\n",
"bert_tokeniser = AutoTokenizer.from_pretrained(\"bert-base-uncased\") # The tokeniser of the model \"bert-base-uncased\""
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['the', 'maximum', 'speed', 'on', 'the', 'auto', '##bahn', 'is', '130', '##km', '/', 'h', '.']\n"
]
}
],
"source": [
"tokens = bert_tokeniser.tokenize(\"The maximum speed on the autobahn is 130km/h.\")\n",
"print(tokens)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The \"bert-base-uncased\" tokeniser works differently from the tokenisers of NLTK and spaCy. Instead of splitting a sentence following a set of rules, it uses a (learned) vocabulary, a set of tokens that it knows. The tokeniser tries to break the sentence into tokens from its vocabulary. If the tokeniser encounters a word that does not appear in the vocabulary, the word is split into \"word-pieces\", where each word-piece belongs to the vocabulary. For example, the word \"autobahn\" is not part of the vocabulary, so the tokeniser splits it into \"auto\" and \"##bahn\" (the \"##\" means that \"bahn\" should be merged with the token that comes before it when reconstructing the original sentence from the tokens)."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"False\n",
"True\n",
"True\n"
]
}
],
"source": [
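"print(\"autobahn\" in bert_tokeniser.vocab)  # the full word is not in the vocabulary\n",
"print(\"auto\" in bert_tokeniser.vocab)      # word-initial piece\n",
"print(\"##bahn\" in bert_tokeniser.vocab)    # continuation piece, stored with its \"##\" prefix"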
]
},
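{
"cell_type": "markdown",
"metadata": {},
"source": [
"The splitting step described above can be sketched as a greedy longest-match-first search. This is a simplified illustration under stated assumptions, not the actual HuggingFace implementation, which may differ in details: the helper `wordpiece_split` and the tiny `toy_vocab` set are hypothetical and only contain the pieces needed for the example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def wordpiece_split(word, vocab, unk_token=\"[UNK]\"):\n",
"    # Toy greedy longest-match-first split of a single word into vocabulary pieces\n",
"    pieces = []\n",
"    start = 0\n",
"    while start < len(word):\n",
"        end = len(word)\n",
"        piece = None\n",
"        # Try the longest remaining substring first and shrink it until it is in the vocabulary\n",
"        while start < end:\n",
"            candidate = word[start:end]\n",
"            if start > 0:\n",
"                candidate = \"##\" + candidate  # non-initial pieces carry the \"##\" prefix\n",
"            if candidate in vocab:\n",
"                piece = candidate\n",
"                break\n",
"            end -= 1\n",
"        if piece is None:\n",
"            return [unk_token]  # no piece matches, so the whole word is treated as unknown\n",
"        pieces.append(piece)\n",
"        start = end\n",
"    return pieces\n",
"\n",
"# Hypothetical toy vocabulary, assumed only for this illustration\n",
"toy_vocab = {\"auto\", \"##bahn\", \"speed\"}\n",
"print(wordpiece_split(\"autobahn\", toy_vocab))\n",
"print(wordpiece_split(\"speed\", toy_vocab))"
]
},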
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\"bert-base-uncased\" is an English-only model, so it cannot deal well with sentences in other languages:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['on', 'no', '##us', 'di', '##t', 'qu', '’', 'au', '##jou', '##rd', '’', 'hui', 'c', '’', 'est', 'le', 'cas', ',', 'encore', 'fa', '##ud', '##ra', '-', 't', '-', 'il', 'l', '’', 'eva', '##lu', '##er', '.']\n"
]
}
],
"source": [
"tokens = bert_tokeniser.tokenize(\"On nous dit qu’aujourd’hui c’est le cas, encore faudra-t-il l’évaluer.\")\n",
"print(tokens)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"CamemBERT, however, is a French language model, so it can tokenise the sentence in a more meaningful (but far from perfect) way.\n",
"Here, '▁' means that the token starts a new word."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['▁On', '▁nous', '▁dit', '▁qu', '’', 'aujourd', '’', 'hui', '▁c', '’', 'est', '▁le', '▁cas', ',', '▁encore', '▁faudra', '-', 't', '-', 'il', '▁l', '’', 'évaluer', '.']\n"
]
}
],
"source": [
"camembert_tokeniser = AutoTokenizer.from_pretrained(\"camembert-base\")\n",
"tokens = camembert_tokeniser.tokenize(\"On nous dit qu’aujourd’hui c’est le cas, encore faudra-t-il l’évaluer.\")\n",
"print(tokens)"
]
},
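{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the '▁' marker records where each word starts, the original sentence can be recovered from the tokens printed above. A minimal sketch using plain string handling; the helper name `detokenise` and the variable `camembert_tokens` are assumptions made for this example, not part of the transformers API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Tokens copied from the CamemBERT output above\n",
"camembert_tokens = ['▁On', '▁nous', '▁dit', '▁qu', '’', 'aujourd', '’', 'hui', '▁c', '’', 'est', '▁le', '▁cas', ',', '▁encore', '▁faudra', '-', 't', '-', 'il', '▁l', '’', 'évaluer', '.']\n",
"\n",
"def detokenise(pieces):\n",
"    # Concatenate the pieces, then turn every word-start marker back into a space\n",
"    return \"\".join(pieces).replace(\"▁\", \" \").strip()\n",
"\n",
"print(detokenise(camembert_tokens))"
]
},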
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some models are multilingual, and their tokenisers can process sentences from several languages. \"bert-base-multilingual-uncased\" (M-BERT) and \"xlm-roberta-base\" (XLM-RoBERTa) were each trained on text in about 100 different languages!"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"en: ['the', 'maximum', 'speed', 'on', 'the', 'autobahn', 'is', '130', '##km', '/', 'h', '.']\n",
"fr: ['on', 'nous', 'dit', 'qu', '[UNK]', 'aujourd', '[UNK]', 'hui', 'c', '[UNK]', 'est', 'le', 'cas', ',', 'encore', 'fa', '##udra', '-', 't', '-', 'il', 'l', '[UNK]', 'eva', '##lue', '##r', '.']\n",
"heb: ['אחד', 'הדבר', '##ים', 'ש', '##אני', 'ה', '##כי', 'או', '##ה', '##ב', 'ב', '##קו', '##פנה', '##גן', 'זה', 'מ', '##א', '##פים', 'עם', 'ה', '##ל']\n"
]
}
],
"source": [
"mbert_tokeniser = AutoTokenizer.from_pretrained(\"bert-base-multilingual-uncased\")\n",
"en_tokens = mbert_tokeniser.tokenize(\"The maximum speed on the autobahn is 130km/h.\")\n",
"\n",
"fr_tokens = mbert_tokeniser.tokenize(\"On nous dit qu’aujourd’hui c’est le cas, encore faudra-t-il l’évaluer.\")\n",
"\n",
"heb_tokens = mbert_tokeniser.tokenize(\"אחד הדברים שאני הכי אוהב בקופנהגן זה מאפים עם הל\")\n",
"print(\"en:\", en_tokens)\n",
"print(\"fr:\", fr_tokens)\n",
"print(\"heb:\", heb_tokens)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"en: ['▁The', '▁maximum', '▁speed', '▁on', '▁the', '▁auto', 'bahn', '▁is', '▁130', 'km', '/', 'h', '.']\n",
"fr: ['▁On', '▁nous', '▁dit', '▁qu', '’', 'aujourd', '’', 'hui', '▁c', '’', 'est', '▁le', '▁cas', ',', '▁encore', '▁faudra', '-', 't', '-', 'il', '▁l', '’', 'évaluer', '.']\n",
"heb: ['▁אחד', '▁הדברים', '▁שאני', '▁הכי', '▁אוהב', '▁בקו', 'פנה', 'גן', '▁זה', '▁מא', 'פים', '▁עם', '▁הל']\n"
]
}
],
"source": [
"xlm_tokeniser = AutoTokenizer.from_pretrained(\"xlm-roberta-base\")\n",
"en_tokens = xlm_tokeniser.tokenize(\"The maximum speed on the autobahn is 130km/h.\")\n",
"\n",
"fr_tokens = xlm_tokeniser.tokenize(\"On nous dit qu’aujourd’hui c’est le cas, encore faudra-t-il l’évaluer.\")\n",
"\n",
"heb_tokens = xlm_tokeniser.tokenize(\"אחד הדברים שאני הכי אוהב בקופנהגן זה מאפים עם הל\")\n",
"print(\"en:\", en_tokens)\n",
"print(\"fr:\", fr_tokens)\n",
"print(\"heb:\", heb_tokens)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### References:\n",
"- Introduction to spaCy and its features: https://spacy.io/usage/spacy-101\n",
"- NLTK tokenisation functionalities: https://www.nltk.org/api/nltk.tokenize.html\n",
"- HuggingFace's transformers and tokenisers: https://huggingface.co/docs/transformers/main_classes/tokenizer\n",
"- On rules and different languages: http://ceur-ws.org/Vol-2226/paper9.pdf\n",
"- Why do we need language-specific tokenisation: https://stackoverflow.com/questions/17314506/why-do-i-need-a-tokenizer-for-each-language"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Introduction to [PyTorch](https://pytorch.org/)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"