{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## NLP Preprocessing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [], "source": [ "from fastai.gen_doc.nbdoc import *\n", "from fastai.text import * \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`text.tranform` contains the functions that deal behind the scenes with the two main tasks when preparing texts for modelling: *tokenization* and *numericalization*.\n", "\n", "*Tokenization* splits the raw texts into tokens (which can be words, or punctuation signs...). The most basic way to do this would be to separate according to spaces, but it's possible to be more subtle; for instance, the contractions like \"isn't\" or \"don't\" should be split in \\[\"is\",\"n't\"\\] or \\[\"do\",\"n't\"\\]. By default fastai will use the powerful [spacy tokenizer](https://spacy.io/api/tokenizer).\n", "\n", "*Numericalization* is easier as it just consists in attributing a unique id to each token and mapping each of those tokens to their respective ids." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tokenization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This step is actually divided in two phases: first, we apply a certain list of `rules` to the raw texts as preprocessing, then we use the tokenizer to split them in lists of tokens. Combining together those `rules`, the `tok_func`and the `lang` to process the texts is the role of the [`Tokenizer`](/text.transform.html#Tokenizer) class." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
class Tokenizer[source]Tokenizer(**`tok_func`**:`Callable`=***`'SpacyTokenizer'`***, **`lang`**:`str`=***`'en'`***, **`pre_rules`**:`ListRules`=***`None`***, **`post_rules`**:`ListRules`=***`None`***, **`special_cases`**:`StrList`=***`None`***, **`n_cpus`**:`int`=***`None`***)\n",
"\n",
"Put together rules and a tokenizer function to tokenize text with multiprocessing. "
],
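For instance (a quick sketch rather than part of the reference; the rules passed here are just some of the functions documented below), a [`Tokenizer`](/text.transform.html#Tokenizer) can be built with the defaults or with a custom list of rules:

```python
from fastai.text import *

# Default setup: spacy backend, English, fastai's default pre/post rules.
tok = Tokenizer()

# Same backend and language spelled out, with a custom list of pre-processing rules.
tok_custom = Tokenizer(tok_func=SpacyTokenizer, lang='en',
                       pre_rules=[fix_html, replace_rep, spec_add_spaces, rm_useless_spaces])
```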
"text/plain": [
"process_text[source]process_text(**`t`**:`str`, **`tok`**:[`BaseTokenizer`](/text.transform.html#BaseTokenizer)) → `List`\\[`str`\\]\n",
"\n",
"Process one text `t` with tokenizer `tok`. "
],
"text/plain": [
"process_all[source]process_all(**`texts`**:`StrList`) → `List`\\[`List`\\[`str`\\]\\]\n",
"\n",
"Process a list of `texts`. "
],
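A hedged usage example on made-up strings (the exact tokens you get back depend on the rules and on spacy, so only the shapes are commented here):

```python
from fastai.text import *

tok = Tokenizer()
texts = ["I'm so happy with this movie!", "Don't watch it. Really, DON'T."]

# One list of tokens per input text, computed with multiprocessing.
all_toks = tok.process_all(texts)
print(len(all_toks))     # 2: one token list per text
print(all_toks[0][:8])   # the first few tokens of the first text

# process_text handles a single string and needs an instantiated backend tokenizer.
single = tok.process_text(texts[0], SpacyTokenizer('en'))
```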
"text/plain": [
"class BaseTokenizer[source]BaseTokenizer(**`lang`**:`str`)\n",
"\n",
"Basic class for a tokenizer function. "
],
"text/plain": [
"tokenizer[source]tokenizer(**`t`**:`str`) → `List`\\[`str`\\]"
],
"text/plain": [
"add_special_cases[source]add_special_cases(**`toks`**:`StrList`)"
],
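The contract is small, so plugging in a different backend only means subclassing [`BaseTokenizer`](/text.transform.html#BaseTokenizer) and providing the two methods above. A minimal sketch (the `WhitespaceTokenizer` name is made up for illustration):

```python
from fastai.text import *

class WhitespaceTokenizer(BaseTokenizer):
    "Toy backend that simply splits on whitespace."
    def __init__(self, lang:str): self.lang = lang
    def tokenizer(self, t:str): return t.split()
    def add_special_cases(self, toks): pass  # nothing to register for this toy backend

ws_tok = Tokenizer(tok_func=WhitespaceTokenizer, lang='en')
```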
"text/plain": [
"class SpacyTokenizer[source]SpacyTokenizer(**`lang`**:`str`) :: [`BaseTokenizer`](/text.transform.html#BaseTokenizer)\n",
"\n",
"Wrapper around a spacy tokenizer to make it a [`BaseTokenizer`](/text.transform.html#BaseTokenizer). "
],
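A short, hedged example of using [`SpacyTokenizer`](/text.transform.html#SpacyTokenizer) on its own (the sentence is made up):

```python
from fastai.text import *

spacy_tok = SpacyTokenizer('en')
print(spacy_tok.tokenizer("I don't like broccoli."))
# spacy splits the contraction, so "don't" comes back as the tokens 'do' and "n't"

# Tokens registered as special cases will never be split apart by the tokenizer.
spacy_tok.add_special_cases(['xxbos', 'xxup', 'xxmaj'])
```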
"text/plain": [
"deal_caps[source]deal_caps(**`x`**:`StrList`) → `StrList`"
],
"text/plain": [
"fix_html[source]fix_html(**`x`**:`str`) → `str`"
],
"text/plain": [
"replace_all_caps[source]replace_all_caps(**`x`**:`StrList`) → `StrList`\n",
"\n",
"Add `TK_UP` for words in all caps in `x`. "
],
"text/plain": [
"replace_rep[source]replace_rep(**`t`**:`str`) → `str`"
],
"text/plain": [
"replace_wrep[source]replace_wrep(**`t`**:`str`) → `str`"
],
"text/plain": [
"rm_useless_spaces[source]rm_useless_spaces(**`t`**:`str`) → `str`\n",
"\n",
"Remove multiple spaces in `t`. "
],
"text/plain": [
"spec_add_spaces[source]spec_add_spaces(**`t`**:`str`) → `str`\n",
"\n",
"Add spaces around / and # in `t`. "
],
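To make these rules concrete, here is a hedged illustration on made-up inputs (the `xxrep`, `xxwrep`, `xxup` markers are fastai's default special tokens; check your version if you rely on their exact form):

```python
from fastai.text import *

# String rules, applied to raw texts before tokenization.
print(fix_html('Best scene ever!<br />Would watch again'))   # undoes HTML leftovers such as <br />
print(spec_add_spaces('great#actor'))                         # spaces around '#' and '/'
print(rm_useless_spaces('so   many    spaces'))               # collapses runs of spaces
print(replace_rep('This was soooooo good'))                   # repeated characters -> 'xxrep' marker
print(replace_wrep('it was very very very very very dull'))   # repeated words -> 'xxwrep' marker

# Token-list rules, applied after tokenization.
print(replace_all_caps(['I', 'AM', 'SHOUTING']))   # all-caps words are lowercased and flagged with TK_UP
print(deal_caps(['Amazing', 'movie']))             # capitalized words are lowercased and marked
```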
"text/plain": [
"class Vocab[source]Vocab(**`itos`**:`StrList`)\n",
"\n",
"Contain the correspondence between numbers and tokens and numericalize. "
],
"text/plain": [
"create[source]create(**`tokens`**:`Tokens`, **`max_vocab`**:`int`, **`min_freq`**:`int`) → `Vocab`\n",
"\n",
"Create a vocabulary from a set of `tokens`. "
],
"text/plain": [
"numericalize[source]numericalize(**`t`**:`StrList`) → `List`\\[`int`\\]\n",
"\n",
"Convert a list of tokens `t` to their ids. "
],
"text/plain": [
"textify[source]textify(**`nums`**:`Collection`\\[`int`\\], **`sep`**=***`' '`***) → `List`\\[`str`\\]\n",
"\n",
"Convert a list of `nums` to their tokens. "
],
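Putting it all together, a hedged end-to-end sketch on made-up texts (the `max_vocab` and `min_freq` values are arbitrary):

```python
from fastai.text import *

texts = ["I loved this movie", "I really hated this movie"]
toks = Tokenizer().process_all(texts)

# Build the token <-> id correspondence from the tokenized texts.
vocab = Vocab.create(toks, max_vocab=60000, min_freq=1)

ids = vocab.numericalize(toks[0])   # one id per token
print(ids)
print(vocab.textify(ids))           # ids joined back into a readable string of tokens
```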
"text/plain": [
"tokenizer[source]tokenizer(**`t`**:`str`) → `List`\\[`str`\\]"
],
"text/plain": [
"add_special_cases[source]add_special_cases(**`toks`**:`StrList`)"
],
"text/plain": [
"