{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## NLP Preprocessing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [], "source": [ "from fastai.gen_doc.nbdoc import *\n", "from fastai.text import * \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`text.tranform` contains the functions that deal behind the scenes with the two main tasks when preparing texts for modelling: *tokenization* and *numericalization*.\n", "\n", "*Tokenization* splits the raw texts into tokens (which can be words, or punctuation signs...). The most basic way to do this would be to separate according to spaces, but it's possible to be more subtle; for instance, the contractions like \"isn't\" or \"don't\" should be split in \\[\"is\",\"n't\"\\] or \\[\"do\",\"n't\"\\]. By default fastai will use the powerful [spacy tokenizer](https://spacy.io/api/tokenizer).\n", "\n", "*Numericalization* is easier as it just consists in attributing a unique id to each token and mapping each of those tokens to their respective ids." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tokenization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This step is actually divided in two phases: first, we apply a certain list of `rules` to the raw texts as preprocessing, then we use the tokenizer to split them in lists of tokens. Combining together those `rules`, the `tok_func`and the `lang` to process the texts is the role of the [`Tokenizer`](/text.transform.html#Tokenizer) class." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
### `class Tokenizer`

`Tokenizer(tok_func:Callable='SpacyTokenizer', lang:str='en', pre_rules:ListRules=None, post_rules:ListRules=None, special_cases:StrList=None, n_cpus:int=None)`
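The defaults above mean that `Tokenizer()` on its own already wires a `SpacyTokenizer` for English together with the default pre/post `rules`. The snippet below is a small illustrative sketch, not part of the original notebook: the constructions shown follow the signature above, and the French example assumes the corresponding spacy model is installed.

```python
from fastai.text import *

# Default construction: SpacyTokenizer for English with the default pre_rules/post_rules.
tok = Tokenizer()

# Equivalent, with the main arguments spelled out explicitly.
tok_en = Tokenizer(tok_func=SpacyTokenizer, lang='en')

# A tokenizer for another language supported by spacy
# (assumes the matching spacy language model is installed).
tok_fr = Tokenizer(tok_func=SpacyTokenizer, lang='fr')
```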
#### `process_text`

`process_text(t:str, tok:BaseTokenizer) → List[str]`
#### `process_all`

`process_all(texts:StrList) → List[List[str]]`
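As a hedged sketch of the two processing entry points: `process_text` works on a single string and expects an instantiated tokenizer backend, while `process_all` takes a list of texts (and can spread the work over `n_cpus`). The sample strings are invented, and the exact tokens returned depend on the rules and the spacy version.

```python
from fastai.text import *

tok = Tokenizer()
texts = ["I'm a big fan of fastai.", "Don't forget the contractions!"]  # toy inputs

# process_all: the usual entry point, tokenizes a whole list of texts.
all_tokens = tok.process_all(texts)                        # -> List[List[str]]

# process_text: one text at a time, given an instantiated BaseTokenizer.
spacy_tok = SpacyTokenizer('en')
one_text_tokens = tok.process_text(texts[0], spacy_tok)    # -> List[str]

print(all_tokens[0][:8], one_text_tokens[:8])
```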
### `class BaseTokenizer`

`BaseTokenizer(lang:str)`
#### `tokenizer`

`tokenizer(t:str) → List[str]`
#### `add_special_cases`

`add_special_cases(toks:StrList)`
### `class SpacyTokenizer`

`SpacyTokenizer(lang:str) :: BaseTokenizer`
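`SpacyTokenizer` is the `BaseTokenizer` implementation used by default; writing your own backend mostly means implementing `tokenizer` (and optionally `add_special_cases`), as the signatures above suggest. Below is a deliberately naive whitespace-splitting subclass, purely as a sketch of that interface and not something from the original notebook.

```python
from fastai.text import *

class WhitespaceTokenizer(BaseTokenizer):
    "Toy BaseTokenizer that simply splits on whitespace (illustration only)."
    def tokenizer(self, t:str) -> List[str]:
        return t.split()
    def add_special_cases(self, toks:Collection[str]):
        pass  # nothing to register for this naive backend

# Plug it into Tokenizer exactly like SpacyTokenizer.
tok = Tokenizer(tok_func=WhitespaceTokenizer, lang='en')
print(tok.process_all(["Don't forget the rules!"]))
```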
#### `deal_caps`

`deal_caps(x:StrList) → StrList`

#### `fix_html`

`fix_html(x:str) → str`

#### `replace_all_caps`

`replace_all_caps(x:StrList) → StrList`

#### `replace_rep`

`replace_rep(t:str) → str`

#### `replace_wrep`

`replace_wrep(t:str) → str`

#### `rm_useless_spaces`

`rm_useless_spaces(t:str) → str`

#### `spec_add_spaces`

`spec_add_spaces(t:str) → str`
"class
Vocab
[source][test]Vocab
(**`itos`**:`StrList`)\n",
"\n",
"No tests found for Vocab
. To contribute a test please refer to this guide and this discussion.
#### `create`

`create(tokens:Tokens, max_vocab:int, min_freq:int) → Vocab`
#### `numericalize`

`numericalize(t:StrList) → List[int]`
#### `textify`

`textify(nums:Collection[int], sep=' ') → List[str]`
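To close the loop on *numericalization*, here is a hedged sketch of the `Vocab` workflow built from the signatures above: create the vocabulary from tokenized texts with `create`, turn tokens into ids with `numericalize`, and go back with `textify`. The toy corpus is invented, and the printed ids depend on the tokens actually seen.

```python
from fastai.text import *

texts = ["I'm a big fan of fastai.", "fastai makes tokenization easy."]  # toy corpus
tokens = Tokenizer().process_all(texts)

# Build the vocabulary: keep at most max_vocab tokens seen at least min_freq times
# (plus the special tokens fastai always includes, such as the unknown and padding tokens).
vocab = Vocab.create(tokens, max_vocab=60000, min_freq=1)

ids = vocab.numericalize(tokens[0])   # tokens -> ids
text_back = vocab.textify(ids)        # ids -> tokens joined with sep (a space by default)
print(ids, text_back)
```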