{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NLP Preprocessing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [], "source": [ "from fastai.gen_doc.nbdoc import *\n", "from fastai.text import * \n", "from fastai import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`text.tranform` contains the functions that deal behind the scenes with the two main tasks when preparing texts for modelling: *tokenization* and *numericalization*.\n", "\n", "*Tokenization* splits the raw texts into tokens (wich can be words, or punctuation signs...). The most basic way to do this would be to separate according to spaces, but it's possible to be more subtle; for instance, the contractions like \"isn't\" or \"don't\" should be split in \\[\"is\",\"n't\"\\] or \\[\"do\",\"n't\"\\]. By default fastai will use the powerful [spacy tokenizer](https://spacy.io/api/tokenizer).\n", "\n", "*Numericalization* is easier as it just consists in attributing a unique id to each token and mapping each of those tokens to their respective ids." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tokenization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This step is actually divided in two phases: first, we apply a certain list of `rules` to the raw texts as preprocessing, then we use the tokenizer to split them in lists of tokens. Combining together those `rules`, the `tok_func`and the `lang` to process the texts is the role of the [`Tokenizer`](/text.transform.html#Tokenizer) class." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
#### class `Tokenizer` [source]

`Tokenizer(tok_func:Callable='SpacyTokenizer', lang:str='en', rules:ListRules=None, special_cases:StrList=None, n_cpus:int=None)`
"process_text
[source]process_text
(`t`:`str`, `tok`:[`BaseTokenizer`](/text.transform.html#BaseTokenizer)) → `List`\\[`str`\\]\n",
"\n",
"Processe one text `t` with tokenizer `tok`. "
],
"text/plain": [
"process_all
[source]process_all
(`texts`:`StrList`) → `List`\\[`List`\\[`str`\\]\\]\n",
"\n",
"Process a list of `texts`. "
],
"text/plain": [
"class
BaseTokenizer
[source]BaseTokenizer
(`lang`:`str`)\n",
"\n",
"Basic class for a tokenizer function. "
],
"text/plain": [
"tokenizer
[source]tokenizer
(`t`:`str`) → `List`\\[`str`\\]"
],
"text/plain": [
"add_special_cases
[source]add_special_cases
(`toks`:`StrList`)"
],
"text/plain": [
"class
SpacyTokenizer
[source]SpacyTokenizer
(`lang`:`str`) :: [`BaseTokenizer`](/text.transform.html#BaseTokenizer)\n",
"\n",
"Wrapper around a spacy tokenizer to make it a [`BaseTokenizer`](/text.transform.html#BaseTokenizer). "
],
"text/plain": [
"deal_caps
[source]deal_caps
(`t`:`str`) → `str`"
],
"text/plain": [
"fix_html
[source]fix_html
(`x`:`str`) → `str`"
],
"text/plain": [
"replace_rep
[source]replace_rep
(`t`:`str`) → `str`"
],
"text/plain": [
"replace_wrep
[source]replace_wrep
(`t`:`str`) → `str`"
],
"text/plain": [
"rm_useless_spaces
[source]rm_useless_spaces
(`t`:`str`) → `str`\n",
"\n",
"Remove multiple spaces in `t`. "
],
"text/plain": [
"spec_add_spaces
[source]spec_add_spaces
(`t`:`str`) → `str`\n",
"\n",
"Add spaces around / and # in `t`. "
],
"text/plain": [
"class
Vocab
[source]Vocab
(`itos`:`Dict`\\[`int`, `str`\\])"
],
"text/plain": [
"create
[source]create
(`tokens`:`Tokens`, `max_vocab`:`int`, `min_freq`:`int`) → `Vocab`"
],
"text/plain": [
"numericalize
[source]numericalize
(`t`:`StrList`) → `List`\\[`int`\\]\n",
"\n",
"Convert a list of tokens `t` to their ids. "
],
"text/plain": [
"textify
[source]textify
(`nums`:`Collection`\\[`int`\\], `sep`=`' '`) → `List`\\[`str`\\]\n",
"\n",
"Convert a list of `nums` to their tokens. "
],
"text/plain": [
"tokenizer
[source]tokenizer
(`t`:`str`) → `List`\\[`str`\\]"
],
"text/plain": [
"add_special_cases
[source]add_special_cases
(`toks`:`StrList`)"
],
"text/plain": [
"