{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## NLP datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [], "source": [ "from fastai.gen_doc.nbdoc import *\n", "from fastai.text import * \n", "from fastai.gen_doc.nbdoc import *\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This module contains the [`TextDataset`](/text.data.html#TextDataset) class, which is the main dataset you should use for your NLP tasks. It automatically does the preprocessing steps described in [`text.transform`](/text.transform.html#text.transform). It also contains all the functions to quickly get a [`TextDataBunch`](/text.data.html#TextDataBunch) ready." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quickly assemble your data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should get your data in one of the following formats to make the most of the fastai library and use one of the factory methods of one of the [`TextDataBunch`](/text.data.html#TextDataBunch) classes:\n", "- raw text files in folders train, valid, test in an ImageNet style,\n", "- a csv where some column(s) gives the label(s) and the following one the associated text,\n", "- a dataframe structured the same way,\n", "- tokens and labels arrays,\n", "- ids, vocabulary (correspondence id to word) and labels.\n", "\n", "If you are assembling the data for a language model, you should define your labels as always 0 to respect those formats. The first time you create a [`DataBunch`](/basic_data.html#DataBunch) with one of those functions, your data will be preprocessed automatically. You can save it, so that the next time you call it is almost instantaneous. \n", "\n", "Below are the classes that help assembling the raw data in a [`DataBunch`](/basic_data.html#DataBunch) suitable for NLP." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
### class TextLMDataBunch

(**`train_dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), **`valid_dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), **`fix_dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)=***`None`***, **`test_dl`**:`Optional`\[[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)\]=***`None`***, **`device`**:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=***`None`***, **`dl_tfms`**:`Optional`\[`Collection`\[`Callable`\]\]=***`None`***, **`path`**:`PathOrStr`=***`'.'`***, **`collate_fn`**:`Callable`=***`'data_collate'`***, **`no_check`**:`bool`=***`False`***) :: [`TextDataBunch`](/text.data.html#TextDataBunch)

Tests where `TextLMDataBunch` is used:

- `pytest -sv tests/test_text_data.py::test_from_csv_and_from_df`
- `pytest -sv tests/test_text_data.py::test_should_load_backwards_lm_1`
- `pytest -sv tests/test_text_data.py::test_should_load_backwards_lm_2`
### create

(**`train_ds`**, **`valid_ds`**, **`test_ds`**=***`None`***, **`path`**:`PathOrStr`=***`'.'`***, **`no_check`**:`bool`=***`False`***, **`bs`**=***`64`***, **`val_bs`**:`int`=***`None`***, **`num_workers`**:`int`=***`0`***, **`device`**:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=***`None`***, **`collate_fn`**:`Callable`=***`'data_collate'`***, **`dl_tfms`**:`Optional`\[`Collection`\[`Callable`\]\]=***`None`***, **`bptt`**:`int`=***`70`***, **`backwards`**:`bool`=***`False`***, **\*\*`dl_kwargs`**) → [`DataBunch`](/basic_data.html#DataBunch)
### class TextClasDataBunch

(**`train_dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), **`valid_dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), **`fix_dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)=***`None`***, **`test_dl`**:`Optional`\[[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)\]=***`None`***, **`device`**:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=***`None`***, **`dl_tfms`**:`Optional`\[`Collection`\[`Callable`\]\]=***`None`***, **`path`**:`PathOrStr`=***`'.'`***, **`collate_fn`**:`Callable`=***`'data_collate'`***, **`no_check`**:`bool`=***`False`***) :: [`TextDataBunch`](/text.data.html#TextDataBunch)

Tests where `TextClasDataBunch` is used:

- `pytest -sv tests/test_text_data.py::test_backwards_cls_databunch`
- `pytest -sv tests/test_text_data.py::test_from_csv_and_from_df`
- `pytest -sv tests/test_text_data.py::test_from_ids_works_for_equally_length_sentences`
- `pytest -sv tests/test_text_data.py::test_from_ids_works_for_variable_length_sentences`
- `pytest -sv tests/test_text_data.py::test_load_and_save_test`
### create

(**`train_ds`**, **`valid_ds`**, **`test_ds`**=***`None`***, **`path`**:`PathOrStr`=***`'.'`***, **`bs`**:`int`=***`32`***, **`val_bs`**:`int`=***`None`***, **`pad_idx`**=***`1`***, **`pad_first`**=***`True`***, **`device`**:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=***`None`***, **`no_check`**:`bool`=***`False`***, **`backwards`**:`bool`=***`False`***, **`dl_tfms`**:`Optional`\[`Collection`\[`Callable`\]\]=***`None`***, **\*\*`dl_kwargs`**) → [`DataBunch`](/basic_data.html#DataBunch)
### class TextDataBunch

(**`train_dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), **`valid_dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), **`fix_dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)=***`None`***, **`test_dl`**:`Optional`\[[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)\]=***`None`***, **`device`**:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=***`None`***, **`dl_tfms`**:`Optional`\[`Collection`\[`Callable`\]\]=***`None`***, **`path`**:`PathOrStr`=***`'.'`***, **`collate_fn`**:`Callable`=***`'data_collate'`***, **`no_check`**:`bool`=***`False`***) :: [`DataBunch`](/basic_data.html#DataBunch)
### from_folder

(**`path`**:`PathOrStr`, **`train`**:`str`=***`'train'`***, **`valid`**:`str`=***`'valid'`***, **`test`**:`Optional`\[`str`\]=***`None`***, **`classes`**:`ArgStar`=***`None`***, **`tokenizer`**:[`Tokenizer`](/text.transform.html#Tokenizer)=***`None`***, **`vocab`**:[`Vocab`](/text.transform.html#Vocab)=***`None`***, **`chunksize`**:`int`=***`10000`***, **`max_vocab`**:`int`=***`60000`***, **`min_freq`**:`int`=***`2`***, **`mark_fields`**:`bool`=***`False`***, **`include_bos`**:`bool`=***`True`***, **`include_eos`**:`bool`=***`False`***, **\*\*`kwargs`**)

This method expects texts in `train`, `valid` and maybe `test` folders. Text files in the `train` and `valid` folders should be placed in subdirectories according to their classes (not applicable for a language model). `tokenizer` will be used to parse those texts into tokens.

You can pass a specific `vocab` for the numericalization step (for instance if you are building a classifier from a fine-tuned language model). `kwargs` will be split between the [`TextDataset`](/text.data.html#TextDataset) function and the class initialization; there you can specify parameters such as `max_vocab`, `chunksize`, `min_freq`, `n_labels` (see the [`TextDataset`](/text.data.html#TextDataset) documentation) or `bs`, `bptt` and `pad_idx` (see the sections on LM data and classifier data).
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/markdown": [
"from_csv
[source][test]from_csv
(**`path`**:`PathOrStr`, **`csv_name`**, **`valid_pct`**:`float`=***`0.2`***, **`test`**:`Optional`\\[`str`\\]=***`None`***, **`tokenizer`**:[`Tokenizer`](/text.transform.html#Tokenizer)=***`None`***, **`vocab`**:[`Vocab`](/text.transform.html#Vocab)=***`None`***, **`classes`**:`StrList`=***`None`***, **`delimiter`**:`str`=***`None`***, **`header`**=***`'infer'`***, **`text_cols`**:`IntsOrStrs`=***`1`***, **`label_cols`**:`IntsOrStrs`=***`0`***, **`label_delim`**:`str`=***`None`***, **`chunksize`**:`int`=***`10000`***, **`max_vocab`**:`int`=***`60000`***, **`min_freq`**:`int`=***`2`***, **`mark_fields`**:`bool`=***`False`***, **`include_bos`**:`bool`=***`True`***, **`include_eos`**:`bool`=***`False`***, **\\*\\*`kwargs`**) → [`DataBunch`](/basic_data.html#DataBunch)\n",
"\n",
"from_df
[source][test]from_df
(**`path`**:`PathOrStr`, **`train_df`**:`DataFrame`, **`valid_df`**:`DataFrame`, **`test_df`**:`OptDataFrame`=***`None`***, **`tokenizer`**:[`Tokenizer`](/text.transform.html#Tokenizer)=***`None`***, **`vocab`**:[`Vocab`](/text.transform.html#Vocab)=***`None`***, **`classes`**:`StrList`=***`None`***, **`text_cols`**:`IntsOrStrs`=***`1`***, **`label_cols`**:`IntsOrStrs`=***`0`***, **`label_delim`**:`str`=***`None`***, **`chunksize`**:`int`=***`10000`***, **`max_vocab`**:`int`=***`60000`***, **`min_freq`**:`int`=***`2`***, **`mark_fields`**:`bool`=***`False`***, **`include_bos`**:`bool`=***`True`***, **`include_eos`**:`bool`=***`False`***, **\\*\\*`kwargs`**) → [`DataBunch`](/basic_data.html#DataBunch)\n",
"\n",
"Tests found for from_df
:
pytest -sv tests/test_text_data.py::test_from_csv_and_from_df
[source]Some other tests where from_df
is used:
pytest -sv tests/test_text_data.py::test_backwards_cls_databunch
[source]pytest -sv tests/test_text_data.py::test_load_and_save_test
[source]pytest -sv tests/test_text_data.py::test_regression
[source]pytest -sv tests/test_text_data.py::test_should_load_backwards_lm_1
[source]pytest -sv tests/test_text_data.py::test_should_load_backwards_lm_2
[source]To run tests please refer to this guide.
### from_tokens

(**`path`**:`PathOrStr`, **`trn_tok`**:`Tokens`, **`trn_lbls`**:`Collection`\[`Union`\[`int`, `float`\]\], **`val_tok`**:`Tokens`, **`val_lbls`**:`Collection`\[`Union`\[`int`, `float`\]\], **`vocab`**:[`Vocab`](/text.transform.html#Vocab)=***`None`***, **`tst_tok`**:`Tokens`=***`None`***, **`classes`**:`ArgStar`=***`None`***, **`max_vocab`**:`int`=***`60000`***, **`min_freq`**:`int`=***`3`***, **\*\*`kwargs`**) → [`DataBunch`](/basic_data.html#DataBunch)
### from_ids

(**`path`**:`PathOrStr`, **`vocab`**:[`Vocab`](/text.transform.html#Vocab), **`train_ids`**:`Collection`\[`Collection`\[`int`\]\], **`valid_ids`**:`Collection`\[`Collection`\[`int`\]\], **`test_ids`**:`Collection`\[`Collection`\[`int`\]\]=***`None`***, **`train_lbls`**:`Collection`\[`Union`\[`int`, `float`\]\]=***`None`***, **`valid_lbls`**:`Collection`\[`Union`\[`int`, `float`\]\]=***`None`***, **`classes`**:`ArgStar`=***`None`***, **`processor`**:[`PreProcessor`](/data_block.html#PreProcessor)=***`None`***, **\*\*`kwargs`**) → [`DataBunch`](/basic_data.html#DataBunch)
"load
[source][test]load
(**`path`**:`PathOrStr`, **`cache_name`**:`PathOrStr`=***`'tmp'`***, **`processor`**:[`PreProcessor`](/data_block.html#PreProcessor)=***`None`***, **\\*\\*`kwargs`**)\n",
"\n",
"No tests found for load
. To contribute a test please refer to this guide and this discussion.
\n", " | label | \n", "text | \n", "is_valid | \n", "
---|---|---|---|
0 | \n", "negative | \n", "Un-bleeping-believable! Meg Ryan doesn't even ... | \n", "False | \n", "
1 | \n", "positive | \n", "This is a extremely well-made film. The acting... | \n", "False | \n", "
2 | \n", "negative | \n", "Every once in a long while a movie will come a... | \n", "False | \n", "
3 | \n", "positive | \n", "Name just says it all. I watched this movie wi... | \n", "False | \n", "
4 | \n", "negative | \n", "This movie succeeds at being one of the most u... | \n", "False | \n", "
### class Text

(**`ids`**, **`text`**) :: [`ItemBase`](/core.html#ItemBase)

Basic item for `text` data in numericalized `ids`.
],
"text/plain": [
"class
TextList
[source][test]TextList
(**`items`**:`Iterator`\\[`T_co`\\], **`vocab`**:[`Vocab`](/text.transform.html#Vocab)=***`None`***, **`pad_idx`**:`int`=***`1`***, **\\*\\*`kwargs`**) :: [`ItemList`](/data_block.html#ItemList)\n",
"\n",
"label_for_lm
[source][test]label_for_lm
(**\\*\\*`kwargs`**)\n",
"\n",
"No tests found for label_for_lm
. To contribute a test please refer to this guide and this discussion.
### from_folder

(**`path`**:`PathOrStr`=***`'.'`***, **`extensions`**:`StrList`=***`{'.txt'}`***, **`vocab`**:[`Vocab`](/text.transform.html#Vocab)=***`None`***, **`processor`**:[`PreProcessor`](/data_block.html#PreProcessor)=***`None`***, **\*\*`kwargs`**) → `TextList`

### show_xys

(**`xs`**, **`ys`**, **`max_len`**:`int`=***`70`***)

### show_xyzs

(**`xs`**, **`ys`**, **`zs`**, **`max_len`**:`int`=***`70`***)
### class OpenFileProcessor

(**`ds`**:`Collection`\[`T_co`\]=***`None`***) :: [`PreProcessor`](/data_block.html#PreProcessor)

### open_text

(**`fn`**:`PathOrStr`, **`enc`**=***`'utf-8'`***)
### class TokenizeProcessor

(**`ds`**:[`ItemList`](/data_block.html#ItemList)=***`None`***, **`tokenizer`**:[`Tokenizer`](/text.transform.html#Tokenizer)=***`None`***, **`chunksize`**:`int`=***`10000`***, **`mark_fields`**:`bool`=***`False`***, **`include_bos`**:`bool`=***`True`***, **`include_eos`**:`bool`=***`False`***) :: [`PreProcessor`](/data_block.html#PreProcessor)
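The effect of `include_bos` and `include_eos` can be sketched in plain Python; this is not the fastai implementation, just the behaviour of the two flags, which add the special `xxbos`/`xxeos` markers around each text:

```python
# Sketch: wrap a tokenized text with beginning/end-of-stream markers,
# as the include_bos / include_eos flags request.
def add_special(tokens, include_bos=True, include_eos=False):
    out = list(tokens)
    if include_bos:
        out = ["xxbos"] + out
    if include_eos:
        out = out + ["xxeos"]
    return out

print(add_special(["great", "movie"]))                    # ['xxbos', 'great', 'movie']
print(add_special(["great", "movie"], include_eos=True))  # ['xxbos', 'great', 'movie', 'xxeos']
```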
### class NumericalizeProcessor

(**`ds`**:[`ItemList`](/data_block.html#ItemList)=***`None`***, **`vocab`**:[`Vocab`](/text.transform.html#Vocab)=***`None`***, **`max_vocab`**:`int`=***`60000`***, **`min_freq`**:`int`=***`3`***) :: [`PreProcessor`](/data_block.html#PreProcessor)
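The vocabulary-building rule behind `max_vocab` and `min_freq` can be sketched as follows (a sketch of the idea, not the actual `NumericalizeProcessor` code): keep at most `max_vocab` tokens, each of which appears at least `min_freq` times.

```python
from collections import Counter

# Sketch: build a vocabulary from tokenized texts, capped at max_vocab
# entries and dropping tokens rarer than min_freq.
def build_vocab(texts, max_vocab=60000, min_freq=3):
    counts = Counter(tok for t in texts for tok in t)
    return [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]

toks = [["a", "b", "a", "c"], ["a", "b", "c"], ["a", "d"]]
print(build_vocab(toks, min_freq=2))  # ['a', 'b', 'c']  ('d' appears only once)
```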
### class SPProcessor

(**`ds`**:[`ItemList`](/data_block.html#ItemList)=***`None`***, **`pre_rules`**:`ListRules`=***`None`***, **`post_rules`**:`ListRules`=***`None`***, **`vocab_sz`**:`int`=***`None`***, **`max_vocab_sz`**:`int`=***`30000`***, **`model_type`**:`str`=***`'unigram'`***, **`max_sentence_len`**:`int`=***`20480`***, **`lang`**=***`'en'`***, **`char_coverage`**=***`None`***, **`tmp_dir`**=***`'tmp'`***, **`mark_fields`**:`bool`=***`False`***, **`include_bos`**:`bool`=***`True`***, **`include_eos`**:`bool`=***`False`***, **`sp_model`**=***`None`***, **`sp_vocab`**=***`None`***) :: [`PreProcessor`](/data_block.html#PreProcessor)
\n", " | 0 | \n", "1 | \n", "2 | \n", "3 | \n", "4 | \n", "5 | \n", "6 | \n", "7 | \n", "8 | \n", "9 | \n", "10 | \n", "11 | \n", "12 | \n", "13 | \n", "14 | \n", "15 | \n", "16 | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "crew | \n", "that | \n", "he | \n", "can | \n", "trust | \n", "to | \n", "help | \n", "him | \n", "pull | \n", "it | \n", "off | \n", "and | \n", "get | \n", "his | \n", "xxunk | \n", "None | \n", "None | \n", "
1 | \n", "want | \n", "a | \n", "good | \n", "family | \n", "movie | \n", ", | \n", "this | \n", "might | \n", "do | \n", ". | \n", "xxmaj | \n", "it | \n", "is | \n", "clean | \n", ". | \n", "None | \n", "None | \n", "
2 | \n", "director | \n", "of | \n", "many | \n", "bad | \n", "xxunk | \n", ") | \n", "tries | \n", "to | \n", "cover | \n", "the | \n", "info | \n", "up | \n", ", | \n", "but | \n", "goo | \n", "None | \n", "None | \n", "
3 | \n", "film | \n", ", | \n", "and | \n", "the | \n", "xxunk | \n", "xxunk | \n", "of | \n", "the | \n", "villain | \n", ", | \n", "humorous | \n", "or | \n", "not | \n", ", | \n", "are | \n", "None | \n", "None | \n", "
4 | \n", "cole | \n", "in | \n", "the | \n", "beginning | \n", "are | \n", "meant | \n", "to | \n", "draw | \n", "comparisons | \n", "which | \n", "leave | \n", "the | \n", "audience | \n", "xxunk | \n", ". | \n", "None | \n", "None | \n", "
5 | \n", "witness | \n", "xxmaj | \n", "brian | \n", "dealing | \n", "with | \n", "his | \n", "situation | \n", "through | \n", "first | \n", ", | \n", "primitive | \n", "means | \n", ", | \n", "and | \n", "then | \n", "None | \n", "None | \n", "
6 | \n", "film | \n", ", | \n", "or | \n", "not | \n", ". | \n", "\\n | \n", "\\n | \n", "\n", " | xxmaj | \n", "this | \n", "film | \n", ". | \n", "xxmaj | \n", "film | \n", "? | \n", "xxmaj | \n", "this | \n", "
7 | \n", "xxunk | \n", "sitting | \n", "through | \n", "this | \n", "bomb | \n", ". | \n", "xxmaj | \n", "the | \n", "crew | \n", "member | \n", "who | \n", "was | \n", "in | \n", "charge | \n", "of | \n", "None | \n", "None | \n", "
8 | \n", "this | \n", "film | \n", "is | \n", "viewed | \n", "as | \n", "non | \n", "xxup | \n", "xxunk | \n", "but | \n", "there | \n", "is | \n", "a | \n", "speech | \n", "by | \n", "xxmaj | \n", "None | \n", "None | \n", "
9 | \n", "mention | \n", "the | \n", "pace | \n", "of | \n", "the | \n", "movie | \n", ". | \n", "xxmaj | \n", "to | \n", "my | \n", "mind | \n", ", | \n", "this | \n", "new | \n", "version | \n", "None | \n", "None | \n", "
10 | \n", "of | \n", "yours | \n", "! | \n", "' | \n", "\\n | \n", "\\n | \n", "\n", " | xxmaj | \n", "director | \n", "xxmaj | \n", "xxunk | \n", "xxmaj | \n", "xxunk | \n", ", | \n", "who | \n", "is | \n", "xxunk | \n", "
11 | \n", "pair | \n", ", | \n", "xxmaj | \n", "harry | \n", "xxmaj | \n", "michell | \n", "as | \n", "xxmaj | \n", "harry | \n", ", | \n", "xxmaj | \n", "rosie | \n", "xxmaj | \n", "michell | \n", "as | \n", "None | \n", "None | \n", "
12 | \n", "cares | \n", "who | \n", "lives | \n", "and | \n", "who | \n", "dies | \n", ", | \n", "i | \n", "'ll | \n", "be | \n", "shocked | \n", ". | \n", "xxmaj | \n", "the | \n", "same | \n", "None | \n", "None | \n", "
13 | \n", "is | \n", "incredibly | \n", "stupid | \n", ", | \n", "with | \n", "a | \n", "detective | \n", "trying | \n", "to | \n", "track | \n", "down | \n", "a | \n", "suspected | \n", "serial | \n", "killer | \n", "None | \n", "None | \n", "
14 | \n", "independent | \n", "film | \n", "was | \n", "one | \n", "of | \n", "the | \n", "best | \n", "films | \n", "at | \n", "the | \n", "tall | \n", "grass | \n", "film | \n", "festival | \n", "that | \n", "None | \n", "None | \n", "
### class LanguageModelPreLoader

(**`dataset`**:[`LabelList`](/data_block.html#LabelList), **`lengths`**:`Collection`\[`int`\]=***`None`***, **`bs`**:`int`=***`32`***, **`bptt`**:`int`=***`70`***, **`backwards`**:`bool`=***`False`***, **`shuffle`**:`bool`=***`False`***) :: [`Callback`](/callback.html#Callback)
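The batch layout a language-model loader produces can be sketched as follows (a sketch of the idea, not the `LanguageModelPreLoader` implementation, which also handles shuffling and the `backwards` option): all texts are concatenated into one stream of ids, the stream is cut into `bs` rows, and each mini-batch reads `bptt` tokens per row, with the target shifted by one token.

```python
# A toy stream of 20 token ids laid out for bs=2 rows, read bptt=4 at a time.
stream = list(range(20))
bs, bptt = 2, 4

n = len(stream) // bs                       # tokens per row
rows = [stream[i * n:(i + 1) * n] for i in range(bs)]

x = [row[0:bptt] for row in rows]           # first mini-batch of inputs
y = [row[1:bptt + 1] for row in rows]       # targets: inputs shifted by one
print(x)  # [[0, 1, 2, 3], [10, 11, 12, 13]]
print(y)  # [[1, 2, 3, 4], [11, 12, 13, 14]]
```

The next mini-batch continues where this one stopped, so the hidden state of the model can be carried across batches.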
### class SortSampler

(**`data_source`**:`NPArrayList`, **`key`**:`KeyFunc`) :: [`Sampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler)

### class SortishSampler

(**`data_source`**:`NPArrayList`, **`key`**:`KeyFunc`, **`bs`**:`int`) :: [`Sampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler)
### pad_collate

(**`samples`**:`BatchSamples`, **`pad_idx`**:`int`=***`1`***, **`pad_first`**:`bool`=***`True`***, **`backwards`**:`bool`=***`False`***) → `Tuple`\[`LongTensor`, `LongTensor`\]
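A pure-Python sketch of what `pad_collate` does to a batch (the real function works on samples with labels and returns `LongTensor`s): every sequence is padded to the length of the longest one with `pad_idx`, either at the start (`pad_first=True`) or at the end.

```python
# Sketch: pad all sequences in a batch to equal length with pad_idx.
def pad_batch(samples, pad_idx=1, pad_first=True):
    max_len = max(len(s) for s in samples)
    out = []
    for s in samples:
        pad = [pad_idx] * (max_len - len(s))
        out.append(pad + list(s) if pad_first else list(s) + pad)
    return out

print(pad_batch([[2, 3, 4], [5]]))                   # [[2, 3, 4], [1, 1, 5]]
print(pad_batch([[2, 3, 4], [5]], pad_first=False))  # [[2, 3, 4], [5, 1, 1]]
```

Padding first is the default for classifiers so that the last tokens fed to the model are real text rather than padding.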
### new

(**`items`**:`Iterator`\[`T_co`\], **`processor`**:`Union`\[[`PreProcessor`](/data_block.html#PreProcessor), `Collection`\[[`PreProcessor`](/data_block.html#PreProcessor)\]\]=***`None`***, **\*\*`kwargs`**) → `ItemList`

### get

(**`i`**)

### process_one

(**`item`**)

### process

(**`ds`**)

### reconstruct

(**`t`**:`Tensor`)

### on_epoch_begin

(**\*\*`kwargs`**)

### on_epoch_end

(**\*\*`kwargs`**)

### class LMLabelList

(**`items`**:`Iterator`\[`T_co`\], **\*\*`kwargs`**) :: [`EmptyLabelList`](/data_block.html#EmptyLabelList)

### allocate_buffers

()

### shuffle

()

### fill_row

(**`forward`**, **`items`**, **`idx`**, **`row`**, **`ro`**, **`ri`**, **`overlap`**, **`lengths`**)