{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NLP datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "hide_input": true
   },
   "outputs": [],
   "source": [
    "from fastai.gen_doc.nbdoc import *\n",
    "from fastai.text import * \n",
    "from fastai import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This module contains the [`TextDataset`](/text.data.html#TextDataset) class, which is the main dataset you should use for your NLP tasks. It automatically does the preprocessing steps described in [`text.transform`](/text.transform.html#text.transform). It also contains all the functions to quickly get a [`TextDataBunch`](/text.data.html#TextDataBunch) ready."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quickly assemble your data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the most of the fastai library, get your data into one of the following formats and use the matching factory method of one of the [`TextDataBunch`](/text.data.html#TextDataBunch) classes:\n",
    "- raw text files in folders train, valid, test in an ImageNet style,\n",
    "- a csv where some column(s) give the label(s) and the following one the associated text,\n",
    "- a dataframe structured the same way,\n",
    "- tokens and labels arrays,\n",
    "- ids, vocabulary (correspondence id to word) and labels.\n",
    "\n",
    "If you are assembling the data for a language model, set every label to 0 so that the data still fits those formats. The first time you create a [`DataBunch`](/basic_data.html#DataBunch) with one of those functions, your data will be preprocessed automatically. You can save it, so that loading it the next time is almost instantaneous.\n",
    "\n",
    "Below are the classes that help assemble the raw data into a [`DataBunch`](/basic_data.html#DataBunch) suitable for NLP."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h3 id=\"TextLMDataBunch\"><code>class</code> <code>TextLMDataBunch</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L359\" class=\"source_link\">[source]</a></h3>\n",
       "\n",
       "> <code>TextLMDataBunch</code>(`train_dl`:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), `valid_dl`:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), `test_dl`:`Optional`\\[[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)\\]=`None`, `device`:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=`None`, `tfms`:`Optional`\\[`Collection`\\[`Callable`\\]\\]=`None`, `path`:`PathOrStr`=`'.'`, `collate_fn`:`Callable`=`'data_collate'`) :: [`TextDataBunch`](/text.data.html#TextDataBunch)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextLMDataBunch, title_level=3, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a [`DataBunch`](/basic_data.html#DataBunch) suitable for language modeling: all the texts in the [`datasets`](/datasets.html#datasets) are concatenated and the labels are ignored. Instead, the target is the next word in the sentence."
   ]
  },
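  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of that idea, here is a minimal pure-Python sketch. `next_token_pairs` is a hypothetical helper written for this page, not a fastai function, and it skips batching and `bptt` entirely.\n",
    "\n",
    "```python\n",
    "# Hypothetical helper, not part of fastai: builds language-model inputs and targets.\n",
    "def next_token_pairs(texts):\n",
    "    # Concatenate every tokenized text into one stream; labels play no role here.\n",
    "    stream = [tok for text in texts for tok in text]\n",
    "    # Inputs are all tokens but the last; targets are the same stream shifted by one.\n",
    "    return stream[:-1], stream[1:]\n",
    "\n",
    "inputs, targets = next_token_pairs([['the', 'cat', 'sat'], ['on', 'the', 'mat']])\n",
    "```\n",
    "\n",
    "Each element of `targets` is the token that follows the matching element of `inputs` in the concatenated stream."
   ]
  },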
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TextLMDataBunch.show_batch\"><code>show_batch</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L369\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>show_batch</code>(`sep`=`' '`, `ds_type`:[`DatasetType`](/basic_data.html#DatasetType)=`<DatasetType.Train: 1>`, `rows`:`int`=`10`, `max_len`:`int`=`100`)\n",
       "\n",
       "Show `rows` texts from a batch of `ds_type`, tokens are joined with `sep`, truncated at `max_len`.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextLMDataBunch.show_batch)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h3 id=\"TextClasDataBunch\"><code>class</code> <code>TextClasDataBunch</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L380\" class=\"source_link\">[source]</a></h3>\n",
       "\n",
       "> <code>TextClasDataBunch</code>(`train_dl`:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), `valid_dl`:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), `test_dl`:`Optional`\\[[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)\\]=`None`, `device`:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=`None`, `tfms`:`Optional`\\[`Collection`\\[`Callable`\\]\\]=`None`, `path`:`PathOrStr`=`'.'`, `collate_fn`:`Callable`=`'data_collate'`) :: [`TextDataBunch`](/text.data.html#TextDataBunch)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextClasDataBunch, title_level=3, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a [`DataBunch`](/basic_data.html#DataBunch) suitable for a text classifier: all the texts are grouped by length (with a bit of randomness for the training set) then padded."
   ]
  },
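  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The grouping and padding can be pictured with a small pure-Python sketch. `pad_by_length`, the default `pad_idx=1` and the choice to pad on the right are all assumptions made for illustration, and the randomness applied to the training set is left out; this is not the fastai implementation.\n",
    "\n",
    "```python\n",
    "# Hypothetical helper, not part of fastai: sorts texts by length, then pads them.\n",
    "def pad_by_length(texts, pad_idx=1):\n",
    "    # Sort from longest to shortest so that neighbouring texts have similar lengths.\n",
    "    ordered = sorted(texts, key=len, reverse=True)\n",
    "    max_len = max(map(len, ordered), default=0)\n",
    "    # Pad on the right (an assumption) so that every row has the same length.\n",
    "    return [ids + [pad_idx] * (max_len - len(ids)) for ids in ordered]\n",
    "\n",
    "batch = pad_by_length([[5, 6], [7, 8, 9, 10], [11]])\n",
    "```\n",
    "\n",
    "Grouping texts of similar length keeps the amount of padding in each batch small."
   ]
  },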
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TextClasDataBunch.show_batch\"><code>show_batch</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L397\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>show_batch</code>(`sep`=`' '`, `ds_type`:[`DatasetType`](/basic_data.html#DatasetType)=`<DatasetType.Train: 1>`, `rows`:`int`=`10`, `max_len`:`int`=`100`)\n",
       "\n",
       "Show `rows` texts from a batch of `ds_type`, tokens are joined with `sep`, truncated at `max_len`.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextClasDataBunch.show_batch)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h3 id=\"TextDataBunch\"><code>class</code> <code>TextDataBunch</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L256\" class=\"source_link\">[source]</a></h3>\n",
       "\n",
       "> <code>TextDataBunch</code>(`train_dl`:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), `valid_dl`:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), `test_dl`:`Optional`\\[[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)\\]=`None`, `device`:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=`None`, `tfms`:`Optional`\\[`Collection`\\[`Callable`\\]\\]=`None`, `path`:`PathOrStr`=`'.'`, `collate_fn`:`Callable`=`'data_collate'`) :: [`DataBunch`](/basic_data.html#DataBunch)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextDataBunch, title_level=3, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a [`DataBunch`](/basic_data.html#DataBunch) with the raw texts. This only works if they all have the same length."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Factory methods (TextDataBunch)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All those classes have the following factory methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TextDataBunch.from_folder\"><code>from_folder</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L330\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>from_folder</code>(`path`:`PathOrStr`, `train`:`str`=`'train'`, `valid`:`str`=`'valid'`, `test`:`Optional`\\[`str`\\]=`None`, `tokenizer`:[`Tokenizer`](/text.transform.html#Tokenizer)=`None`, `vocab`:[`Vocab`](/text.transform.html#Vocab)=`None`, `kwargs`)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextDataBunch.from_folder, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This function will create a [`DataBunch`](/basic_data.html#DataBunch) from texts placed in `path` in [`train`](/train.html#train), `valid` and optionally `test` folders. Text files in the [`train`](/train.html#train) and `valid` folders should be placed in subdirectories according to their classes (always the same one for a language model), while the ones for the `test` folder should all be placed there directly. `tokenizer` will be used to parse those texts into tokens. The `shuffle` flag will optionally shuffle the texts found.\n",
    "\n",
    "You can pass a specific `vocab` for the numericalization step (for instance, if you are building a classifier from a language model you fine-tuned). `kwargs` are split between the [`TextDataset`](/text.data.html#TextDataset) function and the class initialization; there you can specify parameters such as `max_vocab`, `chunksize`, `min_freq`, `n_labels` (see the [`TextDataset`](/text.data.html#TextDataset) documentation) or `bs`, `bptt` and `pad_idx` (see the sections LM data and classifier data)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TextDataBunch.from_csv\"><code>from_csv</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L319\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>from_csv</code>(`path`:`PathOrStr`, `csv_name`, `valid_pct`:`float`=`0.2`, `test`:`Optional`\\[`str`\\]=`None`, `tokenizer`:[`Tokenizer`](/text.transform.html#Tokenizer)=`None`, `vocab`:[`Vocab`](/text.transform.html#Vocab)=`None`, `classes`:`StrList`=`None`, `header`=`'infer'`, `kwargs`) → [`DataBunch`](/basic_data.html#DataBunch)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextDataBunch.from_csv, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This function will create a [`DataBunch`](/basic_data.html#DataBunch) from texts in a csv file placed in `path`, and optionally a `test` csv file, both opened with `header`. You can specify `txt_cols` and `label_cols`, or just an integer `n_labels`, in which case the label(s) should be the first column(s). `tokenizer` will be used to parse those texts into tokens.\n",
    "\n",
    "You can pass a specific `vocab` for the numericalization step (for instance, if you are building a classifier from a language model you fine-tuned). `kwargs` are split between the [`TextDataset`](/text.data.html#TextDataset) function and the class initialization; there you can specify parameters such as `max_vocab`, `chunksize`, `min_freq`, `n_labels` (see the [`TextDataset`](/text.data.html#TextDataset) documentation) or `bs`, `bptt` and `pad_idx` (see the sections LM data and classifier data)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TextDataBunch.from_df\"><code>from_df</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L304\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>from_df</code>(`path`:`PathOrStr`, `train_df`:`DataFrame`, `valid_df`:`DataFrame`, `test_df`:`OptDataFrame`=`None`, `tokenizer`:[`Tokenizer`](/text.transform.html#Tokenizer)=`None`, `vocab`:[`Vocab`](/text.transform.html#Vocab)=`None`, `classes`:`StrList`=`None`, `kwargs`) → [`DataBunch`](/basic_data.html#DataBunch)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextDataBunch.from_df, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This function will create a [`DataBunch`](/basic_data.html#DataBunch) in `path` from texts in `train_df`, `valid_df` and optionally `test_df`. By default, those are opened with `header=infer` but you can specify another value in the kwargs. You can specify `txt_cols` and `label_cols`, or just an integer `n_labels`, in which case the label(s) should be the first column(s). `tokenizer` will be used to parse those texts into tokens.\n",
    "\n",
    "You can pass a specific `vocab` for the numericalization step (for instance, if you are building a classifier from a language model you fine-tuned). `kwargs` are split between the [`TextDataset`](/text.data.html#TextDataset) function and the class initialization; there you can specify parameters such as `max_vocab`, `chunksize`, `min_freq`, `n_labels` (see the [`TextDataset`](/text.data.html#TextDataset) documentation) or `bs`, `bptt` and `pad_idx` (see the sections LM data and classifier data)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "hide_input": true,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TextDataBunch.from_tokens\"><code>from_tokens</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L293\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>from_tokens</code>(`path`:`PathOrStr`, `trn_tok`:`Tokens`, `trn_lbls`:`Collection`\\[`Union`\\[`int`, `float`\\]\\], `val_tok`:`Tokens`, `val_lbls`:`Collection`\\[`Union`\\[`int`, `float`\\]\\], `vocab`:[`Vocab`](/text.transform.html#Vocab)=`None`, `tst_tok`:`Tokens`=`None`, `classes`:`ArgStar`=`None`, `kwargs`) → [`DataBunch`](/basic_data.html#DataBunch)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextDataBunch.from_tokens, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This function will create a [`DataBunch`](/basic_data.html#DataBunch) from `trn_tok`, `trn_lbls`, `val_tok`, `val_lbls` and optionally `tst_tok`.\n",
    "\n",
    "You can pass a specific `vocab` for the numericalization step (for instance, if you are building a classifier from a language model you fine-tuned). `kwargs` are split between the [`TextDataset`](/text.data.html#TextDataset) function and the class initialization; there you can specify parameters such as `max_vocab`, `chunksize`, `min_freq`, `n_labels`, `tok_suff` and `lbl_suff` (see the [`TextDataset`](/text.data.html#TextDataset) documentation) or `bs`, `bptt` and `pad_idx` (see the sections LM data and classifier data)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TextDataBunch.from_ids\"><code>from_ids</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L272\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>from_ids</code>(`path`:`PathOrStr`, `vocab`:[`Vocab`](/text.transform.html#Vocab), `trn_ids`:`Collection`\\[`Collection`\\[`int`\\]\\], `val_ids`:`Collection`\\[`Collection`\\[`int`\\]\\], `tst_ids`:`Collection`\\[`Collection`\\[`int`\\]\\]=`None`, `trn_lbls`:`Collection`\\[`Union`\\[`int`, `float`\\]\\]=`None`, `val_lbls`:`Collection`\\[`Union`\\[`int`, `float`\\]\\]=`None`, `classes`:`ArgStar`=`None`, `kwargs`) → [`DataBunch`](/basic_data.html#DataBunch)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextDataBunch.from_ids, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This function will create a [`DataBunch`](/basic_data.html#DataBunch) in `path` from texts already processed into `trn_ids`, `trn_lbls`, `val_ids`, `val_lbls` and optionally `tst_ids`. You can specify the corresponding `classes` if applicable. You must specify the `vocab` so that the [`RNNLearner`](/text.learner.html#RNNLearner) class can later infer the corresponding sizes in the model it will create. `kwargs` will be passed to the class initialization."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load and save"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To avoid losing time preprocessing the text data more than once, you should save and load your [`TextDataBunch`](/text.data.html#TextDataBunch) using these methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TextDataBunch.load\"><code>load</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L282\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>load</code>(`path`:`PathOrStr`, `cache_name`:`PathOrStr`=`'tmp'`, `kwargs`)\n",
       "\n",
       "Load a [`TextDataBunch`](/text.data.html#TextDataBunch) from `path/cache_name`. `kwargs` are passed to the dataloader creation.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextDataBunch.load)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TextDataBunch.save\"><code>save</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L260\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>save</code>(`cache_name`:`PathOrStr`=`'tmp'`)\n",
       "\n",
       "Save the [`DataBunch`](/basic_data.html#DataBunch) in `self.path/cache_name` folder.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextDataBunch.save)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "### Example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Untar the IMDB sample dataset if not already done:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PosixPath('/home/ubuntu/.fastai/data/imdb_sample')"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "path = untar_data(URLs.IMDB_SAMPLE)\n",
    "path"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Since it comes in the form of csv files, we will use the corresponding `from_csv` factory method. Here is an overview of what your file should look like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>text</th>\n",
       "      <th>is_valid</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>negative</td>\n",
       "      <td>Un-bleeping-believable! Meg Ryan doesn't even ...</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>positive</td>\n",
       "      <td>This is a extremely well-made film. The acting...</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>negative</td>\n",
       "      <td>Every once in a long while a movie will come a...</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>positive</td>\n",
       "      <td>Name just says it all. I watched this movie wi...</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>negative</td>\n",
       "      <td>This movie succeeds at being one of the most u...</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      label                                               text  is_valid\n",
       "0  negative  Un-bleeping-believable! Meg Ryan doesn't even ...     False\n",
       "1  positive  This is a extremely well-made film. The acting...     False\n",
       "2  negative  Every once in a long while a movie will come a...     False\n",
       "3  positive  Name just says it all. I watched this movie wi...     False\n",
       "4  negative  This movie succeeds at being one of the most u...     False"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv(path/'texts.csv').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "And here is a simple way of creating your [`DataBunch`](/basic_data.html#DataBunch) for language modelling or classification."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "data_lm = TextLMDataBunch.from_csv(Path(path), 'texts.csv')\n",
    "data_clas = TextClasDataBunch.from_csv(Path(path), 'texts.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The TextBase dataset classes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Behind the scenes, the previous functions will create a training, validation and optionally test [`TextDataset`](/text.data.html#TextDataset), which will then be transformed into a [`TokenizedDataset`](/text.data.html#TokenizedDataset) and then a [`NumericalizedDataset`](/text.data.html#NumericalizedDataset). Those are all subclasses of [`TextBase`](/text.data.html#TextBase)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h3 id=\"TextBase\"><code>class</code> <code>TextBase</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L39\" class=\"source_link\">[source]</a></h3>\n",
       "\n",
       "> <code>TextBase</code>(`x`:`ArgStar`, `labels`:`Collection`\\[`Union`\\[`int`, `float`\\]\\]=`None`, `classes`:`ArgStar`=`None`, `encode_classes`:`bool`=`True`) :: [`LabelDataset`](/basic_data.html#LabelDataset)\n",
       "\n",
       "Base class for fastai datasets that do classification, mapped according to `classes`.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextBase, title_level=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`x` is an array representing the inputs (filenames, texts, tokens or ids) with certain `labels` (defaulting to all zeros if not specified). `classes` can be passed and, if `encode_classes` is set, the `labels` are changed from their class to the corresponding index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h3 id=\"TextDataset\"><code>class</code> <code>TextDataset</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L106\" class=\"source_link\">[source]</a></h3>\n",
       "\n",
       "> <code>TextDataset</code>(`texts`:`StrList`, `labels`:`ArgStar`=`None`, `classes`:`ArgStar`=`None`, `mark_fields`:`bool`=`True`, `encode_classes`:`bool`=`True`, `is_fnames`:`bool`=`False`) :: [`TextBase`](/text.data.html#TextBase)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextDataset, doc_string=False, title_level=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a [`TextBase`](/text.data.html#TextBase) dataset of `texts` with `labels` belonging to `classes`. The `texts` are joined in the column dimension and, if `mark_fields` is set, field markers are added in between. If `encode_classes` is set, the `labels` are changed from their class to the corresponding index. If `is_fnames` is set, the filenames in `texts` are read to pull the texts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "hide_input": true,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TextDataset.from_folder\"><code>from_folder</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L140\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>from_folder</code>(`path`:`PathOrStr`, `classes`:`ArgStar`=`None`, `valid_pct`:`float`=`0.0`, `extensions`:`StrList`=`['.txt']`, `mark_fields`:`bool`=`True`) → `TextDataset`"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextDataset.from_folder, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a [`TextDataset`](/text.data.html#TextDataset) by scanning the subfolders in `path` for files with `extensions`. Only keep the ones with labels in `classes` if it's specified. If `valid_pct` is not 0., returns two datasets randomly split. `mark_fields` is passed to the initialization. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "hide_input": true,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TextDataset.from_one_folder\"><code>from_one_folder</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L157\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>from_one_folder</code>(`path`:`PathOrStr`, `classes`:`ArgStar`, `extensions`:`StrList`=`['.txt']`, `mark_fields`:`bool`=`True`) → `TextDataset`"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextDataset.from_one_folder, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Primarily used for the test set. Create a [`TextDataset`](/text.data.html#TextDataset) by scanning the subfolders in `path` for files with `extensions`. Labels all of them with `classes[0]`. `mark_fields` is passed to the initialization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "hide_input": true,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TextDataset.from_df\"><code>from_df</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L117\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>from_df</code>(`df`:`DataFrame`, `classes`:`ArgStar`=`None`, `n_labels`:`int`=`1`, `txt_cols`:`Collection`\\[`Union`\\[`int`, `str`\\]\\]=`None`, `label_cols`:`Collection`\\[`Union`\\[`int`, `str`\\]\\]=`None`, `mark_fields`:`bool`=`True`) → `TextDataset`\n",
       "\n",
       "Create a [`TextDataset`](/text.data.html#TextDataset) from the texts in a dataframe  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextDataset.from_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TextDataset.tokenize\"><code>tokenize</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L166\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>tokenize</code>(`tokenizer`:[`Tokenizer`](/text.transform.html#Tokenizer)=`None`, `chunksize`:`int`=`10000`) → `TokenizedDataset`\n",
       "\n",
       "Tokenize the texts with `tokenizer` by bits of `chunksize`.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextDataset.tokenize)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h3 id=\"TokenizedDataset\"><code>class</code> <code>TokenizedDataset</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L79\" class=\"source_link\">[source]</a></h3>\n",
       "\n",
       "> <code>TokenizedDataset</code>(`tokens`:`Tokens`, `labels`:`Collection`\\[`Union`\\[`int`, `float`\\]\\]=`None`, `classes`:`ArgStar`=`None`, `encode_classes`:`bool`=`True`) :: [`TextBase`](/text.data.html#TextBase)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TokenizedDataset, doc_string=False, title_level=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a [`TextBase`](/text.data.html#TextBase) dataset of `tokens` with `labels` belonging to `classes`. If `encode_classes` the `labels` are changed from their class to the corresponding index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TokenizedDataset.save\"><code>save</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L85\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>save</code>(`path`:`Path`, `name`:`str`)\n",
       "\n",
       "Save the dataset in `path` with `name`.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TokenizedDataset.save)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TokenizedDataset.numericalize\"><code>numericalize</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L92\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>numericalize</code>(`vocab`:[`Vocab`](/text.transform.html#Vocab)=`None`, `max_vocab`:`int`=`60000`, `min_freq`:`int`=`2`) → `NumericalizedDataset`\n",
       "\n",
       "Numericalize the tokens with `vocab` (if not None) otherwise create one with `max_vocab` and `min_freq` from tokens.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TokenizedDataset.numericalize)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h3 id=\"NumericalizedDataset\"><code>class</code> <code>NumericalizedDataset</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L50\" class=\"source_link\">[source]</a></h3>\n",
       "\n",
       "> <code>NumericalizedDataset</code>(`vocab`:[`Vocab`](/text.transform.html#Vocab), `ids`:`Collection`\\[`Collection`\\[`int`\\]\\], `labels`:`Collection`\\[`Union`\\[`int`, `float`\\]\\]=`None`, `classes`:`ArgStar`=`None`, `encode_classes`:`bool`=`True`) :: [`TextBase`](/text.data.html#TextBase)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(NumericalizedDataset, doc_string=False, title_level=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a [`TextBase`](/text.data.html#TextBase) dataset of `ids` with `labels` belonging to `classes`. `vocab` contains the correspondance between ids an tokens. If `encode_classes` the `labels` are changed from their class to the corresponding index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"NumericalizedDataset.get_text_item\"><code>get_text_item</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L58\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>get_text_item</code>(`idx`, `sep`=`' '`, `max_len`:`int`=`None`)\n",
       "\n",
       "Return the text in `idx`, tokens separated by `sep` and cutting at `max_len`.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(NumericalizedDataset.get_text_item)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "hide_input": true,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"NumericalizedDataset.save\"><code>save</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L63\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>save</code>(`path`:`Path`, `name`:`str`)\n",
       "\n",
       "Save the dataset in `path` with `name`.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(NumericalizedDataset.save)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"NumericalizedDataset.load\"><code>load</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L71\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>load</code>(`path`:`Path`, `name`:`str`)\n",
       "\n",
       "Load a [`NumericalizedDataset`](/text.data.html#NumericalizedDataset) from `path` in `name`.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(NumericalizedDataset.load)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Language Model data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A language model is trained to guess what the next word is inside a flow of words. We don't feed it the different texts separately but concatenate them all together in a big array. To create the batches, we split this array into `bs` chuncks of continuous texts. Note that in all NLP tasks, we use the pytoch convention of sequence length being the first dimension (and batch size being the second one) so we transpose that array so that we can read the chunks of texts in columns. Here is an example of batch from our imdb sample dataset. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>xxfld</td>\n",
       "      <td>the</td>\n",
       "      <td>xxfld</td>\n",
       "      <td>what</td>\n",
       "      <td>this</td>\n",
       "      <td>i</td>\n",
       "      <td>his</td>\n",
       "      <td>\"</td>\n",
       "      <td>)</td>\n",
       "      <td>out</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>first</td>\n",
       "      <td>2</td>\n",
       "      <td>makes</td>\n",
       "      <td>.</td>\n",
       "      <td>ever</td>\n",
       "      <td>work</td>\n",
       "      <td>entertainment</td>\n",
       "      <td>and</td>\n",
       "      <td>of</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>this</td>\n",
       "      <td>things</td>\n",
       "      <td>false</td>\n",
       "      <td>more</td>\n",
       "      <td>i</td>\n",
       "      <td>saw</td>\n",
       "      <td>.</td>\n",
       "      <td>\"</td>\n",
       "      <td>the</td>\n",
       "      <td>their</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>is</td>\n",
       "      <td>i</td>\n",
       "      <td>xxfld</td>\n",
       "      <td>interesting</td>\n",
       "      <td>also</td>\n",
       "      <td>outside</td>\n",
       "      <td>jerry</td>\n",
       "      <td>.</td>\n",
       "      <td>next</td>\n",
       "      <td>xxup</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a</td>\n",
       "      <td>noticed</td>\n",
       "      <td>1</td>\n",
       "      <td>hollywood</td>\n",
       "      <td>wish</td>\n",
       "      <td>of</td>\n",
       "      <td>van</td>\n",
       "      <td>10</td>\n",
       "      <td>he</td>\n",
       "      <td>dvd</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>very</td>\n",
       "      <td>was</td>\n",
       "      <td>ask</td>\n",
       "      <td>movies</td>\n",
       "      <td>they</td>\n",
       "      <td>star</td>\n",
       "      <td>xxunk</td>\n",
       "      <td>/</td>\n",
       "      <td>'s</td>\n",
       "      <td>collection</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>old</td>\n",
       "      <td>,</td>\n",
       "      <td>yourself</td>\n",
       "      <td>,</td>\n",
       "      <td>'d</td>\n",
       "      <td>wars</td>\n",
       "      <td>'s</td>\n",
       "      <td>10</td>\n",
       "      <td>trying</td>\n",
       "      <td>.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>and</td>\n",
       "      <td>during</td>\n",
       "      <td>where</td>\n",
       "      <td>even</td>\n",
       "      <td>done</td>\n",
       "      <td>.</td>\n",
       "      <td>splendid</td>\n",
       "      <td>xxfld</td>\n",
       "      <td>to</td>\n",
       "      <td>this</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>cheaply</td>\n",
       "      <td>winston</td>\n",
       "      <td>she</td>\n",
       "      <td>today</td>\n",
       "      <td>some</td>\n",
       "      <td>since</td>\n",
       "      <td>score</td>\n",
       "      <td>2</td>\n",
       "      <td>beat</td>\n",
       "      <td>may</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>made</td>\n",
       "      <td>'s</td>\n",
       "      <td>got</td>\n",
       "      <td>.</td>\n",
       "      <td>self</td>\n",
       "      <td>then</td>\n",
       "      <td>xxunk</td>\n",
       "      <td>false</td>\n",
       "      <td>up</td>\n",
       "      <td>give</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>film</td>\n",
       "      <td>day</td>\n",
       "      <td>the</td>\n",
       "      <td>p.s</td>\n",
       "      <td>-</td>\n",
       "      <td>i</td>\n",
       "      <td>as</td>\n",
       "      <td>xxfld</td>\n",
       "      <td>protée</td>\n",
       "      <td>you</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>--</td>\n",
       "      <td>to</td>\n",
       "      <td>gun</td>\n",
       "      <td>.</td>\n",
       "      <td>xxunk</td>\n",
       "      <td>have</td>\n",
       "      <td>the</td>\n",
       "      <td>1</td>\n",
       "      <td>!</td>\n",
       "      <td>an</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>a</td>\n",
       "      <td>day</td>\n",
       "      <td>?</td>\n",
       "      <td>i</td>\n",
       "      <td>humor</td>\n",
       "      <td>become</td>\n",
       "      <td>viewer</td>\n",
       "      <td>pixar</td>\n",
       "      <td>i</td>\n",
       "      <td>idea</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>typical</td>\n",
       "      <td>life</td>\n",
       "      <td>remember</td>\n",
       "      <td>spent</td>\n",
       "      <td>about</td>\n",
       "      <td>a</td>\n",
       "      <td>is</td>\n",
       "      <td>has</td>\n",
       "      <td>could</td>\n",
       "      <td>that</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>low</td>\n",
       "      <td>in</td>\n",
       "      <td>what</td>\n",
       "      <td>10</td>\n",
       "      <td>the</td>\n",
       "      <td>very</td>\n",
       "      <td>thrown</td>\n",
       "      <td>had</td>\n",
       "      <td>only</td>\n",
       "      <td>scarface</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>-</td>\n",
       "      <td>his</td>\n",
       "      <td>she</td>\n",
       "      <td>xxunk</td>\n",
       "      <td>changes</td>\n",
       "      <td>big</td>\n",
       "      <td>from</td>\n",
       "      <td>massive</td>\n",
       "      <td>guess</td>\n",
       "      <td>is</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>budget</td>\n",
       "      <td>work</td>\n",
       "      <td>was</td>\n",
       "      <td>of</td>\n",
       "      <td>-</td>\n",
       "      <td>ewan</td>\n",
       "      <td>one</td>\n",
       "      <td>success</td>\n",
       "      <td>as</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>b</td>\n",
       "      <td>,</td>\n",
       "      <td>taught</td>\n",
       "      <td>20</td>\n",
       "      <td>like</td>\n",
       "      <td>mcgregor</td>\n",
       "      <td>bizarre</td>\n",
       "      <td>over</td>\n",
       "      <td>to</td>\n",
       "      <td>\"</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>-</td>\n",
       "      <td>his</td>\n",
       "      <td>about</td>\n",
       "      <td>)</td>\n",
       "      <td>on</td>\n",
       "      <td>fan</td>\n",
       "      <td>xxunk</td>\n",
       "      <td>the</td>\n",
       "      <td>what</td>\n",
       "      <td>gangster</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>western</td>\n",
       "      <td>conversations</td>\n",
       "      <td>the</td>\n",
       "      <td>and</td>\n",
       "      <td>\"</td>\n",
       "      <td>but</td>\n",
       "      <td>to</td>\n",
       "      <td>years</td>\n",
       "      <td>motivated</td>\n",
       "      <td>movie</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          0              1         2            3        4         5  \\\n",
       "0     xxfld            the     xxfld         what     this         i   \n",
       "1         1          first         2        makes        .      ever   \n",
       "2      this         things     false         more        i       saw   \n",
       "3        is              i     xxfld  interesting     also   outside   \n",
       "4         a        noticed         1    hollywood     wish        of   \n",
       "5      very            was       ask       movies     they      star   \n",
       "6       old              ,  yourself            ,       'd      wars   \n",
       "7       and         during     where         even     done         .   \n",
       "8   cheaply        winston       she        today     some     since   \n",
       "9      made             's       got            .     self      then   \n",
       "10     film            day       the          p.s        -         i   \n",
       "11       --             to       gun            .    xxunk      have   \n",
       "12        a            day         ?            i    humor    become   \n",
       "13  typical           life  remember        spent    about         a   \n",
       "14      low             in      what           10      the      very   \n",
       "15        -            his       she        xxunk  changes       big   \n",
       "16   budget           work       was           of        -      ewan   \n",
       "17        b              ,    taught           20     like  mcgregor   \n",
       "18        -            his     about            )       on       fan   \n",
       "19  western  conversations       the          and        \"       but   \n",
       "\n",
       "           6              7          8           9  \n",
       "0        his              \"          )         out  \n",
       "1       work  entertainment        and          of  \n",
       "2          .              \"        the       their  \n",
       "3      jerry              .       next        xxup  \n",
       "4        van             10         he         dvd  \n",
       "5      xxunk              /         's  collection  \n",
       "6         's             10     trying           .  \n",
       "7   splendid          xxfld         to        this  \n",
       "8      score              2       beat         may  \n",
       "9      xxunk          false         up        give  \n",
       "10        as          xxfld     protée         you  \n",
       "11       the              1          !          an  \n",
       "12    viewer          pixar          i        idea  \n",
       "13        is            has      could        that  \n",
       "14    thrown            had       only    scarface  \n",
       "15      from        massive      guess          is  \n",
       "16       one        success         as           a  \n",
       "17   bizarre           over         to           \"  \n",
       "18     xxunk            the       what    gangster  \n",
       "19        to          years  motivated       movie  "
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "path = untar_data(URLs.IMDB_SAMPLE)\n",
    "data = TextLMDataBunch.from_csv(path, 'texts.csv')\n",
    "x,y = next(iter(data.train_dl))\n",
    "example = x[:20,:10].cpu()\n",
    "texts = pd.DataFrame([data.train_ds.vocab.textify(l).split(' ') for l in example])\n",
    "texts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, as suggested in [this article](https://arxiv.org/abs/1708.02182) from Stephen Merity et al., we don't use a fixed `bptt` through the different batches but slightly change it from batch to batch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "torch.Size([68, 64])\n",
      "torch.Size([64, 64])\n",
      "torch.Size([57, 64])\n",
      "torch.Size([76, 64])\n",
      "torch.Size([70, 64])\n"
     ]
    }
   ],
   "source": [
    "iter_dl = iter(data.train_dl)\n",
    "for _ in range(5):\n",
    "    x,y = next(iter_dl)\n",
    "    print(x.size())"
   ]
  },
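  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of how a sequence length can vary from batch to batch, here is a minimal sketch in the spirit of the scheme described in the article. The helper name `sample_seq_len` and the constants (a 5% chance of halving `bptt`, a standard deviation of 5, a minimum length of 5) are assumptions made for this example and are not necessarily what the library uses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def sample_seq_len(bptt=70, rng=None):\n",
    "    # Hypothetical helper: draw a sequence length close to `bptt`.\n",
    "    rng = np.random.default_rng() if rng is None else rng\n",
    "    # Occasionally use a base length half as long (assumed probability: 5%).\n",
    "    base = bptt if rng.random() < 0.95 else bptt / 2\n",
    "    # Jitter the base length and never go below 5 tokens.\n",
    "    return max(5, int(rng.normal(base, 5)))\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "print([sample_seq_len(70, rng) for _ in range(5)])"
   ]
  },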
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is all done internally when we use [`TextLMDataBunch`](/text.data.html#TextLMDataBunch), by creating [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) using the following class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h2 id=\"LanguageModelLoader\"><code>class</code> <code>LanguageModelLoader</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L174\" class=\"source_link\">[source]</a></h2>\n",
       "\n",
       "> <code>LanguageModelLoader</code>(`dataset`:[`TextDataset`](/text.data.html#TextDataset), `bs`:`int`=`64`, `bptt`:`int`=`70`, `backwards`:`bool`=`False`, `shuffle`:`bool`=`False`)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(LanguageModelLoader, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Takes the texts from `dataset` and concatenate them all, then create a big array with `bs` columns (transposed from the data source so that we read the texts in the columns). Spits batches with a size approximately equal to `bptt` but changing at every batch. If `backwards` is True, reverses the original text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"LanguageModelLoader.batchify\"><code>batchify</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L200\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>batchify</code>(`data`:`ndarray`) → `LongTensor`"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(LanguageModelLoader.batchify, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Called at the inialization to create the big array of text ids from the [`data`](/text.data.html#text.data) array."
   ]
  },
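  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The concatenate, split and transpose steps can be sketched with plain NumPy. This is only an illustration under stated assumptions: `batchify_sketch` is a hypothetical name, the toy ids are made up, and the real method returns a `LongTensor` rather than a NumPy array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def batchify_sketch(texts, bs):\n",
    "    # Concatenate all the numericalized texts into one long stream of ids.\n",
    "    stream = np.concatenate([np.asarray(t) for t in texts])\n",
    "    # Drop the trailing ids that don't fill a complete chunk.\n",
    "    chunk_len = len(stream) // bs\n",
    "    stream = stream[:chunk_len * bs]\n",
    "    # Split into `bs` contiguous chunks, then transpose so that each chunk\n",
    "    # is read down a column (sequence length first, batch size second).\n",
    "    return stream.reshape(bs, -1).T\n",
    "\n",
    "texts = [[2, 3, 4], [5, 6, 7, 8], [9, 10, 11]]  # toy ids\n",
    "big = batchify_sketch(texts, bs=2)\n",
    "print(big.shape)   # (5, 2)\n",
    "print(big[:, 0])   # [2 3 4 5 6]"
   ]
  },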
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"LanguageModelLoader.get_batch\"><code>get_batch</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L207\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>get_batch</code>(`i`:`int`, `seq_len`:`int`) → `Tuple`\\[`LongTensor`, `LongTensor`\\]\n",
       "\n",
       "Create a batch at `i` of a given `seq_len`.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(LanguageModelLoader.get_batch)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Classifier data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When preparing the data for a classifier, we keep the different texts separate, which poses another challenge for the creation of batches: since they don't all have the same length, we can't easily collate them together in batches. To help with this we use two different techniques:\n",
    "- padding: each text is padded with the `PAD` token to get all the ones we picked to the same size\n",
    "- sorting the texts (ish): to avoid having together a very long text with a very short one (which would then have a lot of `PAD` tokens), we regroup the texts by order of length. For the training set, we still add some randomness to avoid showing the same batches at every step of the training.\n",
    "\n",
    "Here is an example of batch with padding (the padding index is 1, and the padding is applied before the sentences start)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],\n",
       "        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],\n",
       "        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],\n",
       "        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],\n",
       "        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],\n",
       "        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],\n",
       "        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],\n",
       "        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],\n",
       "        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],\n",
       "        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],\n",
       "        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],\n",
       "        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],\n",
       "        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],\n",
       "        [  20,   20,    1,    1,    1,    1,    1,    1,    1,    1],\n",
       "        [  42,   42,   20,   20,    1,    1,    1,    1,    1,    1],\n",
       "        [  70,   94,   42,   42,    1,    1,    1,    1,    1,    1],\n",
       "        [  14, 1662,   53, 2822,   20,    1,    1,    1,    1,    1],\n",
       "        [ 935, 2061,    9,    3,   42,    1,    1,    1,    1,    1],\n",
       "        [ 101,  269,  199, 3848,   23,    1,    1,    1,    1,    1],\n",
       "        [2911,  212,  907,    7,    6,   20,    1,    1,    1,    1]],\n",
       "       device='cuda:0')"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "path = untar_data(URLs.IMDB_SAMPLE)\n",
    "data = TextClasDataBunch.from_csv(path, 'texts.csv')\n",
    "iter_dl = iter(data.train_dl)\n",
    "_ = next(iter_dl)\n",
    "x,y = next(iter_dl)\n",
    "x[:20,-10:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is all done internally when we use [`TextClasDataBunch`](/text.data.html#TextClasDataBunch), by using the following classes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h2 id=\"SortSampler\"><code>class</code> <code>SortSampler</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L212\" class=\"source_link\">[source]</a></h2>\n",
       "\n",
       "> <code>SortSampler</code>(`data_source`:`NPArrayList`, `key`:`KeyFunc`) :: [`Sampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(SortSampler, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "pytorch [`Sampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler) to batchify the `data_source` by order of length of the texts. Used for the validation and (if applicable) the test set. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h2 id=\"SortishSampler\"><code>class</code> <code>SortishSampler</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L220\" class=\"source_link\">[source]</a></h2>\n",
       "\n",
       "> <code>SortishSampler</code>(`data_source`:`NPArrayList`, `key`:`KeyFunc`, `bs`:`int`) :: [`Sampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(SortishSampler, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "pytorch [`Sampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler) to batchify with size `bs` the `data_source` by order of length of the texts with a bit of randomness. Used for the training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"pad_collate\"><code>pad_collate</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L241\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>pad_collate</code>(`samples`:`BatchSamples`, `pad_idx`:`int`=`1`, `pad_first`:`bool`=`True`) → `Tuple`\\[`LongTensor`, `LongTensor`\\]"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(pad_collate, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Function used by the pytorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) to collate the `samples` in batches while adding padding with `pad_idx`. If `pad_first` is True, padding is applied at the beginning (before the sentence starts) otherwise it's applied at the end."
   ]
  },
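  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the effect of `pad_idx` and `pad_first` concrete, here is a minimal sketch of this kind of collation. It is an illustration under stated assumptions, not the library's code: `pad_collate_sketch` is a hypothetical name, NumPy arrays stand in for the `LongTensor` outputs, and the toy samples are made up. The result keeps the sequence length first, batch size second layout used above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def pad_collate_sketch(samples, pad_idx=1, pad_first=True):\n",
    "    # `samples` is a list of (ids, label) pairs with ids of varying lengths.\n",
    "    max_len = max(len(ids) for ids, _ in samples)\n",
    "    # Start from a (max_len, batch size) array filled with the padding index.\n",
    "    batch = np.full((max_len, len(samples)), pad_idx, dtype=np.int64)\n",
    "    for col, (ids, _) in enumerate(samples):\n",
    "        # Write each text at the bottom (padding first) or top (padding last) of its column.\n",
    "        if pad_first: batch[max_len - len(ids):, col] = ids\n",
    "        else:         batch[:len(ids), col] = ids\n",
    "    labels = np.array([label for _, label in samples])\n",
    "    return batch, labels\n",
    "\n",
    "samples = [([20, 42, 70], 0), ([20, 42], 1)]  # toy (ids, label) pairs\n",
    "x, y = pad_collate_sketch(samples)\n",
    "print(x)"
   ]
  },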
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data block API"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data block API works for the text application too. Here are a few subclasses of the usual objects to implement the parts speficic to the text application."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h3 id=\"TextFileList\"><code>class</code> <code>TextFileList</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L14\" class=\"source_link\">[source]</a></h3>\n",
       "\n",
       "> <code>TextFileList</code>(`items`:`Iterator`, `path`:`PathOrStr`=`'.'`) :: [`InputList`](/data_block.html#InputList)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextFileList, doc_string=False, title_level=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This subclasses [`InputList`](/data_block.html#InputList) just to change the defulat extentions in `from_folder` to text extensions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TextFileList.from_folder\"><code>from_folder</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L16\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>from_folder</code>(`path`:`PathOrStr`=`'.'`, `extensions`:`StrList`=`['.txt']`, `recurse`=`True`) → `ImageFileList`\n",
       "\n",
       "Get the list of files in `path` that have a suffix in `extensions`. `recurse` determines if we search subfolders.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextFileList.from_folder)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h3 id=\"SplitDatasetsText\"><code>class</code> <code>SplitDatasetsText</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L21\" class=\"source_link\">[source]</a></h3>\n",
       "\n",
       "> <code>SplitDatasetsText</code>(`path`:`PathOrStr`, `train_ds`:[`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset), `valid_ds`:[`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset), `test_ds`:`Optional`\\[[`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset)\\]=`None`) :: [`SplitDatasets`](/data_block.html#SplitDatasets)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(SplitDatasetsText, doc_string=False, title_level=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A subclass of [`SplitDatasets`](/data_block.html#SplitDatasets) that implements methods specific to texts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"SplitDatasetsText.tokenize\"><code>tokenize</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L22\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>tokenize</code>(`tokenizer`:[`Tokenizer`](/text.transform.html#Tokenizer)=`None`, `chunksize`:`int`=`10000`)\n",
       "\n",
       "Tokenize `self.datasets` with `tokenizer` by bits of `chunksize`.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(SplitDatasetsText.tokenize)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"SplitDatasetsText.numericalize\"><code>numericalize</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L27\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>numericalize</code>(`vocab`:[`Vocab`](/text.transform.html#Vocab)=`None`, `max_vocab`:`int`=`60000`, `min_freq`:`int`=`2`)\n",
       "\n",
       "Numericalize `self.datasets` with `vocab` or by creating one on the training set with `max_vocab` and `min_freq`.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(SplitDatasetsText.numericalize)"
   ]
  },
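  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the roles of `max_vocab` and `min_freq` concrete, the following is a minimal sketch of numericalization in plain Python. It is an assumption-laden illustration rather than the library code: the helpers `build_vocab` and `numericalize_tokens`, and the choice of `'xxunk'` at index 0 for unknown tokens, are hypothetical.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "def build_vocab(tokenized_texts, max_vocab=60000, min_freq=2, unk='xxunk'):\n",
    "    'Hypothetical helper: keep at most `max_vocab` tokens seen at least `min_freq` times.'\n",
    "    freq = Counter(tok for text in tokenized_texts for tok in text)\n",
    "    # index 0 is reserved for unknown tokens (an assumption of this sketch)\n",
    "    itos = [unk] + [tok for tok, n in freq.most_common(max_vocab) if n >= min_freq and tok != unk]\n",
    "    stoi = {tok: i for i, tok in enumerate(itos)}\n",
    "    return itos, stoi\n",
    "\n",
    "def numericalize_tokens(tokens, stoi, unk_id=0):\n",
    "    'Map each token to its id, falling back to the unknown id.'\n",
    "    return [stoi.get(tok, unk_id) for tok in tokens]\n",
    "\n",
    "# The vocabulary is built on the training texts only, then reused for other texts\n",
    "train = [['the', 'cat', 'sat'], ['the', 'dog', 'sat']]\n",
    "itos, stoi = build_vocab(train)\n",
    "ids = numericalize_tokens(['the', 'bird', 'sat'], stoi)\n",
    "```\n",
    "\n",
    "Tokens that are too rare in the training set, or absent from it, all map to the same unknown id, which is why the vocabulary should come from the training set and be shared with the validation and test sets."
   ]
  },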
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"SplitDatasetsText.databunch\"><code>databunch</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L34\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>databunch</code>(`cls_func`, `path`:`PathOrStr`=`None`, `kwargs`)\n",
       "\n",
       "Create an `cls_func` from self, `path` will override `self.path`, `kwargs` are passed to `cls_func.create`.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(SplitDatasetsText.databunch)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Enums"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h2 id=\"TextMtd\">`TextMtd`</h2>\n",
       "\n",
       "> <code>Enum</code> = [DF, TOK, IDS]\n",
       "\n",
       "[`TextDataset`](/text.data.html#TextDataset) enum to keep track of what data needs to be processed (dataframe, csv, tokens, ids) "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextMtd, alt_doc_string='`TextDataset` enum to keep track of what data needs to be processed (dataframe, csv, tokens, ids)')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Undocumented Methods - Methods moved below this line will intentionally be hidden"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TextLMDataBunch.create\"><code>create</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L361\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>create</code>(`train_ds`, `valid_ds`, `test_ds`=`None`, `path`:`PathOrStr`=`'.'`, `kwargs`) → [`DataBunch`](/basic_data.html#DataBunch)\n",
       "\n",
       "Create a [`TextDataBunch`](/text.data.html#TextDataBunch) in `path` from the `datasets` for language modelling.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextLMDataBunch.create)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TextClasDataBunch.create\"><code>create</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L382\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>create</code>(`train_ds`, `valid_ds`, `test_ds`=`None`, `path`:`PathOrStr`=`'.'`, `bs`=`64`, `pad_idx`=`1`, `pad_first`=`True`, `kwargs`) → [`DataBunch`](/basic_data.html#DataBunch)\n",
       "\n",
       "Function that transform the `datasets` in a [`DataBunch`](/basic_data.html#DataBunch) for classification.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TextClasDataBunch.create)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## New Methods - Please document or move to the undocumented section"
   ]
  }
 ],
 "metadata": {
  "jekyll": {
   "keywords": "fastai",
   "summary": "Basic dataset for NLP tasks and helper functions to create a DataBunch",
   "title": "text.data"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}