{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preprocess text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "from exp.nb_11a import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use the IMDB dataset that consists of 50,000 labeled reviews of movies (positive or negative) and 50,000 unlabelled ones."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Jump_to lesson 12 video](https://course19.fast.ai/videos/?lesson=12&t=4964)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = datasets.untar_data(datasets.URLs.IMDB)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[PosixPath('/home/jupyter/.fastai/data/imdb/unsup'),\n",
       " PosixPath('/home/jupyter/.fastai/data/imdb/tmp_clas'),\n",
       " PosixPath('/home/jupyter/.fastai/data/imdb/ld.pkl'),\n",
       " PosixPath('/home/jupyter/.fastai/data/imdb/test'),\n",
       " PosixPath('/home/jupyter/.fastai/data/imdb/train'),\n",
       " PosixPath('/home/jupyter/.fastai/data/imdb/README'),\n",
       " PosixPath('/home/jupyter/.fastai/data/imdb/models'),\n",
       " PosixPath('/home/jupyter/.fastai/data/imdb/tmp_lm'),\n",
       " PosixPath('/home/jupyter/.fastai/data/imdb/imdb.vocab')]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "path.ls()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We define a subclass of `ItemList` that will read the texts in the corresponding filenames."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "def read_file(fn): \n",
    "    with open(fn, 'r', encoding = 'utf8') as f: return f.read()\n",
    "    \n",
    "class TextList(ItemList):\n",
    "    @classmethod\n",
    "    def from_files(cls, path, extensions='.txt', recurse=True, include=None, **kwargs):\n",
    "        return cls(get_files(path, extensions, recurse=recurse, include=include), path, **kwargs)\n",
    "    \n",
    "    def get(self, i):\n",
    "        if isinstance(i, Path): return read_file(i)\n",
    "        return i"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just in case there are some text log files, we restrict the ones we take to the training, test, and unsupervised folders."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "il = TextList.from_files(path, include=['train', 'test', 'unsup'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We should expect a total of 100,000 texts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "100000"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(il.items)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is the first one as an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Comedian Adam Sandler\\'s last theatrical release \"I Now Pronounce You Chuck and Larry\" served as a loud and proud plea for tolerance of the gay community. The former \"Saturday Night Live\" funnyman\\'s new movie \"You Don\\'t Mess with the Zohan\" (*** out of ****) constitutes his plea for tolerance toward Israeli and Palestinian immigrants in America. These unfortunate people are often punished in America for the crimes of their counterparts in the war-ravaged Middle East. Although \"Zohan\" advocates a lofty cause, Sandler doesn\\'t let his political agenda overshadow his usual crude, below-the-belt, juvenile shenanigans that rely on obscene bodily functions, promiscuous sex, and far-fetched harebrained idiocy. Indeed, the hysterical horseplay that Sandler and company revel in may distract you from the plight of these uprooted, misplaced misfits that had fled to Uncle Sam\\'s shores because they believe America is a Utopia. Interestingly, Sandler plays a Jewish counterterrorist agent of the Mossad, Israel\\'s secret police, with a hopelessly corny accent. Zohan\\'s exploits appear to foreshadow Will Smith\\'s upcoming \"Hancock.\" Zohan is the best Jewish secret agent in the whole wide world. He is literally indestructible. He catches bullets in his nose. He can swim faster than a dolphin, and a razor-toothed piranha fish in his bikini swim trunks amuses him.<br /><br />Zohan (Adam Sandler) is cooking fish at the beach when his superiors interrupt his vacation and inform him that the dreaded Arab terrorist, the Phantom (John Turturro of \"Transformers\"), is up to his old tricks again. Naturally, Zohan is furious! Actually, Zohan captured the Phantom three months ago, but the politicians have exchanged the Phantom for political prisoners. Now, Zohan must nab his nemesis again! The Phantom and Zohan tangle in a spectacular fight in the sea and Zohan doesn\\'t survive. In reality, Zohan deliberately fakes his death so that he can immigrate to New York City and realize his life-long dream of cutting hair for Paul Mitchell. Zohan gives himself an obsolete Frankie Avalon haircut, trims his beard, and smuggles himself onto a plane bound for America. If what happens before his flight seems outlandish, once he is on the jet, he spends his time in the cargo hold with two fluffy dogs named \"Scrappy\" and \"Coco.\" Zohan styles their hair from photos in his Paul Mitchell haircut book.<br /><br />At first, Zohan has no luck getting a job with Paul Mitchell, much less cutting hair. Zohan defends Michael (Nick Swardson of \"Reno 911, The Movie\") in a street brawl after a motorist blames Michael for his accident with a delivery truck. A grateful Michael invites Zohan to stay with his mother, Gail (Lainie Kazan of \"Dayton\\'s Devils\"), and him. Zohan practices cutting Gail\\'s hair when he isn\\'t having in lusty sex with her. Eventually, Zohan gets a job sweeping up hair at a salon owned by Dalia (Emmanuelle Chriqui of \"Wrong Turn\") who as it turns out is a Palestinian. Indeed, Zohan knows about her heritage but doesn\\'t let it bother him. One day when one of Dalia\\'s hair stylists doesn\\'t show up, Zohan takes advantage of her absence to cut hair. Much to Dalia\\'s surprise, Zohan wins the allegiance of the over sixty crowd. Older woman line up around the block to have him fashion their hair. After each session, Zohan takes each older lady in the back and assuages their sexual appetites.<br /><br />Meanwhile, a millionaire real estate developer Walbridge (Michael Buffer of \"Rocky Balboa\") hikes the rent to force Dalia and others like her out of her store to make way for his mall with a roller-coaster. Zohan surprises both Dalia and Walbridge\\'s people and forks over the money for her to pay the rent. An angry Walbridge contacts a white supremacy group to ignite a neighborhood war between the Israelis and Palestinians. This happens about the same time that Zohan falls in love with Dalia. Perennial Sandler cohort Rob Schneider of \"Deuce Bigalow\" appears as a cretinous Palestinian named Salim who doesn\\'t know the difference between nitroglycerin and Neosporin. He tries to blow up Zohan for an old grudge. It seems Zohan beat Salim up and stole his goat.<br /><br />\"You Don\\'t Mess with the Zohan\" qualifies as a surreal comedy. Scenarists Robert Smigel of \"Saturday Night Live,\" Judd Apatow of \"The 40-Year Old Virgin,\" and Sandler himself vigorously ignore the laws of logic in this zany comedy. The movie that most closely resembles \"Zohan\" is \"Little Nicky,\" because both characters boast supernatural abilities. \"You Don\\'t Mess with the Zohan\" will keep Adam Sandler fans in stitches.'"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "txt = il[0]\n",
    "txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For text classification, we will split by the grand parent folder as before, but for language modeling, we take all the texts and just put 10% aside."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sd = SplitData.split_by_func(il, partial(random_splitter, p_valid=0.1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "SplitData\n",
       "Train: TextList (89885 items)\n",
       "[PosixPath('/home/jupyter/.fastai/data/imdb/unsup/30860_0.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/unsup/36250_0.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/unsup/24690_0.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/unsup/21770_0.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/unsup/9740_0.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/unsup/40778_0.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/unsup/44512_0.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/unsup/22672_0.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/unsup/25946_0.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/unsup/40866_0.txt')...]\n",
       "Path: /home/jupyter/.fastai/data/imdb\n",
       "Valid: TextList (10115 items)\n",
       "[PosixPath('/home/jupyter/.fastai/data/imdb/unsup/1041_0.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/unsup/38186_0.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/unsup/16367_0.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/unsup/47167_0.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/unsup/58_0.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/unsup/49861_0.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/unsup/306_0.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/unsup/18238_0.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/unsup/34952_0.txt'), PosixPath('/home/jupyter/.fastai/data/imdb/unsup/24288_0.txt')...]\n",
       "Path: /home/jupyter/.fastai/data/imdb"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tokenizing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to tokenize the dataset first, which is splitting a sentence in individual tokens. Those tokens are the basic words or punctuation signs with a few tweaks: don't for instance is split between do and n't. We will use a processor for this, in conjunction with the [spacy library](https://spacy.io/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Jump_to lesson 12 video](https://course19.fast.ai/videos/?lesson=12&t=5070)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "import spacy,html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before even tokenizeing, we will apply a bit of preprocessing on the texts to clean them up (we saw the one up there had some HTML code). These rules are applied before we split the sentences in tokens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "#special tokens\n",
    "UNK, PAD, BOS, EOS, TK_REP, TK_WREP, TK_UP, TK_MAJ = \"xxunk xxpad xxbos xxeos xxrep xxwrep xxup xxmaj\".split()\n",
    "\n",
    "def sub_br(t):\n",
    "    \"Replaces the <br /> by \\n\"\n",
    "    re_br = re.compile(r'<\\s*br\\s*/?>', re.IGNORECASE)\n",
    "    return re_br.sub(\"\\n\", t)\n",
    "\n",
    "def spec_add_spaces(t):\n",
    "    \"Add spaces around / and #\"\n",
    "    return re.sub(r'([/#])', r' \\1 ', t)\n",
    "\n",
    "def rm_useless_spaces(t):\n",
    "    \"Remove multiple spaces\"\n",
    "    return re.sub(' {2,}', ' ', t)\n",
    "\n",
    "def replace_rep(t):\n",
    "    \"Replace repetitions at the character level: cccc -> TK_REP 4 c\"\n",
    "    def _replace_rep(m:Collection[str]) -> str:\n",
    "        c,cc = m.groups()\n",
    "        return f' {TK_REP} {len(cc)+1} {c} '\n",
    "    re_rep = re.compile(r'(\\S)(\\1{3,})')\n",
    "    return re_rep.sub(_replace_rep, t)\n",
    "    \n",
    "def replace_wrep(t):\n",
    "    \"Replace word repetitions: word word word -> TK_WREP 3 word\"\n",
    "    def _replace_wrep(m:Collection[str]) -> str:\n",
    "        c,cc = m.groups()\n",
    "        return f' {TK_WREP} {len(cc.split())+1} {c} '\n",
    "    re_wrep = re.compile(r'(\\b\\w+\\W+)(\\1{3,})')\n",
    "    return re_wrep.sub(_replace_wrep, t)\n",
    "\n",
    "def fixup_text(x):\n",
    "    \"Various messy things we've seen in documents\"\n",
    "    re1 = re.compile(r'  +')\n",
    "    x = x.replace('#39;', \"'\").replace('amp;', '&').replace('#146;', \"'\").replace(\n",
    "        'nbsp;', ' ').replace('#36;', '$').replace('\\\\n', \"\\n\").replace('quot;', \"'\").replace(\n",
    "        '<br />', \"\\n\").replace('\\\\\"', '\"').replace('<unk>',UNK).replace(' @.@ ','.').replace(\n",
    "        ' @-@ ','-').replace('\\\\', ' \\\\ ')\n",
    "    return re1.sub(' ', html.unescape(x))\n",
    "    \n",
    "default_pre_rules = [fixup_text, replace_rep, replace_wrep, spec_add_spaces, rm_useless_spaces, sub_br]\n",
    "default_spec_tok = [UNK, PAD, BOS, EOS, TK_REP, TK_WREP, TK_UP, TK_MAJ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "' xxrep 4 c '"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "replace_rep('cccc')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "' xxwrep 5 word  '"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "replace_wrep('word word word word word ')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These rules are applies after the tokenization on the list of tokens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "def replace_all_caps(x):\n",
    "    \"Replace tokens in ALL CAPS by their lower version and add `TK_UP` before.\"\n",
    "    res = []\n",
    "    for t in x:\n",
    "        if t.isupper() and len(t) > 1: res.append(TK_UP); res.append(t.lower())\n",
    "        else: res.append(t)\n",
    "    return res\n",
    "\n",
    "def deal_caps(x):\n",
    "    \"Replace all Capitalized tokens in by their lower version and add `TK_MAJ` before.\"\n",
    "    res = []\n",
    "    for t in x:\n",
    "        if t == '': continue\n",
    "        if t[0].isupper() and len(t) > 1 and t[1:].islower(): res.append(TK_MAJ)\n",
    "        res.append(t.lower())\n",
    "    return res\n",
    "\n",
    "def add_eos_bos(x): return [BOS] + x + [EOS]\n",
    "\n",
    "default_post_rules = [deal_caps, replace_all_caps, add_eos_bos]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['I', 'xxup', 'am', 'xxup', 'shouting']"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "replace_all_caps(['I', 'AM', 'SHOUTING'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['xxmaj', 'my', 'name', 'is', 'xxmaj', 'jeremy']"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "deal_caps(['My', 'name', 'is', 'Jeremy'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since tokenizing and applying those rules takes a bit of time, we'll parallelize it using `ProcessPoolExecutor` to go faster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "from spacy.symbols import ORTH\n",
    "from concurrent.futures import ProcessPoolExecutor\n",
    "\n",
    "def parallel(func, arr, max_workers=4):\n",
    "    if max_workers<2: results = list(progress_bar(map(func, enumerate(arr)), total=len(arr)))\n",
    "    else:\n",
    "        with ProcessPoolExecutor(max_workers=max_workers) as ex:\n",
    "            return list(progress_bar(ex.map(func, enumerate(arr)), total=len(arr)))\n",
    "    if any([o is not None for o in results]): return results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "class TokenizeProcessor(Processor):\n",
    "    def __init__(self, lang=\"en\", chunksize=2000, pre_rules=None, post_rules=None, max_workers=4): \n",
    "        self.chunksize,self.max_workers = chunksize,max_workers\n",
    "        self.tokenizer = spacy.blank(lang).tokenizer\n",
    "        for w in default_spec_tok:\n",
    "            self.tokenizer.add_special_case(w, [{ORTH: w}])\n",
    "        self.pre_rules  = default_pre_rules  if pre_rules  is None else pre_rules\n",
    "        self.post_rules = default_post_rules if post_rules is None else post_rules\n",
    "\n",
    "    def proc_chunk(self, args):\n",
    "        i,chunk = args\n",
    "        chunk = [compose(t, self.pre_rules) for t in chunk]\n",
    "        docs = [[d.text for d in doc] for doc in self.tokenizer.pipe(chunk)]\n",
    "        docs = [compose(t, self.post_rules) for t in docs]\n",
    "        return docs\n",
    "\n",
    "    def __call__(self, items): \n",
    "        toks = []\n",
    "        if isinstance(items[0], Path): items = [read_file(i) for i in items]\n",
    "        chunks = [items[i: i+self.chunksize] for i in (range(0, len(items), self.chunksize))]\n",
    "        toks = parallel(self.proc_chunk, chunks, max_workers=self.max_workers)\n",
    "        return sum(toks, [])\n",
    "    \n",
    "    def proc1(self, item): return self.proc_chunk([item])[0]\n",
    "    \n",
    "    def deprocess(self, toks): return [self.deproc1(tok) for tok in toks]\n",
    "    def deproc1(self, tok):    return \" \".join(tok)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tp = TokenizeProcessor()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Comedian Adam Sandler\\'s last theatrical release \"I Now Pronounce You Chuck and Larry\" served as a loud and proud plea for tolerance of the gay community. The former \"Saturday Night Live\" funnyman\\'s new movie \"You Don\\'t Mess with the Zohan\" (*** out o'"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "txt[:250]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "    <div>\n",
       "        <style>\n",
       "            /* Turns off some styling */\n",
       "            progress {\n",
       "                /* gets rid of default border in Firefox and Opera. */\n",
       "                border: none;\n",
       "                /* Needs to be in here for Safari polyfill so background images work as expected. */\n",
       "                background-size: auto;\n",
       "            }\n",
       "            .progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {\n",
       "                background: #F44336;\n",
       "            }\n",
       "        </style>\n",
       "      <progress value='1' class='' max='1', style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      100.00% [1/1 00:00<00:00]\n",
       "    </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'xxbos • xxmaj • comedian • xxmaj • adam • xxmaj • sandler • \\'s • last • theatrical • release • \" • i • xxmaj • now • xxmaj • pronounce • xxmaj • you • xxmaj • chuck • and • xxmaj • larry • \" • served • as • a • loud • and • proud • plea • for • tolerance • of • the • gay • community • . • xxmaj • the • former • \" • xxmaj • saturday • xxmaj • night • xxmaj • live • \" • funnyman • \\'s • new • movie •'"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "' • '.join(tp(il[:100])[0])[:400]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Numericalizing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we have tokenized our texts, we replace each token by an individual number, this is called numericalizing. Again, we do this with a processor (not so different from the `CategoryProcessor`)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Jump_to lesson 12 video](https://course19.fast.ai/videos/?lesson=12&t=5491)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "import collections\n",
    "\n",
    "class NumericalizeProcessor(Processor):\n",
    "    def __init__(self, vocab=None, max_vocab=60000, min_freq=2): \n",
    "        self.vocab,self.max_vocab,self.min_freq = vocab,max_vocab,min_freq\n",
    "    \n",
    "    def __call__(self, items):\n",
    "        #The vocab is defined on the first use.\n",
    "        if self.vocab is None:\n",
    "            freq = Counter(p for o in items for p in o)\n",
    "            self.vocab = [o for o,c in freq.most_common(self.max_vocab) if c >= self.min_freq]\n",
    "            for o in reversed(default_spec_tok):\n",
    "                if o in self.vocab: self.vocab.remove(o)\n",
    "                self.vocab.insert(0, o)\n",
    "        if getattr(self, 'otoi', None) is None:\n",
    "            self.otoi = collections.defaultdict(int,{v:k for k,v in enumerate(self.vocab)}) \n",
    "        return [self.proc1(o) for o in items]\n",
    "    def proc1(self, item):  return [self.otoi[o] for o in item]\n",
    "    \n",
    "    def deprocess(self, idxs):\n",
    "        assert self.vocab is not None\n",
    "        return [self.deproc1(idx) for idx in idxs]\n",
    "    def deproc1(self, idx): return [self.vocab[i] for i in idx]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When we do language modeling, we will infer the labels from the text during training, so there's no need to label. The training loop expects labels however, so we need to add dummy ones."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "proc_tok,proc_num = TokenizeProcessor(max_workers=8),NumericalizeProcessor()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "    <div>\n",
       "        <style>\n",
       "            /* Turns off some styling */\n",
       "            progress {\n",
       "                /* gets rid of default border in Firefox and Opera. */\n",
       "                border: none;\n",
       "                /* Needs to be in here for Safari polyfill so background images work as expected. */\n",
       "                background-size: auto;\n",
       "            }\n",
       "            .progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {\n",
       "                background: #F44336;\n",
       "            }\n",
       "        </style>\n",
       "      <progress value='45' class='' max='45', style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      100.00% [45/45 00:51<00:00]\n",
       "    </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "    <div>\n",
       "        <style>\n",
       "            /* Turns off some styling */\n",
       "            progress {\n",
       "                /* gets rid of default border in Firefox and Opera. */\n",
       "                border: none;\n",
       "                /* Needs to be in here for Safari polyfill so background images work as expected. */\n",
       "                background-size: auto;\n",
       "            }\n",
       "            .progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {\n",
       "                background: #F44336;\n",
       "            }\n",
       "        </style>\n",
       "      <progress value='6' class='' max='6', style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      100.00% [6/6 00:08<00:00]\n",
       "    </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 23.9 s, sys: 5.06 s, total: 29 s\n",
      "Wall time: 3min 13s\n"
     ]
    }
   ],
   "source": [
    "%time ll = label_by_func(sd, lambda x: 0, proc_x = [proc_tok,proc_num])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the items have been processed they will become list of numbers, we can still access the underlying raw data in `x_obj` (or `y_obj` for the targets, but we don't have any here)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'xxbos xxmaj comedian xxmaj adam xxmaj sandler \\'s last theatrical release \" i xxmaj now xxmaj pronounce xxmaj you xxmaj chuck and xxmaj larry \" served as a loud and proud plea for tolerance of the gay community . xxmaj the former \" xxmaj saturday xxmaj night xxmaj live \" funnyman \\'s new movie \" xxmaj you xxmaj do n\\'t xxmaj mess with the xxmaj zohan \" ( * * * out of xxrep 4 * ) constitutes his plea for tolerance toward xxmaj israeli and xxmaj palestinian immigrants in xxmaj america . xxmaj these unfortunate people are often punished in xxmaj america for the crimes of their counterparts in the war - ravaged xxmaj middle xxmaj east . xxmaj although \" xxmaj zohan \" advocates a lofty cause , xxmaj sandler does n\\'t let his political agenda overshadow his usual crude , below - the - belt , juvenile shenanigans that rely on obscene bodily functions , promiscuous sex , and far - fetched harebrained idiocy . xxmaj indeed , the hysterical horseplay that xxmaj sandler and company revel in may distract you from the plight of these uprooted , misplaced misfits that had fled to xxmaj uncle xxmaj sam \\'s shores because they believe xxmaj america is a xxmaj utopia . xxmaj interestingly , xxmaj sandler plays a xxmaj jewish xxunk agent of the xxmaj mossad , xxmaj israel \\'s secret police , with a hopelessly corny accent . xxmaj zohan \\'s exploits appear to foreshadow xxmaj will xxmaj smith \\'s upcoming \" xxmaj hancock . \" xxmaj zohan is the best xxmaj jewish secret agent in the whole wide world . xxmaj he is literally indestructible . xxmaj he catches bullets in his nose . xxmaj he can swim faster than a dolphin , and a razor - toothed piranha fish in his bikini swim trunks amuses him . \\n\\n xxmaj zohan ( xxmaj adam xxmaj sandler ) is cooking fish at the beach when his superiors interrupt his vacation and inform him that the dreaded xxmaj arab terrorist , the xxmaj phantom ( xxmaj john xxmaj turturro of \" xxmaj transformers \" ) , is up to his old tricks again . xxmaj naturally , xxmaj zohan is furious ! xxmaj actually , xxmaj zohan captured the xxmaj phantom three months ago , but the politicians have exchanged the xxmaj phantom for political prisoners . xxmaj now , xxmaj zohan must nab his nemesis again ! xxmaj the xxmaj phantom and xxmaj zohan tangle in a spectacular fight in the sea and xxmaj zohan does n\\'t survive . xxmaj in reality , xxmaj zohan deliberately fakes his death so that he can immigrate to xxmaj new xxmaj york xxmaj city and realize his life - long dream of cutting hair for xxmaj paul xxmaj mitchell . xxmaj zohan gives himself an obsolete xxmaj frankie xxmaj avalon haircut , trims his beard , and smuggles himself onto a plane bound for xxmaj america . xxmaj if what happens before his flight seems outlandish , once he is on the jet , he spends his time in the cargo hold with two fluffy dogs named \" xxmaj scrappy \" and \" xxmaj coco . \" xxmaj zohan styles their hair from photos in his xxmaj paul xxmaj mitchell haircut book . \\n\\n xxmaj at first , xxmaj zohan has no luck getting a job with xxmaj paul xxmaj mitchell , much less cutting hair . xxmaj zohan defends xxmaj michael ( xxmaj nick xxmaj swardson of \" xxmaj reno 911 , xxmaj the xxmaj movie \" ) in a street brawl after a motorist blames xxmaj michael for his accident with a delivery truck . a grateful xxmaj michael invites xxmaj zohan to stay with his mother , xxmaj gail ( xxmaj lainie xxmaj kazan of \" xxmaj dayton \\'s xxmaj devils \" ) , and him . xxmaj zohan practices cutting xxmaj gail \\'s hair when he is n\\'t having in lusty sex with her . xxmaj eventually , xxmaj zohan gets a job sweeping up hair at a salon owned by xxmaj dalia ( xxmaj emmanuelle xxmaj chriqui of \" xxmaj wrong xxmaj turn \" ) who as it turns out is a xxmaj palestinian . xxmaj indeed , xxmaj zohan knows about her heritage but does n\\'t let it bother him . xxmaj one day when one of xxmaj dalia \\'s hair stylists does n\\'t show up , xxmaj zohan takes advantage of her absence to cut hair . xxmaj much to xxmaj dalia \\'s surprise , xxmaj zohan wins the allegiance of the over sixty crowd . xxmaj older woman line up around the block to have him fashion their hair . xxmaj after each session , xxmaj zohan takes each older lady in the back and xxunk their sexual appetites . \\n\\n xxmaj meanwhile , a millionaire real estate developer xxmaj walbridge ( xxmaj michael xxmaj buffer of \" xxmaj rocky xxmaj balboa \" ) hikes the rent to force xxmaj dalia and others like her out of her store to make way for his mall with a roller - coaster . xxmaj zohan surprises both xxmaj dalia and xxmaj walbridge \\'s people and forks over the money for her to pay the rent . xxmaj an angry xxmaj walbridge contacts a white supremacy group to ignite a neighborhood war between the xxmaj israelis and xxmaj palestinians . xxmaj this happens about the same time that xxmaj zohan falls in love with xxmaj dalia . xxmaj perennial xxmaj sandler cohort xxmaj rob xxmaj schneider of \" xxmaj deuce xxmaj bigalow \" appears as a cretinous xxmaj palestinian named xxmaj salim who does n\\'t know the difference between nitroglycerin and xxmaj xxunk . xxmaj he tries to blow up xxmaj zohan for an old grudge . xxmaj it seems xxmaj zohan beat xxmaj salim up and stole his goat . \\n\\n \" xxmaj you xxmaj do n\\'t xxmaj mess with the xxmaj zohan \" qualifies as a surreal comedy . xxmaj scenarists xxmaj robert xxmaj smigel of \" xxmaj saturday xxmaj night xxmaj live , \" xxmaj judd xxmaj apatow of \" xxmaj the 40-year xxmaj old xxmaj virgin , \" and xxmaj sandler himself vigorously ignore the laws of logic in this zany comedy . xxmaj the movie that most closely resembles \" xxmaj zohan \" is \" xxmaj little xxmaj nicky , \" because both characters boast supernatural abilities . \" xxmaj you xxmaj do n\\'t xxmaj mess with the xxmaj zohan \" will keep xxmaj adam xxmaj sandler fans in stitches . xxeos'"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ll.train.x_obj(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the preprocessing takes time, we save the intermediate result using pickle. Don't use any lambda functions in your processors or they won't be able to pickle."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pickle.dump(ll, open(path/'ld.pkl', 'wb'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ll = pickle.load(open(path/'ld.pkl', 'rb'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Batching"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have a bit of work to convert our `LabelList` in a `DataBunch` as we don't just want batches of IMDB reviews. We want to stream through all the texts concatenated. We also have to prepare the targets that are the newt words in the text. All of this is done with the next object called `LM_PreLoader`. At the beginning of each epoch, it'll shuffle the articles (if `shuffle=True`) and create a big stream by concatenating all of them. We divide this big stream in `bs` smaller streams. That we will read in chunks of bptt length."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Jump_to lesson 12 video](https://course19.fast.ai/videos/?lesson=12&t=5565)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Just using those for illustration purposes, they're not used otherwise.\n",
    "from IPython.display import display,HTML\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's say our stream is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "    <div>\n",
       "        <style>\n",
       "            /* Turns off some styling */\n",
       "            progress {\n",
       "                /* gets rid of default border in Firefox and Opera. */\n",
       "                border: none;\n",
       "                /* Needs to be in here for Safari polyfill so background images work as expected. */\n",
       "                background-size: auto;\n",
       "            }\n",
       "            .progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {\n",
       "                background: #F44336;\n",
       "            }\n",
       "        </style>\n",
       "      <progress value='1' class='' max='1', style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      100.00% [1/1 00:00<00:00]\n",
       "    </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "stream = \"\"\"\n",
    "In this notebook, we will go back over the example of classifying movie reviews we studied in part 1 and dig deeper under the surface. \n",
    "First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the Processor used in the data block API.\n",
    "Then we will study how we build a language model and train it.\\n\n",
    "\"\"\"\n",
    "tokens = np.array(tp([stream])[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then if we split it in 6 batches it would give something like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>xxbos</td>\n",
       "      <td>\\n</td>\n",
       "      <td>xxmaj</td>\n",
       "      <td>in</td>\n",
       "      <td>this</td>\n",
       "      <td>notebook</td>\n",
       "      <td>,</td>\n",
       "      <td>we</td>\n",
       "      <td>will</td>\n",
       "      <td>go</td>\n",
       "      <td>back</td>\n",
       "      <td>over</td>\n",
       "      <td>the</td>\n",
       "      <td>example</td>\n",
       "      <td>of</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>classifying</td>\n",
       "      <td>movie</td>\n",
       "      <td>reviews</td>\n",
       "      <td>we</td>\n",
       "      <td>studied</td>\n",
       "      <td>in</td>\n",
       "      <td>part</td>\n",
       "      <td>1</td>\n",
       "      <td>and</td>\n",
       "      <td>dig</td>\n",
       "      <td>deeper</td>\n",
       "      <td>under</td>\n",
       "      <td>the</td>\n",
       "      <td>surface</td>\n",
       "      <td>.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>\\n</td>\n",
       "      <td>xxmaj</td>\n",
       "      <td>first</td>\n",
       "      <td>we</td>\n",
       "      <td>will</td>\n",
       "      <td>look</td>\n",
       "      <td>at</td>\n",
       "      <td>the</td>\n",
       "      <td>processing</td>\n",
       "      <td>steps</td>\n",
       "      <td>necessary</td>\n",
       "      <td>to</td>\n",
       "      <td>convert</td>\n",
       "      <td>text</td>\n",
       "      <td>into</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>numbers</td>\n",
       "      <td>and</td>\n",
       "      <td>how</td>\n",
       "      <td>to</td>\n",
       "      <td>customize</td>\n",
       "      <td>it</td>\n",
       "      <td>.</td>\n",
       "      <td>xxmaj</td>\n",
       "      <td>by</td>\n",
       "      <td>doing</td>\n",
       "      <td>this</td>\n",
       "      <td>,</td>\n",
       "      <td>we</td>\n",
       "      <td>'ll</td>\n",
       "      <td>have</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>another</td>\n",
       "      <td>example</td>\n",
       "      <td>of</td>\n",
       "      <td>the</td>\n",
       "      <td>xxmaj</td>\n",
       "      <td>processor</td>\n",
       "      <td>used</td>\n",
       "      <td>in</td>\n",
       "      <td>the</td>\n",
       "      <td>data</td>\n",
       "      <td>block</td>\n",
       "      <td>api</td>\n",
       "      <td>.</td>\n",
       "      <td>\\n</td>\n",
       "      <td>xxmaj</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>then</td>\n",
       "      <td>we</td>\n",
       "      <td>will</td>\n",
       "      <td>study</td>\n",
       "      <td>how</td>\n",
       "      <td>we</td>\n",
       "      <td>build</td>\n",
       "      <td>a</td>\n",
       "      <td>language</td>\n",
       "      <td>model</td>\n",
       "      <td>and</td>\n",
       "      <td>train</td>\n",
       "      <td>it</td>\n",
       "      <td>.</td>\n",
       "      <td>\\n\\n</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "bs,seq_len = 6,15\n",
    "d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])\n",
    "df = pd.DataFrame(d_tokens)\n",
    "display(HTML(df.to_html(index=False,header=None)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then if we have a `bptt` of 5, we would go over those three batches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>xxbos</td>\n",
       "      <td>\\n</td>\n",
       "      <td>xxmaj</td>\n",
       "      <td>in</td>\n",
       "      <td>this</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>classifying</td>\n",
       "      <td>movie</td>\n",
       "      <td>reviews</td>\n",
       "      <td>we</td>\n",
       "      <td>studied</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>\\n</td>\n",
       "      <td>xxmaj</td>\n",
       "      <td>first</td>\n",
       "      <td>we</td>\n",
       "      <td>will</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>numbers</td>\n",
       "      <td>and</td>\n",
       "      <td>how</td>\n",
       "      <td>to</td>\n",
       "      <td>customize</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>another</td>\n",
       "      <td>example</td>\n",
       "      <td>of</td>\n",
       "      <td>the</td>\n",
       "      <td>xxmaj</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>then</td>\n",
       "      <td>we</td>\n",
       "      <td>will</td>\n",
       "      <td>study</td>\n",
       "      <td>how</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>notebook</td>\n",
       "      <td>,</td>\n",
       "      <td>we</td>\n",
       "      <td>will</td>\n",
       "      <td>go</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>in</td>\n",
       "      <td>part</td>\n",
       "      <td>1</td>\n",
       "      <td>and</td>\n",
       "      <td>dig</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>look</td>\n",
       "      <td>at</td>\n",
       "      <td>the</td>\n",
       "      <td>processing</td>\n",
       "      <td>steps</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>it</td>\n",
       "      <td>.</td>\n",
       "      <td>xxmaj</td>\n",
       "      <td>by</td>\n",
       "      <td>doing</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>processor</td>\n",
       "      <td>used</td>\n",
       "      <td>in</td>\n",
       "      <td>the</td>\n",
       "      <td>data</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>we</td>\n",
       "      <td>build</td>\n",
       "      <td>a</td>\n",
       "      <td>language</td>\n",
       "      <td>model</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>back</td>\n",
       "      <td>over</td>\n",
       "      <td>the</td>\n",
       "      <td>example</td>\n",
       "      <td>of</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>deeper</td>\n",
       "      <td>under</td>\n",
       "      <td>the</td>\n",
       "      <td>surface</td>\n",
       "      <td>.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>necessary</td>\n",
       "      <td>to</td>\n",
       "      <td>convert</td>\n",
       "      <td>text</td>\n",
       "      <td>into</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>this</td>\n",
       "      <td>,</td>\n",
       "      <td>we</td>\n",
       "      <td>'ll</td>\n",
       "      <td>have</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>block</td>\n",
       "      <td>api</td>\n",
       "      <td>.</td>\n",
       "      <td>\\n</td>\n",
       "      <td>xxmaj</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>and</td>\n",
       "      <td>train</td>\n",
       "      <td>it</td>\n",
       "      <td>.</td>\n",
       "      <td>\\n\\n</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "bs,bptt = 6,5\n",
    "for k in range(3):\n",
    "    d_tokens = np.array([tokens[i*seq_len + k*bptt:i*seq_len + (k+1)*bptt] for i in range(bs)])\n",
    "    df = pd.DataFrame(d_tokens)\n",
    "    display(HTML(df.to_html(index=False,header=None)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "class LM_PreLoader():\n",
    "    def __init__(self, data, bs=64, bptt=70, shuffle=False):\n",
    "        self.data,self.bs,self.bptt,self.shuffle = data,bs,bptt,shuffle\n",
    "        total_len = sum([len(t) for t in data.x])\n",
    "        self.n_batch = total_len // bs\n",
    "        self.batchify()\n",
    "    \n",
    "    def __len__(self): return ((self.n_batch-1) // self.bptt) * self.bs\n",
    "    \n",
    "    def __getitem__(self, idx):\n",
    "        source = self.batched_data[idx % self.bs]\n",
    "        seq_idx = (idx // self.bs) * self.bptt\n",
    "        return source[seq_idx:seq_idx+self.bptt],source[seq_idx+1:seq_idx+self.bptt+1]\n",
    "    \n",
    "    def batchify(self):\n",
    "        texts = self.data.x\n",
    "        if self.shuffle: texts = texts[torch.randperm(len(texts))]\n",
    "        stream = torch.cat([tensor(t) for t in texts])\n",
    "        self.batched_data = stream[:self.n_batch * self.bs].view(self.bs, self.n_batch)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dl = DataLoader(LM_PreLoader(ll.valid, shuffle=True), batch_size=64)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check it all works ok: `x1`, `y1`, `x2` and `y2` should all be of size `bs`  by `bptt`. The texts in each row of `x1` should continue in `x2`. `y1` and `y2` should have the same texts as their `x` counterpart, shifted of one position to the right."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iter_dl = iter(dl)\n",
    "x1,y1 = next(iter_dl)\n",
    "x2,y2 = next(iter_dl)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x1.size(),y1.size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab = proc_num.vocab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\" \".join(vocab[o] for o in x1[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\" \".join(vocab[o] for o in y1[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\" \".join(vocab[o] for o in x2[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And let's prepare some convenience function to do this quickly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "def get_lm_dls(train_ds, valid_ds, bs, bptt, **kwargs):\n",
    "    return (DataLoader(LM_PreLoader(train_ds, bs, bptt, shuffle=True), batch_size=bs, **kwargs),\n",
    "            DataLoader(LM_PreLoader(valid_ds, bs, bptt, shuffle=False), batch_size=2*bs, **kwargs))\n",
    "\n",
    "def lm_databunchify(sd, bs, bptt, **kwargs):\n",
    "    return DataBunch(*get_lm_dls(sd.train, sd.valid, bs, bptt, **kwargs))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bs,bptt = 64,70\n",
    "data = lm_databunchify(ll, bs, bptt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Batching for classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When we will want to tackle classification, gathering the data will be a bit different: first we will label our texts with the folder they come from, and then we will need to apply padding to batch them together. To avoid mixing very long texts with very short ones, we will also use `Sampler` to sort (with a bit of randomness for the training set) our samples by length.\n",
    "\n",
    "First the data block API calls shold look familiar."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Jump_to lesson 12 video](https://course19.fast.ai/videos/?lesson=12&t=5877)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "proc_cat = CategoryProcessor()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "il = TextList.from_files(path, include=['train', 'test'])\n",
    "sd = SplitData.split_by_func(il, partial(grandparent_splitter, valid_name='test'))\n",
    "ll = label_by_func(sd, parent_labeler, proc_x = [proc_tok, proc_num], proc_y=proc_cat)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pickle.dump(ll, open(path/'ll_clas.pkl', 'wb'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ll = pickle.load(open(path/'ll_clas.pkl', 'rb'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check the labels seem consistent with the texts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "[(ll.train.x_obj(i), ll.train.y_obj(i)) for i in [1,12552]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We saw samplers in notebook 03. For the validation set, we will simply sort the samples by length, and we begin with the longest ones for memory reasons (it's better to always have the biggest tensors first)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "from torch.utils.data import Sampler\n",
    "\n",
    "class SortSampler(Sampler):\n",
    "    def __init__(self, data_source, key): self.data_source,self.key = data_source,key\n",
    "    def __len__(self): return len(self.data_source)\n",
    "    def __iter__(self):\n",
    "        return iter(sorted(list(range(len(self.data_source))), key=self.key, reverse=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the training set, we want some kind of randomness on top of this. So first, we shuffle the texts and build megabatches of size `50 * bs`. We sort those megabatches by length before splitting them in 50 minibatches. That way we will have randomized batches of roughly the same length.\n",
    "\n",
    "Then we make sure to have the biggest batch first and shuffle the order of the other batches. We also make sure the last batch stays at the end because its size is probably lower than batch size."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "class SortishSampler(Sampler):\n",
    "    def __init__(self, data_source, key, bs):\n",
    "        self.data_source,self.key,self.bs = data_source,key,bs\n",
    "\n",
    "    def __len__(self) -> int: return len(self.data_source)\n",
    "\n",
    "    def __iter__(self):\n",
    "        idxs = torch.randperm(len(self.data_source))\n",
    "        megabatches = [idxs[i:i+self.bs*50] for i in range(0, len(idxs), self.bs*50)]\n",
    "        sorted_idx = torch.cat([tensor(sorted(s, key=self.key, reverse=True)) for s in megabatches])\n",
    "        batches = [sorted_idx[i:i+self.bs] for i in range(0, len(sorted_idx), self.bs)]\n",
    "        max_idx = torch.argmax(tensor([self.key(ck[0]) for ck in batches]))  # find the chunk with the largest key,\n",
    "        batches[0],batches[max_idx] = batches[max_idx],batches[0]            # then make sure it goes first.\n",
    "        batch_idxs = torch.randperm(len(batches)-2)\n",
    "        sorted_idx = torch.cat([batches[i+1] for i in batch_idxs]) if len(batches) > 1 else LongTensor([])\n",
    "        sorted_idx = torch.cat([batches[0], sorted_idx, batches[-1]])\n",
    "        return iter(sorted_idx)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Padding: we had the padding token (that as an id of 1) at the end of each sequence to make them all the same size when batching them. Note that we need padding at the end to be able to use `PyTorch` convenience functions that will let us ignore that padding (see 12c)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "def pad_collate(samples, pad_idx=1, pad_first=False):\n",
    "    max_len = max([len(s[0]) for s in samples])\n",
    "    res = torch.zeros(len(samples), max_len).long() + pad_idx\n",
    "    for i,s in enumerate(samples):\n",
    "        if pad_first: res[i, -len(s[0]):] = LongTensor(s[0])\n",
    "        else:         res[i, :len(s[0]) ] = LongTensor(s[0])\n",
    "    return res, tensor([s[1] for s in samples])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bs = 64\n",
    "train_sampler = SortishSampler(ll.train.x, key=lambda t: len(ll.train[int(t)][0]), bs=bs)\n",
    "train_dl = DataLoader(ll.train, batch_size=bs, sampler=train_sampler, collate_fn=pad_collate)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iter_dl = iter(train_dl)\n",
    "x,y = next(iter_dl)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lengths = []\n",
    "for i in range(x.size(0)): lengths.append(x.size(1) - (x[i]==1).sum().item())\n",
    "lengths[:5], lengths[-1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last one is the minimal length. This is the first batch so it has the longest sequence, but if look at the next one that is more random, we see lengths are roughly the sames."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x,y = next(iter_dl)\n",
    "lengths = []\n",
    "for i in range(x.size(0)): lengths.append(x.size(1) - (x[i]==1).sum().item())\n",
    "lengths[:5], lengths[-1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see the padding at the end:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we add a convenience function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "def get_clas_dls(train_ds, valid_ds, bs, **kwargs):\n",
    "    train_sampler = SortishSampler(train_ds.x, key=lambda t: len(train_ds.x[t]), bs=bs)\n",
    "    valid_sampler = SortSampler(valid_ds.x, key=lambda t: len(valid_ds.x[t]))\n",
    "    return (DataLoader(train_ds, batch_size=bs, sampler=train_sampler, collate_fn=pad_collate, **kwargs),\n",
    "            DataLoader(valid_ds, batch_size=bs*2, sampler=valid_sampler, collate_fn=pad_collate, **kwargs))\n",
    "\n",
    "def clas_databunchify(sd, bs, **kwargs):\n",
    "    return DataBunch(*get_clas_dls(sd.train, sd.valid, bs, **kwargs))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bs,bptt = 64,70\n",
    "data = clas_databunchify(ll, bs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python notebook2script.py 12_text.ipynb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}