{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preparing German Wikipedia to train a fast.ai (ULMFiT) model for German\n",
    "(should work with most other languages, too)\n",
    "\n",
    "*Thomas Viehmann <tv@lernapparat.de>*\n",
    "\n",
    "The core idea of [Howard and Ruder's ULMFiT paper](https://arxiv.org/abs/1801.06146), see also https://nlp.fast.ai/, is to pretrain a language model on some corpus.\n",
    "Naturally, we also want such a thing for German. And happily I just launched [MathInf](https://mathinf.eu/), a great mathematical modelling, machine learning and actuarial consulting company, that allows me to do this type of research and make it public.\n",
    "\n",
    "[I have very raw info (and hope to add more description soon) on my blog](https://lernapparat.de/german-lm/). I'm making this available early at public request and hope it is useful to you to build great things, it is not as clean or well-commented I would love it to be, yet.\n",
    "I would love to hear from you if you make good use of it!\n",
    "\n",
    "So we take a wikipedia dump (`de_wikipedia_extracted dewiki-latest-pages-articles.xml.bz2` downloaded from [dumps.wikipedia.org](https://dumps.wikimedia.org/dewiki/latest/) and prepocessed by `wikiextractor/WikiExtractor.py -s --json -o de_wikipedia_extracted dewiki-latest-pages-articles.xml.bz2`) and make token files out of them.\n",
    "\n",
    "Note that the German Wikipedia contains more tokens (i.e. words) than recommended 100M to train the language model.\n",
    "I don't cut off much here, but only do this later when loading the tokens to start the training. That is a bit wasteful and follows a \"keep as much data as long as you can\" approach.\n",
    "\n",
    "Credit for all the good things in the Notebook likely belong to Sylvain Gugger ([see his notebook](https://github.com/sgugger/Deep-Learning/blob/master/Building%20a%20French%20LM.ipynb)) and Jeremy Howard [see the original imdb notebook from his great course](https://github.com/fastai/fastai/blob/master/courses/dl2/imdb.ipynb), whose work I built on, all errors are my own.\n",
    "\n",
    "\n",
    "Enough talk, here is the data preparation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "%reload_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from fastai.text import *\n",
    "import html\n",
    "from matplotlib import pyplot\n",
    "import numpy\n",
    "import time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "BOS = 'xbos'  # beginning-of-sentence tag\n",
    "FLD = 'xfld'  # data field tag\n",
    "\n",
    "LANG='de'\n",
    "datasetpath =  Path('/home/datasets/nlp/wiki/')\n",
    "# I ran this: wikiextractor/WikiExtractor.py -s --json -o de_wikipedia_extracted dewiki-latest-pages-articles.xml.bz2 \n",
    "work_path = Path('~/data/nlp/german_lm/data/de_wiki/tmp/').expanduser()\n",
    "work_path.mkdir(exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Standarize format\n",
    "\n",
    "You can skip this entire section if you like the results. In this case continue at *Tokenize*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "LANG_FILENAMES = [str(f) for f in datasetpath.rglob(\"de_wikipedia_extracted/*/*\")]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(5330,\n",
       " ['/home/datasets/nlp/wiki/de_wikipedia_extracted/AI/wiki_98',\n",
       "  '/home/datasets/nlp/wiki/de_wikipedia_extracted/AI/wiki_35',\n",
       "  '/home/datasets/nlp/wiki/de_wikipedia_extracted/AI/wiki_96',\n",
       "  '/home/datasets/nlp/wiki/de_wikipedia_extracted/AI/wiki_74',\n",
       "  '/home/datasets/nlp/wiki/de_wikipedia_extracted/AI/wiki_29'])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(LANG_FILENAMES), LANG_FILENAMES[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100%|██████████| 5330/5330 [00:28<00:00, 185.25it/s]\n"
     ]
    }
   ],
   "source": [
    "LANG_TEXT = []\n",
    "for fn in tqdm(LANG_FILENAMES):\n",
    "    for line in open(fn, encoding='utf8'):\n",
    "        LANG_TEXT.append(json.loads(line))\n",
    "        \n",
    "LANG_TEXT = pd.DataFrame(LANG_TEXT)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>text</th>\n",
       "      <th>title</th>\n",
       "      <th>url</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>434091</td>\n",
       "      <td>Homberg (bei Lauterecken)\\n\\nHomberg ist eine ...</td>\n",
       "      <td>Homberg (bei Lauterecken)</td>\n",
       "      <td>https://de.wikipedia.org/wiki?curid=434091</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>434093</td>\n",
       "      <td>Nicole Petignat\\n\\nNicole Petignat (* 27. Okto...</td>\n",
       "      <td>Nicole Petignat</td>\n",
       "      <td>https://de.wikipedia.org/wiki?curid=434093</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>434098</td>\n",
       "      <td>Schwezow\\n\\nSchwezow ist der Familienname folg...</td>\n",
       "      <td>Schwezow</td>\n",
       "      <td>https://de.wikipedia.org/wiki?curid=434098</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>434102</td>\n",
       "      <td>Schukow\\n\\nSchukow (russ. \"Жуков\") bzw. Schuko...</td>\n",
       "      <td>Schukow</td>\n",
       "      <td>https://de.wikipedia.org/wiki?curid=434102</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>434112</td>\n",
       "      <td>Otto Schmidt\\n\\nOtto Schmidt ist der Name folg...</td>\n",
       "      <td>Otto Schmidt</td>\n",
       "      <td>https://de.wikipedia.org/wiki?curid=434112</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       id                                               text  \\\n",
       "0  434091  Homberg (bei Lauterecken)\\n\\nHomberg ist eine ...   \n",
       "1  434093  Nicole Petignat\\n\\nNicole Petignat (* 27. Okto...   \n",
       "2  434098  Schwezow\\n\\nSchwezow ist der Familienname folg...   \n",
       "3  434102  Schukow\\n\\nSchukow (russ. \"Жуков\") bzw. Schuko...   \n",
       "4  434112  Otto Schmidt\\n\\nOtto Schmidt ist der Name folg...   \n",
       "\n",
       "                       title                                         url  \n",
       "0  Homberg (bei Lauterecken)  https://de.wikipedia.org/wiki?curid=434091  \n",
       "1            Nicole Petignat  https://de.wikipedia.org/wiki?curid=434093  \n",
       "2                   Schwezow  https://de.wikipedia.org/wiki?curid=434098  \n",
       "3                    Schukow  https://de.wikipedia.org/wiki?curid=434102  \n",
       "4               Otto Schmidt  https://de.wikipedia.org/wiki?curid=434112  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "LANG_TEXT.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Getting rid of the title name in the text field\n",
    "def split_title_from_text(text):\n",
    "    words = text.split(\"\\n\\n\", 1)\n",
    "    if len(words) == 2:\n",
    "        return words[1]\n",
    "    else:\n",
    "        return words[0]\n",
    "    \n",
    "LANG_TEXT['text'] = LANG_TEXT['text'].apply(lambda x: split_title_from_text(x))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>text</th>\n",
       "      <th>title</th>\n",
       "      <th>url</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>434091</td>\n",
       "      <td>Homberg ist eine Ortsgemeinde im Landkreis Kus...</td>\n",
       "      <td>Homberg (bei Lauterecken)</td>\n",
       "      <td>https://de.wikipedia.org/wiki?curid=434091</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>434093</td>\n",
       "      <td>Nicole Petignat (* 27. Oktober 1966 in La Chau...</td>\n",
       "      <td>Nicole Petignat</td>\n",
       "      <td>https://de.wikipedia.org/wiki?curid=434093</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>434098</td>\n",
       "      <td>Schwezow ist der Familienname folgender Person...</td>\n",
       "      <td>Schwezow</td>\n",
       "      <td>https://de.wikipedia.org/wiki?curid=434098</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>434102</td>\n",
       "      <td>Schukow (russ. \"Жуков\") bzw. Schukowa (weiblic...</td>\n",
       "      <td>Schukow</td>\n",
       "      <td>https://de.wikipedia.org/wiki?curid=434102</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>434112</td>\n",
       "      <td>Otto Schmidt ist der Name folgender Personen:\\...</td>\n",
       "      <td>Otto Schmidt</td>\n",
       "      <td>https://de.wikipedia.org/wiki?curid=434112</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       id                                               text  \\\n",
       "0  434091  Homberg ist eine Ortsgemeinde im Landkreis Kus...   \n",
       "1  434093  Nicole Petignat (* 27. Oktober 1966 in La Chau...   \n",
       "2  434098  Schwezow ist der Familienname folgender Person...   \n",
       "3  434102  Schukow (russ. \"Жуков\") bzw. Schukowa (weiblic...   \n",
       "4  434112  Otto Schmidt ist der Name folgender Personen:\\...   \n",
       "\n",
       "                       title                                         url  \n",
       "0  Homberg (bei Lauterecken)  https://de.wikipedia.org/wiki?curid=434091  \n",
       "1            Nicole Petignat  https://de.wikipedia.org/wiki?curid=434093  \n",
       "2                   Schwezow  https://de.wikipedia.org/wiki?curid=434098  \n",
       "3                    Schukow  https://de.wikipedia.org/wiki?curid=434102  \n",
       "4               Otto Schmidt  https://de.wikipedia.org/wiki?curid=434112  "
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "LANG_TEXT.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Determine article lengths and only keep at most the largest million and only those with at least 2000 characters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "LANG_TEXT['label'] = 0 # dummy\n",
    "LANG_TEXT['length'] = LANG_TEXT['text'].str.len()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "MAX_ARTICLES = 1_000_000\n",
    "# keep at most 1 million articles and only those of more than 2000 characters\n",
    "MIN_LENGTH_CHARS = max(2000, int(numpy.percentile(LANG_TEXT['length'], 100-min(100*MAX_ARTICLES/len(LANG_TEXT), 100))))\n",
    "LANG_TEXT = LANG_TEXT[LANG_TEXT['length'] >= MIN_LENGTH_CHARS] # Chars not words...\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "LANG_TEXT.to_csv(datasetpath/'wiki_de.csv', header=True, index=False) # I must say, I think the header is good! If in doubt, you should listen to Jeremy though."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "LANG_TEXT = pd.read_csv(datasetpath/'wiki_de.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Article length percentiles 0%: 2000, 10%: 2197, 20%: 2427, 30%: 2699, 40%: 3033, 50%: 3458, 60%: 4030, 70%: 4857, 80%: 6258, 90%: 9442, 100%: 463748\n",
      "Number of articles 738850\n"
     ]
    }
   ],
   "source": [
    "percentages = range(0,110,10)\n",
    "print ('Article length percentiles' , ', '.join(['{}%: {}'.format(p, int(q))  for p,q in zip(percentages, numpy.percentile(LANG_TEXT['length'], percentages))]))\n",
    "print ('Number of articles', len(LANG_TEXT))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>text</th>\n",
       "      <th>title</th>\n",
       "      <th>url</th>\n",
       "      <th>label</th>\n",
       "      <th>length</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>434114</td>\n",
       "      <td>Carl Jacob Burckhardt (* 10. September 1891 in...</td>\n",
       "      <td>Carl Jacob Burckhardt</td>\n",
       "      <td>https://de.wikipedia.org/wiki?curid=434114</td>\n",
       "      <td>0</td>\n",
       "      <td>5021</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>434117</td>\n",
       "      <td>Roscheid ist eine Ortsgemeinde im Eifelkreis B...</td>\n",
       "      <td>Roscheid</td>\n",
       "      <td>https://de.wikipedia.org/wiki?curid=434117</td>\n",
       "      <td>0</td>\n",
       "      <td>2562</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>434118</td>\n",
       "      <td>Reipeldingen ist eine Ortsgemeinde im westlich...</td>\n",
       "      <td>Reipeldingen</td>\n",
       "      <td>https://de.wikipedia.org/wiki?curid=434118</td>\n",
       "      <td>0</td>\n",
       "      <td>2228</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>434122</td>\n",
       "      <td>Lichtenborn ist eine Ortsgemeinde im Eifelkrei...</td>\n",
       "      <td>Lichtenborn</td>\n",
       "      <td>https://de.wikipedia.org/wiki?curid=434122</td>\n",
       "      <td>0</td>\n",
       "      <td>2776</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>434123</td>\n",
       "      <td>Leidenborn ist eine Ortsgemeinde im Eifelkreis...</td>\n",
       "      <td>Leidenborn</td>\n",
       "      <td>https://de.wikipedia.org/wiki?curid=434123</td>\n",
       "      <td>0</td>\n",
       "      <td>2164</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       id                                               text  \\\n",
       "0  434114  Carl Jacob Burckhardt (* 10. September 1891 in...   \n",
       "1  434117  Roscheid ist eine Ortsgemeinde im Eifelkreis B...   \n",
       "2  434118  Reipeldingen ist eine Ortsgemeinde im westlich...   \n",
       "3  434122  Lichtenborn ist eine Ortsgemeinde im Eifelkrei...   \n",
       "4  434123  Leidenborn ist eine Ortsgemeinde im Eifelkreis...   \n",
       "\n",
       "                   title                                         url  label  \\\n",
       "0  Carl Jacob Burckhardt  https://de.wikipedia.org/wiki?curid=434114      0   \n",
       "1               Roscheid  https://de.wikipedia.org/wiki?curid=434117      0   \n",
       "2           Reipeldingen  https://de.wikipedia.org/wiki?curid=434118      0   \n",
       "3            Lichtenborn  https://de.wikipedia.org/wiki?curid=434122      0   \n",
       "4             Leidenborn  https://de.wikipedia.org/wiki?curid=434123      0   \n",
       "\n",
       "   length  \n",
       "0    5021  \n",
       "1    2562  \n",
       "2    2228  \n",
       "3    2776  \n",
       "4    2164  "
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#LANG_TEXT = LANG_TEXT.sort_values(by=['length'], ascending=False)\n",
    "LANG_TEXT.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Splitting 10% for validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_trn,df_val = sklearn.model_selection.train_test_split(LANG_TEXT.pipe(lambda x: x[['label', 'text']]), test_size=0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_trn.to_csv(work_path/'train.csv', header=False, index=False)\n",
    "df_val.to_csv(work_path/'valid.csv', header=False, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I'm always trying to produce notebooks that you can run through in one go, so here is my attempt at getting rid of old stuff."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "del LANG_TEXT\n",
    "import gc\n",
    "gc.collect()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tokenize\n",
    "\n",
    "Note: be sure to care for your memory. I had all my memory allocated (for having several wikipedia copies in memory) and was swapping massively with the multiprocessing tokenization. My fix was to restart the notebook after after I had finished the above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "chunksize = 4000\n",
    "N_CPUS = num_cpus() # I like to use all cores here, needs a patch to fast ai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "re1 = re.compile(r'  +')\n",
    "\n",
    "def fixup(x):\n",
    "    x = x.replace('#39;', \"'\").replace('amp;', '&').replace('#146;', \"'\").replace(\n",
    "        'nbsp;', ' ').replace('#36;', '$').replace('\\\\n', \"\\n\").replace('quot;', \"'\").replace(\n",
    "        '<br />', \"\\n\").replace('\\\\\"', '\"').replace('<unk>','u_n').replace(' @.@ ','.').replace(\n",
    "        ' @-@ ','-').replace('\\\\', ' \\\\ ')\n",
    "    return re1.sub(' ', html.unescape(x))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_trn = pd.read_csv(work_path/'train.csv', header=None, chunksize=chunksize)\n",
    "df_val = pd.read_csv(work_path/'valid.csv', header=None, chunksize=chunksize)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_texts(df, n_lbls=1):\n",
    "    labels = df.iloc[:,range(n_lbls)].values.astype(np.int64)\n",
    "    texts = f'\\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)\n",
    "    for i in range(n_lbls+1, len(df.columns)): texts += f' {FLD} {i-n_lbls} ' + df[i].astype(str)\n",
    "    texts = texts.apply(fixup).values.astype(str)\n",
    "    #tok = Tokenizer.proc_all(texts, lang=LANG) # use this if you have memory trouble\n",
    "    tok = Tokenizer.proc_all_mp(partition(texts, (len(texts)+N_CPUS-1)//N_CPUS), lang=LANG, ncpus=N_CPUS)\n",
    "    return tok, list(labels)\n",
    "\n",
    "def get_all(df, name, n_lbls=1):\n",
    "    time_start = time.time()\n",
    "    for i, r in enumerate(df):\n",
    "        print(\"\\r\", i, end=\" \")\n",
    "        if i > 0:\n",
    "            print ('time per chunk {}s'.format(int((time.time() - time_start) / i)), end=\"\")\n",
    "        tok_, labels_ = get_texts(r, n_lbls)\n",
    "        #save the partial tokens instead of regrouping them in one big array.\n",
    "        np.save(work_path/f'{name}_tok{i}.npy', tok_)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 34 time per chunk 32s"
     ]
    }
   ],
   "source": [
    "get_all(df_trn,'trn',1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 18 time per chunk 31s"
     ]
    },
    {
     "ename": "NameError",
     "evalue": "name 'tok' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-14-f8776495fb2d>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mget_all\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf_val\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m'val'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m<ipython-input-12-808cc3b059f9>\u001b[0m in \u001b[0;36mget_all\u001b[0;34m(df, name, n_lbls)\u001b[0m\n\u001b[1;32m     17\u001b[0m         \u001b[0;31m#save the partial tokens instead of regrouping them in one big array.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     18\u001b[0m         \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msave\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mwork_path\u001b[0m\u001b[0;34m/\u001b[0m\u001b[0;34mf'{name}_tok{i}.npy'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtok_\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 19\u001b[0;31m     \u001b[0;32mreturn\u001b[0m \u001b[0mtok\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlabels\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mNameError\u001b[0m: name 'tok' is not defined"
     ]
    }
   ],
   "source": [
    "get_all(df_val,'val',1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Numericalize"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the Counter object from all the splitted files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def count_them_all(names):\n",
    "    cnt = Counter()\n",
    "    for name in names:\n",
    "        for file in work_path.glob(f'{name}_tok*'):\n",
    "            tok = np.load(file)\n",
    "            cnt_tok = Counter(word for sent in tok for word in sent)\n",
    "            cnt += cnt_tok\n",
    "    return cnt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "cnt = count_them_all(['trn'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('.', 32264518),\n",
       " (',', 22918938),\n",
       " ('der', 19208785),\n",
       " ('die', 15844609),\n",
       " ('und', 13947739),\n",
       " ('in', 10617819),\n",
       " ('\"', 9494449),\n",
       " ('\\n\\n', 7067205),\n",
       " ('von', 6913635),\n",
       " ('den', 5662917),\n",
       " ('im', 5452056),\n",
       " ('des', 4907024),\n",
       " ('das', 4734075),\n",
       " ('mit', 4716315),\n",
       " (')', 4682017),\n",
       " ('(', 4665298),\n",
       " ('\\n', 4523044),\n",
       " ('er', 3776986),\n",
       " ('dem', 3722644),\n",
       " ('als', 3612743),\n",
       " ('wurde', 3601000),\n",
       " ('zu', 3561187),\n",
       " ('auf', 3324672),\n",
       " ('eine', 3083051),\n",
       " ('für', 3066354)]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cnt.most_common(25)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_vocab = 60000\n",
    "min_freq = 5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "itos = [o for o,c in cnt.most_common(max_vocab) if c > min_freq]\n",
    "itos.insert(0,'_pad_')\n",
    "itos.insert(0,'_unk_')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(itos)\n",
    "pickle.dump(itos, open(work_path/'itos.pkl', 'wb'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "stoi = collections.defaultdict(int,{s:i for (i,s) in enumerate(itos)})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Numericalize each partial file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def numericalize(name):\n",
    "    results = []\n",
    "    for file in tqdm(work_path.glob(f'{name}_tok*')):\n",
    "        tok = np.load(file)\n",
    "        results.append(np.array([[stoi[word] for word in sent] for sent in tok]))\n",
    "    return np.concatenate(results)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "166it [03:04,  1.11s/it]\n"
     ]
    }
   ],
   "source": [
    "trn_ids = numericalize('trn')\n",
    "np.save(work_path/'trn_ids.npy', trn_ids)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "19it [00:25,  1.33s/it]\n"
     ]
    }
   ],
   "source": [
    "val_ids = numericalize('val')\n",
    "np.save(work_path/'val_ids.npy', val_ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So now you have gread dumps to use with the [training program I published on my blog](https://lernapparat.de/german-lm/).\n",
    "\n",
    "As always, I would be honored by your feedback at <tv@lernapparat.de>. I read and appreciate every mail.\n",
    "\n",
    "*Thomas*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5rc1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}