{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Train model with word2vec\n",
    "##### See [training.py](https://github.com/devmount/GermanWordEmbeddings/blob/master/training.py) from [GermanWordEmbeddings](https://devmount.github.io/GermanWordEmbeddings/)\n",
    "\n",
    "The following code gives an example of how to train a language model with the training script. You need [gensim](https://radimrehurek.com/gensim/install.html) for this script to work."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### General usage\n",
    "The usage of the script can be seen with the default `-h` or `--help` flag:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usage: training.py [-h] [-s SIZE] [-w WINDOW] [-m MINCOUNT] [-c WORKERS] [-g SG] [-i HS] [-n NEGATIVE] [-o CBOWMEAN]\n",
      "                   corpora target\n",
      "\n",
      "Script for training word vector models using public corpora\n",
      "\n",
      "positional arguments:\n",
      "  corpora               source folder with preprocessed corpora (one sentence plain text per line in each file)\n",
      "  target                target file name to store model in\n",
      "\n",
      "optional arguments:\n",
      "  -h, --help            show this help message and exit\n",
      "  -s SIZE, --size SIZE  dimension of word vectors\n",
      "  -w WINDOW, --window WINDOW\n",
      "                        size of the sliding window\n",
      "  -m MINCOUNT, --mincount MINCOUNT\n",
      "                        minimum number of occurences of a word to be considered\n",
      "  -c WORKERS, --workers WORKERS\n",
      "                        number of worker threads to train the model\n",
      "  -g SG, --sg SG        training algorithm: Skip-Gram (1), otherwise CBOW (0)\n",
      "  -i HS, --hs HS        use of hierachical sampling for training\n",
      "  -n NEGATIVE, --negative NEGATIVE\n",
      "                        use of negative sampling for training (usually between 5-20)\n",
      "  -o CBOWMEAN, --cbowmean CBOWMEAN\n",
      "                        for CBOW training algorithm: use sum (0) or mean (1) to merge context vectors\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "python training.py --help"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training example\n",
    "The example corpus from the preprocessing section is now used to train a small language model with vector size of 300, window size of 5, negative sampling with 10 samples and minimum word occurences of 5:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2015-07-12 00:28:30,816 : INFO : collecting all words and their counts\n",
      "2015-07-12 00:28:30,817 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n",
      "2015-07-12 00:28:30,822 : INFO : collected 6024 word types from a corpus of 9415 words and 1035 sentences\n",
      "2015-07-12 00:28:30,824 : INFO : min_count=5 retains 217 unique words (drops 5807)\n",
      "2015-07-12 00:28:30,825 : INFO : min_count leaves 2101 word corpus (22% of original 9415)\n",
      "2015-07-12 00:28:30,825 : INFO : deleting the raw counts dictionary of 6024 items\n",
      "2015-07-12 00:28:30,826 : INFO : sample=0 downsamples 0 most-common words\n",
      "2015-07-12 00:28:30,826 : INFO : downsampling leaves estimated 2101 word corpus (100.0% of prior 2101)\n",
      "2015-07-12 00:28:30,826 : INFO : estimated required memory for 217 words and 300 dimensions: 933100 bytes\n",
      "2015-07-12 00:28:30,826 : INFO : constructing a huffman tree from 217 words\n",
      "2015-07-12 00:28:30,831 : INFO : built huffman tree with maximum node depth 9\n",
      "2015-07-12 00:28:30,831 : INFO : resetting layer weights\n",
      "2015-07-12 00:28:30,835 : INFO : training model with 4 workers on 217 vocabulary and 300 features, using sg=1 hs=1 sample=0 and negative=10\n",
      "2015-07-12 00:28:30,841 : INFO : reached end of input; waiting to finish 11 outstanding jobs\n",
      "2015-07-12 00:28:30,882 : INFO : training on 2101 words took 0.0s, 45491 words/s\n",
      "2015-07-12 00:28:30,882 : INFO : storing 217x300 projection weights into my.model\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "python training.py corpus/ my.model -s 300 -w 5 -n 10 -m 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The resulting model can be used now for evaluation and visualization."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}