{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Train model with word2vec\n", "##### See [training.py](https://github.com/devmount/GermanWordEmbeddings/blob/master/training.py) from [GermanWordEmbeddings](https://devmount.github.io/GermanWordEmbeddings/)\n", "\n", "The following code shows how to train a word vector model with the training script. The script requires [gensim](https://radimrehurek.com/gensim/install.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### General usage\n", "The script's usage is printed with the standard `-h` or `--help` flag:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "usage: training.py [-h] [-s SIZE] [-w WINDOW] [-m MINCOUNT] [-c WORKERS] [-g SG] [-i HS] [-n NEGATIVE] [-o CBOWMEAN]\n", " corpora target\n", "\n", "Script for training word vector models using public corpora\n", "\n", "positional arguments:\n", " corpora source folder with preprocessed corpora (one sentence plain text per line in each file)\n", " target target file name to store model in\n", "\n", "optional arguments:\n", " -h, --help show this help message and exit\n", " -s SIZE, --size SIZE dimension of word vectors\n", " -w WINDOW, --window WINDOW\n", " size of the sliding window\n", " -m MINCOUNT, --mincount MINCOUNT\n", " minimum number of occurences of a word to be considered\n", " -c WORKERS, --workers WORKERS\n", " number of worker threads to train the model\n", " -g SG, --sg SG training algorithm: Skip-Gram (1), otherwise CBOW (0)\n", " -i HS, --hs HS use of hierachical sampling for training\n", " -n NEGATIVE, --negative NEGATIVE\n", " use of negative sampling for training (usually between 5-20)\n", " -o CBOWMEAN, --cbowmean CBOWMEAN\n", " for CBOW training algorithm: use sum (0) or mean (1) to merge context vectors\n" ] } ], "source": [ "%%bash\n", "python training.py --help" ] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training example\n", "The example corpus from the preprocessing section is now used to train a small word vector model with a vector size of 300, a window size of 5, negative sampling with 10 samples, and a minimum word count of 5:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2015-07-12 00:28:30,816 : INFO : collecting all words and their counts\n", "2015-07-12 00:28:30,817 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2015-07-12 00:28:30,822 : INFO : collected 6024 word types from a corpus of 9415 words and 1035 sentences\n", "2015-07-12 00:28:30,824 : INFO : min_count=5 retains 217 unique words (drops 5807)\n", "2015-07-12 00:28:30,825 : INFO : min_count leaves 2101 word corpus (22% of original 9415)\n", "2015-07-12 00:28:30,825 : INFO : deleting the raw counts dictionary of 6024 items\n", "2015-07-12 00:28:30,826 : INFO : sample=0 downsamples 0 most-common words\n", "2015-07-12 00:28:30,826 : INFO : downsampling leaves estimated 2101 word corpus (100.0% of prior 2101)\n", "2015-07-12 00:28:30,826 : INFO : estimated required memory for 217 words and 300 dimensions: 933100 bytes\n", "2015-07-12 00:28:30,826 : INFO : constructing a huffman tree from 217 words\n", "2015-07-12 00:28:30,831 : INFO : built huffman tree with maximum node depth 9\n", "2015-07-12 00:28:30,831 : INFO : resetting layer weights\n", "2015-07-12 00:28:30,835 : INFO : training model with 4 workers on 217 vocabulary and 300 features, using sg=1 hs=1 sample=0 and negative=10\n", "2015-07-12 00:28:30,841 : INFO : reached end of input; waiting to finish 11 outstanding jobs\n", "2015-07-12 00:28:30,882 : INFO : training on 2101 words took 0.0s, 45491 words/s\n", "2015-07-12 00:28:30,882 : INFO : storing 217x300 projection weights into my.model\n" ] } ], "source": [ "%%bash\n", "python training.py 
corpus/ my.model -s 300 -w 5 -n 10 -m 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting model can now be used for evaluation and visualization." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }