{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebooks is an experiment to see if a pure scikit-learn implementation of the fastText model can work better than a linear model on a small text classification problem: 20 newsgroups.\n", "\n", "http://arxiv.org/abs/1607.01759\n", "\n", "Those models are very similar to Deep Averaging Network (with only 1 hidden layer with a linear activation function):\n", "\n", "https://www.cs.umd.edu/~miyyer/pubs/2015_acl_dan.pdf\n", "\n", "\n", "Note that scikit-learn does not provide a hierarchical softmax implementation (but we don't need it on 20 newsgroups anyways)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.feature_extraction.text import HashingVectorizer\n", "\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "twentyng_train = fetch_20newsgroups(\n", " subset='train',\n", " #remove=('headers', 'footers'),\n", ")\n", "docs_train, target_train = twentyng_train.data, twentyng_train.target\n", "\n", "\n", "twentyng_test = fetch_20newsgroups(\n", " subset='test',\n", " #remove=('headers', 'footers'),\n", ")\n", "\n", "docs_test, target_test = twentyng_test.data, twentyng_test.target" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "262144" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "2 ** 18" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following uses the hashing tricks on unigrams and bigrams. `binary=True` makes us ignore repeated words in a document. The `l1` normalization ensures that we \"average\" the embeddings of the tokens in the document instead of summing them." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 16.8 s, sys: 116 ms, total: 16.9 s\n", "Wall time: 16.9 s\n" ] } ], "source": [ "%%time\n", "vec = HashingVectorizer(\n", " encoding='latin-1', binary=True, ngram_range=(1, 2),\n", " norm='l1', n_features=2 ** 18)\n", "\n", "X_train = vec.transform(docs_train)\n", "X_test = vec.transform(docs_test)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0., 0., 0., ..., 0., 0., 0.],\n", " [ 0., 0., 0., ..., 0., 0., 0.],\n", " [ 0., 0., 0., ..., 0., 0., 0.]])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_doc_vectors = X_train[:3].toarray()\n", "first_doc_vectors" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0., 0., 0.])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_doc_vectors.min(axis=1)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0.0049505 , 0.00469484, 0.00200401])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_doc_vectors.max(axis=1)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 1., 1., 1.])" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_doc_vectors.sum(axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Baseline: OvR logistic regression (the multinomial logistic regression loss is currently not implemented in scikit-learn). In practice, the OvR reduction seems to work well enough." ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 46s, sys: 6.69 s, total: 1min 53s\n", "Wall time: 11.1 s\n" ] } ], "source": [ "%%time\n", "from sklearn.linear_model import SGDClassifier\n", "\n", "lr = SGDClassifier(loss='log', alpha=1e-10, n_iter=50, n_jobs=-1)\n", "lr.fit(X_train, target_train)" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train score: 1.000\n", "test score: 0.827\n", "CPU times: user 588 ms, sys: 289 ms, total: 877 ms\n", "Wall time: 602 ms\n" ] } ], "source": [ "%%time\n", "print(\"train score: %0.3f\" % lr.score(X_train, target_train))\n", "print(\"test score: %0.3f\" % lr.score(X_test, target_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now use the MLPClassifier of scikit-learn to add a single hidden layer with a small number of hidden units.\n", "\n", "Note: instead of tanh or relu we would rather like to use a linear / identity activation function for the hidden layer but this is not (yet) implemented in scikit-learn.\n", "\n", "In that respect the following model is closer to a Deep Averaging Network (without dropout) than fastText." 
] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Iteration 1, loss = 2.94108225\n", "Validation score: 0.464664\n", "Iteration 2, loss = 2.49072336\n", "Validation score: 0.639576\n", "Iteration 3, loss = 1.63266821\n", "Validation score: 0.810954\n", "Iteration 4, loss = 0.90327443\n", "Validation score: 0.869258\n", "Iteration 5, loss = 0.48531751\n", "Validation score: 0.893993\n", "Iteration 6, loss = 0.27329257\n", "Validation score: 0.909894\n", "Iteration 7, loss = 0.16704835\n", "Validation score: 0.911661\n", "Iteration 8, loss = 0.11122343\n", "Validation score: 0.918728\n", "Iteration 9, loss = 0.07885910\n", "Validation score: 0.918728\n", "Iteration 10, loss = 0.05876991\n", "Validation score: 0.924028\n", "Iteration 11, loss = 0.04566916\n", "Validation score: 0.920495\n", "Iteration 12, loss = 0.03644058\n", "Validation score: 0.915194\n", "Iteration 13, loss = 0.02982519\n", "Validation score: 0.922261\n", "Validation score did not improve more than tol=0.000100 for two consecutive epochs. Stopping.\n", "CPU times: user 1min 21s, sys: 187 ms, total: 1min 21s\n", "Wall time: 1min 21s\n" ] } ], "source": [ "%%time\n", "from sklearn.neural_network import MLPClassifier\n", "\n", "mlp = MLPClassifier(algorithm='adam', learning_rate_init=0.01,\n", " hidden_layer_sizes=10, max_iter=100, activation='tanh', verbose=100,\n", " early_stopping=True, validation_fraction=0.05, alpha=1e-10)\n", "mlp.fit(X_train, target_train)" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train score: 0.996\n", "test score: 0.801\n", "CPU times: user 304 ms, sys: 54 µs, total: 304 ms\n", "Wall time: 303 ms\n" ] } ], "source": [ "%%time\n", "print(\"train score: %0.3f\" % mlp.score(X_train, target_train))\n", "print(\"test score: %0.3f\" % mlp.score(X_test, target_test))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.0" } }, "nbformat": 4, "nbformat_minor": 0 }