{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sebastian Raschka 30/05/2015 \n",
"\n",
"CPython 3.4.3\n",
"IPython 3.1.0\n",
"\n",
"scikit-learn 0.16.1\n",
"joblib 0.8.4\n",
"numpy 1.9.2\n",
"nltk 3.0.0\n"
]
}
],
"source": [
"%load_ext watermark\n",
"%watermark -a 'Sebastian Raschka' -d -v -p scikit-learn,joblib,numpy,nltk"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Out-of-core Learning and Model Persistence using scikit-learn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "When we apply machine learning algorithms to real-world applications, our computer hardware often still constitutes the major bottleneck of the learning process. Of course, there are solutions such as supercomputers, Amazon EC2, Apache Spark, etc. However, out-of-core learning via stochastic gradient descent can still be attractive if we want to update our model on the fly (\"online learning\"), and in this notebook, I want to provide some examples of how we can implement an \"out-of-core\" approach using scikit-learn. \n",
    "I compiled the following code examples for personal reference; it is not intended to be a comprehensive reference for the underlying theory. Nonetheless, I decided to share it, since it may be useful to one or the other!"
]
},
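   {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the idea (not part of the workflow below, and using purely synthetic data), out-of-core learning with scikit-learn's `SGDClassifier` and its `partial_fit` method could look like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.linear_model import SGDClassifier\n",
    "\n",
    "# logistic regression trained incrementally via stochastic gradient descent\n",
    "clf = SGDClassifier(loss='log')\n",
    "\n",
    "# simulate a stream of mini-batches; with partial_fit, all class labels\n",
    "# must be declared up front since not every batch may contain every class\n",
    "rng = np.random.RandomState(0)\n",
    "for _ in range(10):\n",
    "    X_batch = rng.randn(100, 5)                  # toy feature batch\n",
    "    y_batch = (X_batch[:, 0] > 0).astype(int)    # toy binary labels\n",
    "    clf.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))"
   ]
  },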
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "<br>\n",
    "<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sections\n",
"\n",
"- [The IMDb Movie Review Dataset](#The-IMDb-Movie-Review-Dataset)\n",
"- [Preprocessing Text Data](#Preprocessing-Text-Data)\n",
"- [Out-of-core learning](#Out-of-core-learning)\n",
"- [Model Persistence](#Model-Persistence)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "<br>\n",
    "<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The IMDb Movie Review Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "In this section, we will train a simple logistic regression model to classify movie reviews from the 50k IMDb review dataset collected by Maas et al.\n",
    "\n",
    "> AL Maas, RE Daly, PT Pham, D Huang, AY Ng, and C Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics\n",
    "\n",
    "[Source: http://ai.stanford.edu/~amaas/data/sentiment/]\n",
    "\n",
    "The dataset consists of 50,000 movie reviews from the original \"train\" and \"test\" subdirectories. The class labels are binary (1 = positive, 0 = negative), with 25,000 positive and 25,000 negative movie reviews, respectively.\n",
    "For simplicity, I assembled the reviews in a single CSV file.\n"
]
},
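   {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of how such a CSV file could be assembled from the unpacked archive: the download unpacks into `train`/`test` subdirectories with `pos`/`neg` review folders. The directory layout and the output filename `movie_data.csv` below are assumptions for illustration, not necessarily what was used here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "\n",
    "# hypothetical path to the unpacked archive\n",
    "basepath = 'aclImdb'\n",
    "labels = {'pos': 1, 'neg': 0}\n",
    "\n",
    "rows = []\n",
    "for subset in ('train', 'test'):\n",
    "    for label in ('pos', 'neg'):\n",
    "        path = os.path.join(basepath, subset, label)\n",
    "        for fname in sorted(os.listdir(path)):\n",
    "            with open(os.path.join(path, fname)) as infile:\n",
    "                rows.append([infile.read(), labels[label], subset])\n",
    "\n",
    "df = pd.DataFrame(rows, columns=['review', 'sentiment', 'set'])\n",
    "df.to_csv('movie_data.csv', index=False)  # hypothetical output filename"
   ]
  },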
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>review</th>\n",
       "      <th>sentiment</th>\n",
       "      <th>set</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>49995</th>\n",
       "      <td>Towards the end of the movie, I felt it was to...</td>\n",
       "      <td>0</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49996</th>\n",
       "      <td>This is the kind of movie that my enemies cont...</td>\n",
       "      <td>0</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49997</th>\n",
       "      <td>I saw 'Descent' last night at the Stockholm Fi...</td>\n",
       "      <td>0</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49998</th>\n",
       "      <td>Some films that you pick up for a pound turn o...</td>\n",
       "      <td>0</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49999</th>\n",
       "      <td>This is one of the dumbest films, I've ever se...</td>\n",
       "      <td>0</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",