{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Classifying Toxic Comments\n", "\n", "by: Keith Qu\n", "\n", "Some natural language classification of toxic comments using logistic regression and also keras (running on tensorflow). This is a very broad run through ranging from basic linear methods, to modified linear methods, to deep learning.\n", "\n", "Methods include logistic regression, NB-SVM, and CNN and RNN (bidirectional LSTM) with Keras on a TensorFlow backend." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import re, string\n", "from collections import Counter\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, auc\n", "from scipy import sparse\n", "from sklearn.model_selection import train_test_split\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First look with logit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Toxicity categories:\n", "\n", "The labels are fairly self-explanatory, but it also seems like there should be a lot of overlap between the categories, since everything that qualifies as severe_toxic, obscene, threat, insult or identity_hate should also be regular toxic as well.\n", "\n", "It makes sense that the categories are not exclusive. So we can treat them like 6 different classification problems." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train = pd.read_csv('train.csv')\n", "test = pd.read_csv('test.csv')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcomment_texttoxicsevere_toxicobscenethreatinsultidentity_hate
159566ffe987279560d7ff\":::::And for the second time of asking, when ...000000
159567ffea4adeee384e90You should be ashamed of yourself \\n\\nThat is ...000000
159568ffee36eab5c267c9Spitzer \\n\\nUmm, theres no actual article for ...000000
159569fff125370e4aaaf3And it looks like it was actually you who put ...000000
159570fff46fc426af1f9a\"\\nAnd ... I really don't think you understand...000000
\n", "
" ], "text/plain": [ " id comment_text \\\n", "159566 ffe987279560d7ff \":::::And for the second time of asking, when ... \n", "159567 ffea4adeee384e90 You should be ashamed of yourself \\n\\nThat is ... \n", "159568 ffee36eab5c267c9 Spitzer \\n\\nUmm, theres no actual article for ... \n", "159569 fff125370e4aaaf3 And it looks like it was actually you who put ... \n", "159570 fff46fc426af1f9a \"\\nAnd ... I really don't think you understand... \n", "\n", " toxic severe_toxic obscene threat insult identity_hate \n", "159566 0 0 0 0 0 0 \n", "159567 0 0 0 0 0 0 \n", "159568 0 0 0 0 0 0 \n", "159569 0 0 0 0 0 0 \n", "159570 0 0 0 0 0 0 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at some of our test comments." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcomment_text
000001cee341fdb12Yo bitch Ja Rule is more succesful then you'll...
10000247867823ef7== From RfC == \\n\\n The title is fine as it is...
200013b17ad220c46\" \\n\\n == Sources == \\n\\n * Zawe Ashton on Lap...
300017563c3f7919a:If you have a look back at the source, the in...
400017695ad8997ebI don't anonymously edit articles at all.
\n", "
" ], "text/plain": [ " id comment_text\n", "0 00001cee341fdb12 Yo bitch Ja Rule is more succesful then you'll...\n", "1 0000247867823ef7 == From RfC == \\n\\n The title is fine as it is...\n", "2 00013b17ad220c46 \" \\n\\n == Sources == \\n\\n * Zawe Ashton on Lap...\n", "3 00017563c3f7919a :If you have a look back at the source, the in...\n", "4 00017695ad8997eb I don't anonymously edit articles at all." ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Yo bitch Ja Rule is more succesful then you'll ever be whats up with you and hating you sad mofuckas...i should bitch slap ur pethedic white faces and get you to kiss my ass you guys sicken me. Ja rule is about pride in da music man. dont diss that shit on him. and nothin is wrong bein like tupac he was a brother too...fuckin white boys get things right next time.,\"" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test.loc[0].comment_text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This looks like we have some obscenity and insult, combined with a dash of identity hate.\n", "\n", "They are also completely wrong, since 50 Cent $>$ Ja Rule any day. Well, maybe not his last album..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Missing data?\n", "\n", "Only a little." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "test.fillna(' ',inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vectorizing the comments\n", "\n", "We'll do it by words and by character. Internet comments are a cesspool, and there are character ngrams that can have toxic meanings. Maybe we can also combine them." 
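, "\n", "\n", "As a rough, self-contained sketch (toy comments and illustrative parameters, not the exact settings used below), the two feature spaces can be built and blended like this:\n", "\n", "```python\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from scipy import sparse\n", "\n", "docs = ['you are great', 'you are a total idiot', 'idiot idiot idiot']\n", "\n", "# word unigrams/bigrams and character 1- to 5-grams\n", "words = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))\n", "chars = TfidfVectorizer(analyzer='char', ngram_range=(1, 5))\n", "Xw = words.fit_transform(docs)\n", "Xc = chars.fit_transform(docs)\n", "\n", "# blend by horizontal stacking: one row per comment, word columns then char columns\n", "X = sparse.hstack([Xw, Xc]).tocsr()\n", "print(X.shape)\n", "```"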
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def tok(s):\n", " return re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])').sub(r' \\1 ',s).split()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "words = TfidfVectorizer(ngram_range=(1,2), lowercase=True,\n", " analyzer='word',stop_words='english',tokenizer=tok,\n", " min_df=3,max_df=0.9, sublinear_tf=1, smooth_idf=1, dtype=np.float32,\n", " strip_accents='unicode')\n", "chars = TfidfVectorizer(ngram_range=(1,5), lowercase=True,\n", " analyzer='char',min_df=3,\n", " max_df=0.9, sublinear_tf=1,smooth_idf=1,dtype=np.float32,)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train_words = words.fit_transform(train['comment_text'])\n", "train_chars = chars.fit_transform(train['comment_text'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Only words" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X = sparse.csr_matrix(train_words)\n", "cols = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']\n", "y = train[cols]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=10101)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "logit = LogisticRegression(C=4, dual=True)\n", "pred = np.zeros((X_test.shape[0],y_test.shape[1]))\n", "for i,c in enumerate(cols):\n", " logit.fit(X_train,y_train[c])\n", " pred[:,i] = logit.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confusion matrix for toxic\n", "[[42996 244]\n", " [ 
1767 2865]]\n", "Confusion matrix for severe_toxic\n", "[[47295 87]\n", " [ 390 100]]\n", "Confusion matrix for obscene\n", "[[45167 148]\n", " [ 892 1665]]\n", "Confusion matrix for threat\n", "[[47720 15]\n", " [ 121 16]]\n", "Confusion matrix for insult\n", "[[45231 269]\n", " [ 1158 1214]]\n", "Confusion matrix for identity_hate\n", "[[47414 40]\n", " [ 338 80]]\n" ] } ], "source": [ "for i,c in enumerate(cols):\n", " print('Confusion matrix for', c)\n", " print(confusion_matrix(y_test[c],pred[:,i]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Only characters" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_c = sparse.csr_matrix(train_chars)\n", "X_train,X_test,y_train,y_test = train_test_split(X_c,y,test_size=0.3,random_state=10101)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pred = np.zeros((X_test.shape[0],y_test.shape[1]))\n", "for i,c in enumerate(cols):\n", " logit.fit(X_train,y_train[c])\n", " pred[:,i] = logit.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confusion matrix for toxic\n", "[[42927 313]\n", " [ 1540 3092]]\n", "Confusion matrix for severe_toxic\n", "[[47284 98]\n", " [ 372 118]]\n", "Confusion matrix for obscene\n", "[[45146 169]\n", " [ 791 1766]]\n", "Confusion matrix for threat\n", "[[47727 8]\n", " [ 109 28]]\n", "Confusion matrix for insult\n", "[[45212 288]\n", " [ 1006 1366]]\n", "Confusion matrix for identity_hate\n", "[[47417 37]\n", " [ 315 103]]\n" ] } ], "source": [ "for i,c in enumerate(cols):\n", " print('Confusion matrix for', c)\n", " print(confusion_matrix(y_test[c],pred[:,i]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Characterwise vectorization appears to have better results with ngrams of up to size 5, but this could vary with different 
splits." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combine words and characters\n", "Horizontally stacking the word and character feature sets creates a larger blended dataset." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X2 = sparse.hstack([train_words,train_chars])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_train,X_test,y_train,y_test = train_test_split(X2,y,test_size=0.3,random_state=10101)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pred = np.zeros((X_test.shape[0],y_test.shape[1]))\n", "for i,c in enumerate(cols):\n", " logit.fit(X_train,y_train[c])\n", " pred[:,i] = logit.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confusion matrix for toxic\n", "[[42935 305]\n", " [ 1472 3160]]\n", "Confusion matrix for severe_toxic\n", "[[47270 112]\n", " [ 367 123]]\n", "Confusion matrix for obscene\n", "[[45128 187]\n", " [ 757 1800]]\n", "Confusion matrix for threat\n", "[[47716 19]\n", " [ 106 31]]\n", "Confusion matrix for insult\n", "[[45198 302]\n", " [ 982 1390]]\n", "Confusion matrix for identity_hate\n", "[[47404 50]\n", " [ 308 110]]\n" ] } ], "source": [ "for i,c in enumerate(cols):\n", " print('Confusion matrix for', c)\n", " print(confusion_matrix(y_test[c],pred[:,i]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combining the word and character features seems to give better results than either does separately. However, we can see that the model does best at identifying toxic, obscene and insult comments, and these are the ones that are likely to have specific keywords associated with them. There would appear to be heavy subjectivity when it comes to what exactly constitutes severe toxicity. 
Threats are very context-sensitive and identity hate is a lot easier to identify for some groups than others.\n", "\n", "It's also extremely memory-intensive for a machine with a very normal 16 GB of RAM." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NB-SVM\n", "\n", "Wang & Manning (2012) find that Multinomial Naive Bayes performs better at classifying smaller snippets of text, while SVM is superior with full-length text. By combining the two models with linear interpolation, they create a new model that is robust for a wide variety of text.\n", "\n", "There is a very helpful Python implementation by Jeremy Howard." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import string, re\n", "from collections import Counter\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, auc\n", "from scipy import sparse\n", "from sklearn.model_selection import train_test_split\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train = pd.read_csv('train.csv')\n", "test = pd.read_csv('test.csv')\n", "cols = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']\n", "test.fillna(' ',inplace=True)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# here's the main part of the implementation by jhoward mentioned above\n", "def pr(y_i, y):\n", " p = train_words[y == y_i].sum(0)\n", " return (p+1)/((y==y_i).sum()+1)\n", "\n", "# Get rid of the punctuation. 
Again, thanks to jhoward for this...\n", "def tok(s):\n", " return re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])').sub(r' \\1 ',s).split()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "words = TfidfVectorizer(ngram_range=(1,2), lowercase=True,\n", " analyzer='word',stop_words='english',tokenizer=tok,\n", " min_df=3,max_df=0.9, sublinear_tf=1, smooth_idf=1, use_idf=1,\n", " strip_accents='unicode')" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train_words = words.fit_transform(train['comment_text'])\n", "test_words = words.transform(test['comment_text'])" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pred = np.zeros((test.shape[0],len(cols)))\n", "for i,c in enumerate(cols):\n", " logit = LogisticRegression(C=4, dual=True) \n", " r = np.log(pr(1,train[c].values)/pr(0,train[c].values))\n", " X_nb = train_words.multiply(r)\n", " logit.fit(X_nb,train[c].values)\n", " pred[:,i] = logit.predict_proba(test_words.multiply(r))[:,1]" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true }, "outputs": [], "source": [ "submission = pd.read_csv('sample_submission.csv')\n", "submission[cols] = pred\n", "submission.to_csv('submission.csv',index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives a score in the 0.07s (mean column-wise log loss), which is on the high side. But it was quick and painless; there wasn't any lemmatization, feature engineering, using toxic word dictionaries, spellchecking, or using existing repositories of vectorized text. So there's a lot of room for improvement." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idtoxicsevere_toxicobscenethreatinsultidentity_hate
000001cee341fdb120.9999960.0565000.9999640.0020580.9865710.362056
10000247867823ef70.0060750.0009660.0042010.0001200.0055120.000392
200013b17ad220c460.0099470.0008120.0047200.0000980.0041240.000283
300017563c3f7919a0.0011600.0002830.0010510.0002400.0010280.000225
400017695ad8997eb0.0160370.0003920.0016960.0001380.0025330.000301
\n", "
" ], "text/plain": [ " id toxic severe_toxic obscene threat insult \\\n", "0 00001cee341fdb12 0.999996 0.056500 0.999964 0.002058 0.986571 \n", "1 0000247867823ef7 0.006075 0.000966 0.004201 0.000120 0.005512 \n", "2 00013b17ad220c46 0.009947 0.000812 0.004720 0.000098 0.004124 \n", "3 00017563c3f7919a 0.001160 0.000283 0.001051 0.000240 0.001028 \n", "4 00017695ad8997eb 0.016037 0.000392 0.001696 0.000138 0.002533 \n", "\n", " identity_hate \n", "0 0.362056 \n", "1 0.000392 \n", "2 0.000283 \n", "3 0.000225 \n", "4 0.000301 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "submission.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first entry (0) is from the enlightened Ja Rule supporter shown above. We have detected the obscenity and insult (and so it is definitely toxic), but it's a bit weak on the identity hate measure. To be fair, calling someone a \"fuckin white boy\" ranks low on the identity hate ladder, but arguably it still should count. We don't really know what it's true classification is at this point." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Keras/TensorFlow\n", "\n", "The Keras API is a nice way to access TensorFlow, which will be needed to create convolutional (CNN) and recurrent neural networks (RNNs)." 
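, "\n", "\n", "The models below consume fixed-length integer sequences. Keras's pad_sequences (used in tok_seq further down) behaves, with its default settings, roughly like this toy reimplementation: left-pad with zeros and keep only the last maxlen tokens (maxlen=200 in this notebook):\n", "\n", "```python\n", "import numpy as np\n", "\n", "def pad(seqs, maxlen):\n", "    # mimics keras.preprocessing.sequence.pad_sequences defaults ('pre' pad/truncate)\n", "    out = np.zeros((len(seqs), maxlen), dtype=int)\n", "    for i, s in enumerate(seqs):\n", "        s = s[-maxlen:]               # keep the last maxlen tokens\n", "        out[i, maxlen - len(s):] = s  # left-pad with zeros\n", "    return out\n", "\n", "print(pad([[5, 3, 8], [1]], 5))  # rows: [0 0 5 3 8] and [0 0 0 0 1]\n", "```"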
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from nltk.corpus import stopwords\n", "from nltk import WordNetLemmatizer\n", "from nltk import pos_tag, word_tokenize\n", "\n", "from keras.models import Sequential\n", "from keras.layers import (Dense, Dropout, Input, LSTM, Activation, Flatten,\n", " Convolution1D, MaxPooling1D, Bidirectional,\n", " GlobalMaxPooling1D, Embedding, BatchNormalization,\n", " SpatialDropout1D)\n", "from keras.preprocessing import sequence\n", "from keras.preprocessing.text import Tokenizer\n", "from keras.preprocessing.sequence import pad_sequences\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, auc" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train = pd.read_csv('train.csv')\n", "test = pd.read_csv('test.csv')\n", "cols = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']\n", "test.fillna(' ',inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll convert everything to lower case, remove stopwords, lemmatize words (reduce them to their base forms), and convert the text into padded sequences of length 200 to run through the learning models.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def clean_up(t):\n", " t = t.strip().lower()\n", " words = t.split()\n", " \n", " # first get rid of the stopwords, or a lemmatized stopword might not\n", " # be recognized as a stopword\n", " \n", " imp_words = ' '.join(w for w in words if w not in set(stopwords.words('english')))\n", "\n", " # lemmatize based on adjectives (J), verbs (V), nouns (N) and adverbs (R) to\n", " # return only the base words (as opposed to stemming which can return\n", " # non-words). e.g. 
ponies -> poni with stemming, and pony with lemmatizing\n", " \n", " final_words = ''\n", " \n", " lemma = WordNetLemmatizer()\n", " for (w,tag) in pos_tag(word_tokenize(imp_words)):\n", " if tag.startswith('J'):\n", " final_words += ' '+ lemma.lemmatize(w, pos='a')\n", " elif tag.startswith('V'):\n", " final_words += ' '+ lemma.lemmatize(w, pos='v')\n", " elif tag.startswith('N'):\n", " final_words += ' '+ lemma.lemmatize(w, pos='n')\n", " elif tag.startswith('R'):\n", " final_words += ' '+ lemma.lemmatize(w, pos='r')\n", " else:\n", " final_words += ' '+ w\n", " \n", " return final_words\n", "\n", "# what a great name. do_stuff\n", "\n", "def do_stuff (df):\n", " text = df['comment_text'].copy()\n", " \n", " # First get rid of anything that's not a letter. This may not be the greatest idea, since\n", " # on3 c4n 3451ly substitute numbers in for letters, but keep it like this for now.\n", " \n", " text.replace(to_replace={r'[^\\x00-\\x7F]':' '},inplace=True,regex=True)\n", " text.replace(to_replace={r'[^a-zA-Z]': ' '},inplace=True,regex=True)\n", " \n", " # Then lower case, tokenize and lemmatize\n", "\n", " text = text.apply(lambda t:clean_up(t))\n", " return text\n", "\n", "def tok_seq (train,test):\n", " tok = Tokenizer(num_words=100000)\n", " # fit the vocabulary on the training texts passed in\n", " tok.fit_on_texts(train)\n", " \n", " # cap each sequence at 200 tokens\n", " seq_train = tok.texts_to_sequences(train)\n", " seq_test = tok.texts_to_sequences(test) \n", " data_train = pad_sequences(seq_train,maxlen=200)\n", " data_test = pad_sequences(seq_test,maxlen=200)\n", " \n", " return data_train,data_test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convolution model with 25% dropouts to help with generalization." 
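, "\n", "\n", "GlobalMaxPooling1D keeps, for each convolutional filter, only its strongest activation across the whole sequence, collapsing (batch, timesteps, filters) down to (batch, filters). A NumPy illustration with made-up activations:\n", "\n", "```python\n", "import numpy as np\n", "\n", "acts = np.array([[[0.1, 0.9],\n", "                  [0.5, 0.2],\n", "                  [0.3, 0.4]]])  # (batch=1, timesteps=3, filters=2)\n", "pooled = acts.max(axis=1)        # max over the timestep axis\n", "print(pooled)                    # [[0.5 0.9]]\n", "```"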
] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def seq_model (X_train, y_train, test, val='no'):\n", " model = Sequential()\n", " model.add(Embedding(100000,50,input_length=200))\n", " model.add(Dropout(0.25))\n", " model.add(Convolution1D(250, activation='relu',padding='valid',kernel_size=3))\n", " model.add(GlobalMaxPooling1D())\n", " model.add(Dense(250))\n", " model.add(Dropout(0.25))\n", " model.add(Activation('relu'))\n", " \n", " # A sigmoid ensures a bounded solution in (0,1) \n", " model.add(Dense(6,activation='sigmoid'))\n", " model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n", " \n", " # batch_size=1000 seems to be the limit of my 2gb GTX 960m\n", " # like with all predictive modeling, there is an under/overfitting\n", " # tradeoff between too few epochs and too many\n", " if val == 'no':\n", " model.fit(X_train,y_train,batch_size=1000,epochs=5)\n", " else:\n", " model.fit(X_train,y_train,batch_size=1000,epochs=5,validation_split=0.1)\n", " pred = model.predict(test)\n", " return pred" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bidirectional LSTM model with similar dropouts." 
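, "\n", "\n", "A Bidirectional wrapper runs one LSTM over the sequence forwards and another backwards, then (by default) concatenates their outputs, so LSTM(50) yields 100 features per timestep. Shape-wise, with stand-in arrays:\n", "\n", "```python\n", "import numpy as np\n", "\n", "forward = np.zeros((1, 200, 50))   # stand-ins for the two directions' outputs\n", "backward = np.zeros((1, 200, 50))\n", "merged = np.concatenate([forward, backward], axis=-1)\n", "print(merged.shape)                # (1, 200, 100)\n", "```"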
] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def bidirect_model (X_train, y_train, test, val='no'):\n", " model=Sequential()\n", " model.add(Embedding(100000,100,input_length=200))\n", " model.add(Bidirectional(LSTM(50, return_sequences=True)))\n", " model.add(GlobalMaxPooling1D())\n", " model.add(Dropout(0.25))\n", " model.add(Dense(250))\n", " model.add(Dropout(0.25))\n", " model.add(Activation('relu'))\n", " \n", " model.add(Dense(6,activation='sigmoid'))\n", " model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\n", " \n", " if val == 'no':\n", " model.fit(X_train,y_train,batch_size=1000,epochs=4)\n", " else:\n", " model.fit(X_train,y_train,batch_size=1000,epochs=4,validation_split=0.1)\n", " pred = model.predict(test)\n", " return pred" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train/Validation\n", "\n", "First let's take a look at results with a train test split." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_train = do_stuff(train)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cols = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']\n", "y_train = train[cols].values" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": true }, "outputs": [], "source": [ "Xt, Xv, yt, yv = train_test_split(X_train, y_train, test_size=0.2, random_state=11)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "Xt, Xv = tok_seq(Xt,Xv)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/5\n", "127656/127656 [==============================] - 33s 260us/step - loss: 0.1777 - acc: 0.9598\n", "Epoch 2/5\n", "127656/127656 [==============================] - 
27s 214us/step - loss: 0.1050 - acc: 0.9651\n", "Epoch 3/5\n", "127656/127656 [==============================] - 27s 213us/step - loss: 0.0550 - acc: 0.9803\n", "Epoch 4/5\n", "127656/127656 [==============================] - 27s 213us/step - loss: 0.0456 - acc: 0.9826\n", "Epoch 5/5\n", "127656/127656 [==============================] - 27s 213us/step - loss: 0.0417 - acc: 0.9840\n" ] } ], "source": [ "pred_seq = seq_model (Xt, yt, Xv)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/4\n", "127656/127656 [==============================] - 166s 1ms/step - loss: 0.1802 - acc: 0.9591\n", "Epoch 2/4\n", "127656/127656 [==============================] - 163s 1ms/step - loss: 0.0769 - acc: 0.9740\n", "Epoch 3/4\n", "127656/127656 [==============================] - 162s 1ms/step - loss: 0.0606 - acc: 0.9785\n", "Epoch 4/4\n", "127656/127656 [==============================] - 162s 1ms/step - loss: 0.0447 - acc: 0.9831\n" ] } ], "source": [ "pred_bid = bidirect_model(Xt,yt,Xv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unsurprisingly, bidirectional adds a hefty amount of computing time.\n", "\n", "Using a GPU for computation is kind of like reading Playboy for the articles, where it might seem questionable at first but GPUs are really good for deep learning and every now and then Playboy has a great article about the military industrial complex." 
] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.965157908356303" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "roc_auc_score(yv, pred_seq)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.96750127296574728" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "roc_auc_score(yv, pred_bid)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Correlation between results of toxic\n", "0.936130675937\n", "Correlation between results of severe_toxic\n", "0.933198514747\n", "Correlation between results of obscene\n", "0.958145180505\n", "Correlation between results of threat\n", "0.901314923493\n", "Correlation between results of insult\n", "0.949524363145\n", "Correlation between results of identity_hate\n", "0.935303371467\n" ] } ], "source": [ "for i,c in enumerate(cols):\n", " print ('Correlation between results of', c)\n", " print(np.corrcoef(pred_bid[:,i],pred_seq[:,i])[0,1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two sets of predictions are fairly highly correlated for our validation split, but we can still try mean ensembling the results." 
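, "\n", "\n", "Averaging helps most when the models make different mistakes. A small hypothetical (made-up labels and scores) where the blend outscores either model on AUC:\n", "\n", "```python\n", "import numpy as np\n", "from sklearn.metrics import roc_auc_score\n", "\n", "y = np.array([0, 0, 1, 1, 0, 1])\n", "p1 = np.array([0.1, 0.4, 0.8, 0.3, 0.2, 0.9])  # two imperfect scorers\n", "p2 = np.array([0.2, 0.1, 0.4, 0.7, 0.5, 0.8])\n", "blend = 0.5 * (p1 + p2)\n", "# here the mean ensemble scores higher than either model alone\n", "print(roc_auc_score(y, p1), roc_auc_score(y, p2), roc_auc_score(y, blend))\n", "```"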
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "roc_auc_score(yv,0.5*(pred_seq+pred_bid))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Testing on the public test set" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_test = do_stuff(test)" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_train,X_test=tok_seq(X_train,X_test)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": true }, "outputs": [], "source": [ "y_train = train[['toxic', 'severe_toxic','obscene','threat','insult','identity_hate']].values" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 143613 samples, validate on 15958 samples\n", "Epoch 1/5\n", "143613/143613 [==============================] - 32s 225us/step - loss: 0.1688 - acc: 0.9606 - val_loss: 0.1124 - val_acc: 0.9627\n", "Epoch 2/5\n", "143613/143613 [==============================] - 31s 218us/step - loss: 0.0884 - acc: 0.9706 - val_loss: 0.0607 - val_acc: 0.9790\n", "Epoch 3/5\n", "143613/143613 [==============================] - 31s 218us/step - loss: 0.0506 - acc: 0.9815 - val_loss: 0.0568 - val_acc: 0.9793\n", "Epoch 4/5\n", "143613/143613 [==============================] - 31s 217us/step - loss: 0.0449 - acc: 0.9829 - val_loss: 0.0554 - val_acc: 0.9796\n", "Epoch 5/5\n", "143613/143613 [==============================] - 31s 218us/step - loss: 0.0412 - acc: 0.9842 - val_loss: 0.0575 - val_acc: 0.9784\n" ] } ], "source": [ "finalpred_seq = seq_model(X_train, y_train, X_test,val='yes')" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 143613 samples, validate on 15958 samples\n", "Epoch 1/4\n", 
"143613/143613 [==============================] - 190s 1ms/step - loss: 0.1739 - acc: 0.9598 - val_loss: 0.0773 - val_acc: 0.9744\n", "Epoch 2/4\n", "143613/143613 [==============================] - 188s 1ms/step - loss: 0.0576 - acc: 0.9796 - val_loss: 0.0536 - val_acc: 0.9804\n", "Epoch 3/4\n", "143613/143613 [==============================] - 188s 1ms/step - loss: 0.0452 - acc: 0.9832 - val_loss: 0.0549 - val_acc: 0.9809\n", "Epoch 4/4\n", "143613/143613 [==============================] - 187s 1ms/step - loss: 0.0397 - acc: 0.9847 - val_loss: 0.0560 - val_acc: 0.9809\n" ] } ], "source": [ "finalpred_bid = bidirect_model (X_train, y_train, X_test,val='yes')" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": true }, "outputs": [], "source": [ "submission1 = pd.read_csv('sample_submission.csv')\n", "submission1[cols] = finalpred_seq\n", "submission1.to_csv('submission1.csv',index=False)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": true }, "outputs": [], "source": [ "submission2 = pd.read_csv('sample_submission.csv')\n", "submission2[cols] = finalpred_bid\n", "submission2.to_csv('submission2.csv',index=False)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": true }, "outputs": [], "source": [ "submission3 = pd.read_csv('sample_submission.csv')\n", "submission3[cols] = 0.5*(submission1[cols] + submission2[cols])\n", "submission3.to_csv('submission3.csv',index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One more..." 
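, "\n", "\n", "This model replaces the convolutional/recurrent layer with SpatialDropout1D over the embeddings plus global max pooling. SpatialDropout1D zeroes entire embedding channels rather than individual entries; conceptually (a toy mask, ignoring the train-time rescaling real dropout applies):\n", "\n", "```python\n", "import numpy as np\n", "\n", "rng = np.random.default_rng(0)\n", "x = np.ones((1, 4, 3))               # (batch, timesteps, channels)\n", "keep = rng.random((1, 1, 3)) > 0.25  # one keep/drop decision per channel\n", "dropped = x * keep                   # a dropped channel is zero at every timestep\n", "print(dropped[0])\n", "```"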
] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def onemore_model (X_train, y_train, test, val='no'):\n", " model=Sequential()\n", " model.add(Embedding(100000,100,input_length=200))\n", " model.add(SpatialDropout1D(0.25))\n", " model.add(GlobalMaxPooling1D())\n", " model.add(BatchNormalization())\n", " model.add(Dense(128))\n", " model.add(Dropout(0.5))\n", " \n", " model.add(Dense(6,activation='sigmoid'))\n", " model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\n", " \n", " if val == 'no':\n", " model.fit(X_train,y_train,batch_size=1000,epochs=5)\n", " else:\n", " model.fit(X_train,y_train,batch_size=1000,epochs=5,validation_split=0.1)\n", " pred = model.predict(test)\n", " return pred" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 143613 samples, validate on 15958 samples\n", "Epoch 1/5\n", "143613/143613 [==============================] - 14s 97us/step - loss: 0.3350 - acc: 0.8389 - val_loss: 0.0671 - val_acc: 0.9772\n", "Epoch 2/5\n", "143613/143613 [==============================] - 13s 89us/step - loss: 0.0808 - acc: 0.9751 - val_loss: 0.0590 - val_acc: 0.9792\n", "Epoch 3/5\n", "143613/143613 [==============================] - 13s 89us/step - loss: 0.0619 - acc: 0.9794 - val_loss: 0.0567 - val_acc: 0.9797\n", "Epoch 4/5\n", "143613/143613 [==============================] - 13s 89us/step - loss: 0.0523 - acc: 0.9817 - val_loss: 0.0560 - val_acc: 0.9801\n", "Epoch 5/5\n", "143613/143613 [==============================] - 13s 89us/step - loss: 0.0478 - acc: 0.9828 - val_loss: 0.0555 - val_acc: 0.9804\n" ] } ], "source": [ "finalpred_om = onemore_model (X_train, y_train, X_test,val='yes')" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": true }, "outputs": [], "source": [ "submission4 = pd.read_csv('sample_submission.csv')\n", 
"submission4[cols] = finalpred_om\n", "submission4.to_csv('submission4.csv',index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This last one scored 0.067. However, by dividing the predictions matrix by 1.12, this score actually improves to 0.063. This is due to the unbalanced nature of the dataset, and has led to some discussion about switching to an AUC scoring system. Regardless, there is still a lot of room for improvement. But I think that getting within striking distance of the top 25% isn't too shabby considering I have about 4 days' worth of natural language processing experience." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Conclusion\n", "\n", "The obvious next step would be to use existing pre-trained co-occurrence vectors such as GloVe or Facebook fastText. Spell checking and toxic word dictionaries may also be helpful. Possible features that can be created include the use of all caps, the prevalence of symbols within the body of the text, or exclamation and question marks. These can be determined prior to tokenization/lemmatization/forced lowercasing. Comment length can also be useful; anecdotally, at times there seems to be a slight correlation between the length of a comment and how angry its writer is. Early stopping can also be incorporated to reduce overfitting.\n", "\n", "These will almost certainly be necessary for significant score improvements. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }