{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Classifying Toxic Comments\n", "\n", "by: Keith Qu\n", "\n", "Some natural language classification of toxic comments using logistic regression and also keras (running on tensorflow). This is a very broad run through ranging from basic linear methods, to modified linear methods, to deep learning.\n", "\n", "Methods include logistic regression, NB-SVM, and CNN and RNN (bidirectional LSTM) with Keras on a TensorFlow backend." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import re, string\n", "from collections import Counter\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, auc\n", "from scipy import sparse\n", "from sklearn.model_selection import train_test_split\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First look with logit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Toxicity categories:\n", "\n", "The labels are fairly self-explanatory, but it also seems like there should be a lot of overlap between the categories, since everything that qualifies as severe_toxic, obscene, threat, insult or identity_hate should also be regular toxic as well.\n", "\n", "It makes sense that the categories are not exclusive. So we can treat them like 6 different classification problems." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train = pd.read_csv('train.csv')\n", "test = pd.read_csv('test.csv')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcomment_texttoxicsevere_toxicobscenethreatinsultidentity_hate
159566ffe987279560d7ff\":::::And for the second time of asking, when ...000000
159567ffea4adeee384e90You should be ashamed of yourself \\n\\nThat is ...000000
159568ffee36eab5c267c9Spitzer \\n\\nUmm, theres no actual article for ...000000
159569fff125370e4aaaf3And it looks like it was actually you who put ...000000
159570fff46fc426af1f9a\"\\nAnd ... I really don't think you understand...000000
\n", "
" ], "text/plain": [ " id comment_text \\\n", "159566 ffe987279560d7ff \":::::And for the second time of asking, when ... \n", "159567 ffea4adeee384e90 You should be ashamed of yourself \\n\\nThat is ... \n", "159568 ffee36eab5c267c9 Spitzer \\n\\nUmm, theres no actual article for ... \n", "159569 fff125370e4aaaf3 And it looks like it was actually you who put ... \n", "159570 fff46fc426af1f9a \"\\nAnd ... I really don't think you understand... \n", "\n", " toxic severe_toxic obscene threat insult identity_hate \n", "159566 0 0 0 0 0 0 \n", "159567 0 0 0 0 0 0 \n", "159568 0 0 0 0 0 0 \n", "159569 0 0 0 0 0 0 \n", "159570 0 0 0 0 0 0 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at some of our test comments." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcomment_text
000001cee341fdb12Yo bitch Ja Rule is more succesful then you'll...
10000247867823ef7== From RfC == \\n\\n The title is fine as it is...
200013b17ad220c46\" \\n\\n == Sources == \\n\\n * Zawe Ashton on Lap...
300017563c3f7919a:If you have a look back at the source, the in...
400017695ad8997ebI don't anonymously edit articles at all.
\n", "
" ], "text/plain": [ " id comment_text\n", "0 00001cee341fdb12 Yo bitch Ja Rule is more succesful then you'll...\n", "1 0000247867823ef7 == From RfC == \\n\\n The title is fine as it is...\n", "2 00013b17ad220c46 \" \\n\\n == Sources == \\n\\n * Zawe Ashton on Lap...\n", "3 00017563c3f7919a :If you have a look back at the source, the in...\n", "4 00017695ad8997eb I don't anonymously edit articles at all." ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Yo bitch Ja Rule is more succesful then you'll ever be whats up with you and hating you sad mofuckas...i should bitch slap ur pethedic white faces and get you to kiss my ass you guys sicken me. Ja rule is about pride in da music man. dont diss that shit on him. and nothin is wrong bein like tupac he was a brother too...fuckin white boys get things right next time.,\"" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test.loc[0].comment_text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This looks like we have some obscenity and insult, combined with a dash of identity hate.\n", "\n", "They are also completely wrong, since 50 Cent $>$ Ja Rule any day. Well, maybe not his last album..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Missing data?\n", "\n", "Only a little." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "test.fillna(' ',inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vectorizing the comments\n", "\n", "We'll do it by words and by character. Internet comments are a cesspool, and there are character ngrams that can have toxic meanings. Maybe we can also combine them." 
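, "\n", "\n", "As a rough, self-contained sketch (toy comments and illustrative parameters, not the exact settings used below), the two feature spaces can be built and blended like this:\n", "\n", "```python\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from scipy import sparse\n", "\n", "docs = ['you are great', 'you are a total idiot', 'idiot idiot idiot']\n", "\n", "# word unigrams/bigrams and character 1- to 5-grams\n", "words = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))\n", "chars = TfidfVectorizer(analyzer='char', ngram_range=(1, 5))\n", "Xw = words.fit_transform(docs)\n", "Xc = chars.fit_transform(docs)\n", "\n", "# blend by horizontal stacking: one row per comment, word columns then char columns\n", "X = sparse.hstack([Xw, Xc]).tocsr()\n", "print(X.shape)\n", "```"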
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def tok(s):\n", " return re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])').sub(r' \\1 ',s).split()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "words = TfidfVectorizer(ngram_range=(1,2), lowercase=True,\n", " analyzer='word',stop_words='english',tokenizer=tok,\n", " min_df=3,max_df=0.9, sublinear_tf=1, smooth_idf=1, dtype=np.float32,\n", " strip_accents='unicode')\n", "chars = TfidfVectorizer(ngram_range=(1,5), lowercase=True,\n", " analyzer='char',min_df=3,\n", " max_df=0.9, sublinear_tf=1,smooth_idf=1,dtype=np.float32,)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train_words = words.fit_transform(train['comment_text'])\n", "train_chars = chars.fit_transform(train['comment_text'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Only words" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X = sparse.csr_matrix(train_words)\n", "cols = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']\n", "y = train[cols]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=10101)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "logit = LogisticRegression(C=4, dual=True)\n", "pred = np.zeros((X_test.shape[0],y_test.shape[1]))\n", "for i,c in enumerate(cols):\n", " logit.fit(X_train,y_train[c])\n", " pred[:,i] = logit.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confusion matrix for toxic\n", "[[42996 244]\n", " [ 
1767 2865]]\n", "Confusion matrix for severe_toxic\n", "[[47295 87]\n", " [ 390 100]]\n", "Confusion matrix for obscene\n", "[[45167 148]\n", " [ 892 1665]]\n", "Confusion matrix for threat\n", "[[47720 15]\n", " [ 121 16]]\n", "Confusion matrix for insult\n", "[[45231 269]\n", " [ 1158 1214]]\n", "Confusion matrix for identity_hate\n", "[[47414 40]\n", " [ 338 80]]\n" ] } ], "source": [ "for i,c in enumerate(cols):\n", " print('Confusion matrix for', c)\n", " print(confusion_matrix(y_test[c],pred[:,i]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Only characters" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_c = sparse.csr_matrix(train_chars)\n", "X_train,X_test,y_train,y_test = train_test_split(X_c,y,test_size=0.3,random_state=10101)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pred = np.zeros((X_test.shape[0],y_test.shape[1]))\n", "for i,c in enumerate(cols):\n", " logit.fit(X_train,y_train[c])\n", " pred[:,i] = logit.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confusion matrix for toxic\n", "[[42927 313]\n", " [ 1540 3092]]\n", "Confusion matrix for severe_toxic\n", "[[47284 98]\n", " [ 372 118]]\n", "Confusion matrix for obscene\n", "[[45146 169]\n", " [ 791 1766]]\n", "Confusion matrix for threat\n", "[[47727 8]\n", " [ 109 28]]\n", "Confusion matrix for insult\n", "[[45212 288]\n", " [ 1006 1366]]\n", "Confusion matrix for identity_hate\n", "[[47417 37]\n", " [ 315 103]]\n" ] } ], "source": [ "for i,c in enumerate(cols):\n", " print('Confusion matrix for', c)\n", " print(confusion_matrix(y_test[c],pred[:,i]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Characterwise vectorization appears to have better results with ngrams of up to size 5, but this could vary with different 
splits." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combine words and characters\n", "Horizontally stacking the word and character feature sets creates a larger blended dataset." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X2 = sparse.hstack([train_words,train_chars])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_train,X_test,y_train,y_test = train_test_split(X2,y,test_size=0.3,random_state=10101)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pred = np.zeros((X_test.shape[0],y_test.shape[1]))\n", "for i,c in enumerate(cols):\n", " logit.fit(X_train,y_train[c])\n", " pred[:,i] = logit.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confusion matrix for toxic\n", "[[42935 305]\n", " [ 1472 3160]]\n", "Confusion matrix for severe_toxic\n", "[[47270 112]\n", " [ 367 123]]\n", "Confusion matrix for obscene\n", "[[45128 187]\n", " [ 757 1800]]\n", "Confusion matrix for threat\n", "[[47716 19]\n", " [ 106 31]]\n", "Confusion matrix for insult\n", "[[45198 302]\n", " [ 982 1390]]\n", "Confusion matrix for identity_hate\n", "[[47404 50]\n", " [ 308 110]]\n" ] } ], "source": [ "for i,c in enumerate(cols):\n", " print('Confusion matrix for', c)\n", " print(confusion_matrix(y_test[c],pred[:,i]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combining the word and character features seems to give better results than either does separately. However, we can see that the model does best at identifying toxic, obscene and insult comments, and these are the ones that are likely to have specific keywords associated with them. There would appear to be heavy subjectivity when it comes to what exactly constitutes severe toxicity. 
Threats are very context-sensitive and identity hate is a lot easier to identify for some groups than others.\n", "\n", "It's also extremely memory-intensive for a machine with a very normal 16 GB of RAM." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NB-SVM\n", "\n", "Wang & Manning (2012) find that Multinomial Naive Bayes performs better at classifying smaller snippets of text, while SVM is superior with full-length text. By combining the two models with linear interpolation, they create a new model that is robust for a wide variety of text.\n", "\n", "There is a very helpful Python implementation by Jeremy Howard." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import string, re\n", "from collections import Counter\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, auc\n", "from scipy import sparse\n", "from sklearn.model_selection import train_test_split\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train = pd.read_csv('train.csv')\n", "test = pd.read_csv('test.csv')\n", "cols = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']\n", "test.fillna(' ',inplace=True)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# here's the main part of the implementation by jhoward mentioned above\n", "def pr(y_i, y):\n", " p = train_words[y == y_i].sum(0)\n", " return (p+1)/((y==y_i).sum()+1)\n", "\n", "# Get rid of the punctuation. 
Again, thanks to jhoward for this...\n", "def tok(s):\n", " return re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])').sub(r' \\1 ',s).split()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "words = TfidfVectorizer(ngram_range=(1,2), lowercase=True,\n", " analyzer='word',stop_words='english',tokenizer=tok,\n", " min_df=3,max_df=0.9, sublinear_tf=1, smooth_idf=1, use_idf=1,\n", " strip_accents='unicode')" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train_words = words.fit_transform(train['comment_text'])\n", "test_words = words.transform(test['comment_text'])" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pred = np.zeros((test.shape[0],len(cols)))\n", "for i,c in enumerate(cols):\n", " logit = LogisticRegression(C=4, dual=True) \n", " r = np.log(pr(1,train[c].values)/pr(0,train[c].values))\n", " X_nb = train_words.multiply(r)\n", " logit.fit(X_nb,train[c].values)\n", " pred[:,i] = logit.predict_proba(test_words.multiply(r))[:,1]" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true }, "outputs": [], "source": [ "submission = pd.read_csv('sample_submission.csv')\n", "submission[cols] = pred\n", "submission.to_csv('submission.csv',index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives a score in the 0.07s (mean column-wise log loss), which is on the high side. But it was quick and painless; there wasn't any lemmatization, feature engineering, using toxic word dictionaries, spellchecking, or using existing repositories of vectorized text. So there's a lot of room for improvement." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idtoxicsevere_toxicobscenethreatinsultidentity_hate
000001cee341fdb120.9999960.0565000.9999640.0020580.9865710.362056
10000247867823ef70.0060750.0009660.0042010.0001200.0055120.000392
200013b17ad220c460.0099470.0008120.0047200.0000980.0041240.000283
300017563c3f7919a0.0011600.0002830.0010510.0002400.0010280.000225
400017695ad8997eb0.0160370.0003920.0016960.0001380.0025330.000301
\n", "
" ], "text/plain": [ " id toxic severe_toxic obscene threat insult \\\n", "0 00001cee341fdb12 0.999996 0.056500 0.999964 0.002058 0.986571 \n", "1 0000247867823ef7 0.006075 0.000966 0.004201 0.000120 0.005512 \n", "2 00013b17ad220c46 0.009947 0.000812 0.004720 0.000098 0.004124 \n", "3 00017563c3f7919a 0.001160 0.000283 0.001051 0.000240 0.001028 \n", "4 00017695ad8997eb 0.016037 0.000392 0.001696 0.000138 0.002533 \n", "\n", " identity_hate \n", "0 0.362056 \n", "1 0.000392 \n", "2 0.000283 \n", "3 0.000225 \n", "4 0.000301 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "submission.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first entry (0) is from the enlightened Ja Rule supporter shown above. We have detected the obscenity and insult (and so it is definitely toxic), but it's a bit weak on the identity hate measure. To be fair, calling someone a \"fuckin white boy\" ranks low on the identity hate ladder, but arguably it still should count. We don't really know what it's true classification is at this point." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Keras/TensorFlow\n", "\n", "The Keras API is a nice way to access TensorFlow, which will be needed to create convolutional (CNN) and recurrent neural networks (RNNs)." 
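, "\n", "\n", "The models below consume fixed-length integer sequences. Keras's pad_sequences (used in tok_seq further down) behaves, with its default settings, roughly like this toy reimplementation: left-pad with zeros and keep only the last maxlen tokens (maxlen=200 in this notebook):\n", "\n", "```python\n", "import numpy as np\n", "\n", "def pad(seqs, maxlen):\n", "    # mimics keras.preprocessing.sequence.pad_sequences defaults ('pre' pad/truncate)\n", "    out = np.zeros((len(seqs), maxlen), dtype=int)\n", "    for i, s in enumerate(seqs):\n", "        s = s[-maxlen:]               # keep the last maxlen tokens\n", "        out[i, maxlen - len(s):] = s  # left-pad with zeros\n", "    return out\n", "\n", "print(pad([[5, 3, 8], [1]], 5))  # rows: [0 0 5 3 8] and [0 0 0 0 1]\n", "```"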
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from nltk.corpus import stopwords\n", "from nltk import WordNetLemmatizer\n", "from nltk import pos_tag, word_tokenize\n", "\n", "from keras.models import Sequential\n", "from keras.layers import (Dense, Dropout, Input, LSTM, Activation, Flatten,\n", " Convolution1D, MaxPooling1D, Bidirectional,\n", " GlobalMaxPooling1D, Embedding, BatchNormalization,\n", " SpatialDropout1D)\n", "from keras.preprocessing import sequence\n", "from keras.preprocessing.text import Tokenizer\n", "from keras.preprocessing.sequence import pad_sequences\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, auc" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train = pd.read_csv('train.csv')\n", "test = pd.read_csv('test.csv')\n", "cols = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']\n", "test.fillna(' ',inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll convert everything to lower case, remove stopwords, lemmatize words (reduce them to their base forms), and convert the text into padded sequences of length 200 to run through the learning models.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def clean_up(t):\n", " t = t.strip().lower()\n", " words = t.split()\n", " \n", " # first get rid of the stopwords, or a lemmatized stopword might not\n", " # be recognized as a stopword\n", " \n", " imp_words = ' '.join(w for w in words if w not in set(stopwords.words('english')))\n", "\n", " # lemmatize based on adjectives (J), verbs (V), nouns (N) and adverbs (R) to\n", " # return only the base words (as opposed to stemming which can return\n", " # non-words). e.g. 
ponies -> poni with stemming, and pony with lemmatizing\n", " \n", " final_words = ''\n", " \n", " lemma = WordNetLemmatizer()\n", " for (w,tag) in pos_tag(word_tokenize(imp_words)):\n", " if tag.startswith('J'):\n", " final_words += ' '+ lemma.lemmatize(w, pos='a')\n", " elif tag.startswith('V'):\n", " final_words += ' '+ lemma.lemmatize(w, pos='v')\n", " elif tag.startswith('N'):\n", " final_words += ' '+ lemma.lemmatize(w, pos='n')\n", " elif tag.startswith('R'):\n", " final_words += ' '+ lemma.lemmatize(w, pos='r')\n", " else:\n", " final_words += ' '+ w\n", " \n", " return final_words\n", "\n", "# what a great name. do_stuff\n", "\n", "def do_stuff (df):\n", " text = df['comment_text'].copy()\n", " \n", " # First get rid of anything that's not a letter. This may not be the greatest idea, since\n", " # on3 c4n 3451ly substitute numbers in for letters, but keep it like this for now.\n", " \n", " text.replace(to_replace={r'[^\\x00-\\x7F]':' '},inplace=True,regex=True)\n", " text.replace(to_replace={r'[^a-zA-Z]': ' '},inplace=True,regex=True)\n", " \n", " # Then lower case, tokenize and lemmatize\n", "\n", " text = text.apply(lambda t:clean_up(t))\n", " return text\n", "\n", "def tok_seq (train,test):\n", " tok = Tokenizer(num_words=100000)\n", " # fit the vocabulary on the training texts passed in\n", " tok.fit_on_texts(train)\n", " \n", " # cap each sequence at 200 tokens\n", " seq_train = tok.texts_to_sequences(train)\n", " seq_test = tok.texts_to_sequences(test) \n", " data_train = pad_sequences(seq_train,maxlen=200)\n", " data_test = pad_sequences(seq_test,maxlen=200)\n", " \n", " return data_train,data_test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convolution model with 25% dropouts to help with generalization." 
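, "\n", "\n", "GlobalMaxPooling1D keeps, for each convolutional filter, only its strongest activation across the whole sequence, collapsing (batch, timesteps, filters) down to (batch, filters). A NumPy illustration with made-up activations:\n", "\n", "```python\n", "import numpy as np\n", "\n", "acts = np.array([[[0.1, 0.9],\n", "                  [0.5, 0.2],\n", "                  [0.3, 0.4]]])  # (batch=1, timesteps=3, filters=2)\n", "pooled = acts.max(axis=1)        # max over the timestep axis\n", "print(pooled)                    # [[0.5 0.9]]\n", "```"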
] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def seq_model (X_train, y_train, test, val='no'):\n", " model = Sequential()\n", " model.add(Embedding(100000,50,input_length=200))\n", " model.add(Dropout(0.25))\n", " model.add(Convolution1D(250, activation='relu',padding='valid',kernel_size=3))\n", " model.add(GlobalMaxPooling1D())\n", " model.add(Dense(250))\n", " model.add(Dropout(0.25))\n", " model.add(Activation('relu'))\n", " \n", " # A sigmoid ensures a bounded solution in (0,1) \n", " model.add(Dense(6,activation='sigmoid'))\n", " model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n", " \n", " # batch_size=1000 seems to be the limit of my 2gb GTX 960m\n", " # like with all predictive modeling, there is an under/overfitting\n", " # tradeoff between too few epochs and too many\n", " if val == 'no':\n", " model.fit(X_train,y_train,batch_size=1000,epochs=5)\n", " else:\n", " model.fit(X_train,y_train,batch_size=1000,epochs=5,validation_split=0.1)\n", " pred = model.predict(test)\n", " return pred" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bidirectional LSTM model with similar dropouts." 
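, "\n", "\n", "A Bidirectional wrapper runs one LSTM over the sequence forwards and another backwards, then (by default) concatenates their outputs, so LSTM(50) yields 100 features per timestep. Shape-wise, with stand-in arrays:\n", "\n", "```python\n", "import numpy as np\n", "\n", "forward = np.zeros((1, 200, 50))   # stand-ins for the two directions' outputs\n", "backward = np.zeros((1, 200, 50))\n", "merged = np.concatenate([forward, backward], axis=-1)\n", "print(merged.shape)                # (1, 200, 100)\n", "```"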
] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def bidirect_model (X_train, y_train, test, val='no'):\n", " model=Sequential()\n", " model.add(Embedding(100000,100,input_length=200))\n", " model.add(Bidirectional(LSTM(50, return_sequences=True)))\n", " model.add(GlobalMaxPooling1D())\n", " model.add(Dropout(0.25))\n", " model.add(Dense(250))\n", " model.add(Dropout(0.25))\n", " model.add(Activation('relu'))\n", " \n", " model.add(Dense(6,activation='sigmoid'))\n", " model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\n", " \n", " if val == 'no':\n", " model.fit(X_train,y_train,batch_size=1000,epochs=4)\n", " else:\n", " model.fit(X_train,y_train,batch_size=1000,epochs=4,validation_split=0.1)\n", " pred = model.predict(test)\n", " return pred" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train/Validation\n", "\n", "First let's take a look at results with a train test split." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_train = do_stuff(train)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cols = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']\n", "y_train = train[cols].values" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": true }, "outputs": [], "source": [ "Xt, Xv, yt, yv = train_test_split(X_train, y_train, test_size=0.2, random_state=11)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "Xt, Xv = tok_seq(Xt,Xv)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/5\n", "127656/127656 [==============================] - 33s 260us/step - loss: 0.1777 - acc: 0.9598\n", "Epoch 2/5\n", "127656/127656 [==============================] - 
27s 214us/step - loss: 0.1050 - acc: 0.9651\n", "Epoch 3/5\n", "127656/127656 [==============================] - 27s 213us/step - loss: 0.0550 - acc: 0.9803\n", "Epoch 4/5\n", "127656/127656 [==============================] - 27s 213us/step - loss: 0.0456 - acc: 0.9826\n", "Epoch 5/5\n", "127656/127656 [==============================] - 27s 213us/step - loss: 0.0417 - acc: 0.9840\n" ] } ], "source": [ "pred_seq = seq_model (Xt, yt, Xv)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/4\n", "127656/127656 [==============================] - 166s 1ms/step - loss: 0.1802 - acc: 0.9591\n", "Epoch 2/4\n", "127656/127656 [==============================] - 163s 1ms/step - loss: 0.0769 - acc: 0.9740\n", "Epoch 3/4\n", "127656/127656 [==============================] - 162s 1ms/step - loss: 0.0606 - acc: 0.9785\n", "Epoch 4/4\n", "127656/127656 [==============================] - 162s 1ms/step - loss: 0.0447 - acc: 0.9831\n" ] } ], "source": [ "pred_bid = bidirect_model(Xt,yt,Xv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unsurprisingly, bidirectional adds a hefty amount of computing time.\n", "\n", "Using a GPU for computation is kind of like reading Playboy for the articles, where it might seem questionable at first but GPUs are really good for deep learning and every now and then Playboy has a great article about the military industrial complex." 
] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.965157908356303" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "roc_auc_score(yv, pred_seq)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.96750127296574728" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "roc_auc_score(yv, pred_bid)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Correlation between results of toxic\n", "0.936130675937\n", "Correlation between results of severe_toxic\n", "0.933198514747\n", "Correlation between results of obscene\n", "0.958145180505\n", "Correlation between results of threat\n", "0.901314923493\n", "Correlation between results of insult\n", "0.949524363145\n", "Correlation between results of identity_hate\n", "0.935303371467\n" ] } ], "source": [ "for i,c in enumerate(cols):\n", " print ('Correlation between results of', c)\n", " print(np.corrcoef(pred_bid[:,i],pred_seq[:,i])[0,1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two sets of predictions are fairly highly correlated for our validation split, but we can still try mean ensembling the results." 
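, "\n", "\n", "Averaging helps most when the models make different mistakes. A small hypothetical (made-up labels and scores) where the blend outscores either model on AUC:\n", "\n", "```python\n", "import numpy as np\n", "from sklearn.metrics import roc_auc_score\n", "\n", "y = np.array([0, 0, 1, 1, 0, 1])\n", "p1 = np.array([0.1, 0.4, 0.8, 0.3, 0.2, 0.9])  # two imperfect scorers\n", "p2 = np.array([0.2, 0.1, 0.4, 0.7, 0.5, 0.8])\n", "blend = 0.5 * (p1 + p2)\n", "# here the mean ensemble scores higher than either model alone\n", "print(roc_auc_score(y, p1), roc_auc_score(y, p2), roc_auc_score(y, blend))\n", "```"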
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "roc_auc_score(yv,0.5*(pred_seq+pred_bid))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Testing on the public test set" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_test = do_stuff(test)" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_train,X_test=tok_seq(X_train,X_test)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": true }, "outputs": [], "source": [ "y_train = train[['toxic', 'severe_toxic','obscene','threat','insult','identity_hate']].values" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 143613 samples, validate on 15958 samples\n", "Epoch 1/5\n", "143613/143613 [==============================] - 32s 225us/step - loss: 0.1688 - acc: 0.9606 - val_loss: 0.1124 - val_acc: 0.9627\n", "Epoch 2/5\n", "143613/143613 [==============================] - 31s 218us/step - loss: 0.0884 - acc: 0.9706 - val_loss: 0.0607 - val_acc: 0.9790\n", "Epoch 3/5\n", "143613/143613 [==============================] - 31s 218us/step - loss: 0.0506 - acc: 0.9815 - val_loss: 0.0568 - val_acc: 0.9793\n", "Epoch 4/5\n", "143613/143613 [==============================] - 31s 217us/step - loss: 0.0449 - acc: 0.9829 - val_loss: 0.0554 - val_acc: 0.9796\n", "Epoch 5/5\n", "143613/143613 [==============================] - 31s 218us/step - loss: 0.0412 - acc: 0.9842 - val_loss: 0.0575 - val_acc: 0.9784\n" ] } ], "source": [ "finalpred_seq = seq_model(X_train, y_train, X_test,val='yes')" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 143613 samples, validate on 15958 samples\n", "Epoch 1/4\n", 
"143613/143613 [==============================] - 190s 1ms/step - loss: 0.1739 - acc: 0.9598 - val_loss: 0.0773 - val_acc: 0.9744\n", "Epoch 2/4\n", "143613/143613 [==============================] - 188s 1ms/step - loss: 0.0576 - acc: 0.9796 - val_loss: 0.0536 - val_acc: 0.9804\n", "Epoch 3/4\n", "143613/143613 [==============================] - 188s 1ms/step - loss: 0.0452 - acc: 0.9832 - val_loss: 0.0549 - val_acc: 0.9809\n", "Epoch 4/4\n", "143613/143613 [==============================] - 187s 1ms/step - loss: 0.0397 - acc: 0.9847 - val_loss: 0.0560 - val_acc: 0.9809\n" ] } ], "source": [ "finalpred_bid = bidirect_model (X_train, y_train, X_test,val='yes')" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": true }, "outputs": [], "source": [ "submission1 = pd.read_csv('sample_submission.csv')\n", "submission1[cols] = finalpred_seq\n", "submission1.to_csv('submission1.csv',index=False)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": true }, "outputs": [], "source": [ "submission2 = pd.read_csv('sample_submission.csv')\n", "submission2[cols] = finalpred_bid\n", "submission2.to_csv('submission2.csv',index=False)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": true }, "outputs": [], "source": [ "submission3 = pd.read_csv('sample_submission.csv')\n", "submission3[cols] = 0.5*(submission1[cols] + submission2[cols])\n", "submission3.to_csv('submission3.csv',index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One more..." 
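, "\n", "\n", "This model replaces the convolutional/recurrent layer with SpatialDropout1D over the embeddings plus global max pooling. SpatialDropout1D zeroes entire embedding channels rather than individual entries; conceptually (a toy mask, ignoring the train-time rescaling real dropout applies):\n", "\n", "```python\n", "import numpy as np\n", "\n", "rng = np.random.default_rng(0)\n", "x = np.ones((1, 4, 3))               # (batch, timesteps, channels)\n", "keep = rng.random((1, 1, 3)) > 0.25  # one keep/drop decision per channel\n", "dropped = x * keep                   # a dropped channel is zero at every timestep\n", "print(dropped[0])\n", "```"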
] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def onemore_model (X_train, y_train, test, val='no'):\n", " model=Sequential()\n", " model.add(Embedding(100000,100,input_length=200))\n", " model.add(SpatialDropout1D(0.25))\n", " model.add(GlobalMaxPooling1D())\n", " model.add(BatchNormalization())\n", " model.add(Dense(128))\n", " model.add(Dropout(0.5))\n", " \n", " model.add(Dense(6,activation='sigmoid'))\n", " model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\n", " \n", " if val == 'no':\n", " model.fit(X_train,y_train,batch_size=1000,epochs=5)\n", " else:\n", " model.fit(X_train,y_train,batch_size=1000,epochs=5,validation_split=0.1)\n", " pred = model.predict(test)\n", " return pred" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 143613 samples, validate on 15958 samples\n", "Epoch 1/5\n", "143613/143613 [==============================] - 14s 97us/step - loss: 0.3350 - acc: 0.8389 - val_loss: 0.0671 - val_acc: 0.9772\n", "Epoch 2/5\n", "143613/143613 [==============================] - 13s 89us/step - loss: 0.0808 - acc: 0.9751 - val_loss: 0.0590 - val_acc: 0.9792\n", "Epoch 3/5\n", "143613/143613 [==============================] - 13s 89us/step - loss: 0.0619 - acc: 0.9794 - val_loss: 0.0567 - val_acc: 0.9797\n", "Epoch 4/5\n", "143613/143613 [==============================] - 13s 89us/step - loss: 0.0523 - acc: 0.9817 - val_loss: 0.0560 - val_acc: 0.9801\n", "Epoch 5/5\n", "143613/143613 [==============================] - 13s 89us/step - loss: 0.0478 - acc: 0.9828 - val_loss: 0.0555 - val_acc: 0.9804\n" ] } ], "source": [ "finalpred_om = onemore_model (X_train, y_train, X_test,val='yes')" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": true }, "outputs": [], "source": [ "submission4 = pd.read_csv('sample_submission.csv')\n", 
"submission4[cols] = finalpred_om\n", "submission4.to_csv('submission4.csv',index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This last one scored 0.067. However, by dividing the predictions matrix by 1.12, this score actually improves to 0.063. This is due to the unbalanced nature of the dataset, and has led to some discussion about switching to an AUC scoring system. Regardless, there is still a lot of room for improvement. But I think that getting within striking distance of the top 25% isn't too shabby considering I have about 4 days' worth of natural language processing experience." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Conclusion\n", "\n", "The obvious next step would be to use existing pre-trained co-occurrence vectors such as GloVe or Facebook fastText. Spell checking and toxic word dictionaries may also be helpful. Possible features that can be created include the use of all caps, the prevalence of symbols within the body of the text, or exclamation and question marks. These can be determined prior to tokenization/lemmatization/forced lowercasing. Comment length can also be useful; anecdotally, at times there seems to be a slight correlation between the length of a comment and how angry its writer is. Early stopping can also be incorporated to reduce overfitting.\n", "\n", "These will almost certainly be necessary for significant score improvements. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }