{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Data source\n", "This [dataset](https://www.kaggle.com/rtatman/ironic-corpus) contains 1950 comments, which have been labeled as ironic (1) or not ironic (-1) by human annotators. The text was taken from Reddit comments.\n", " " ] }, { "cell_type": "code", "execution_count": 143, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import csv\n", "data = []\n", "with open('irony-labeled.csv') as datafile:\n", " csvReader = csv.reader(datafile)\n", " for row in csvReader:\n", " data.append(row)\n" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[\"I suspect atheists are projecting their desires when they imagine Obama is one of their number. Does anyone remember the crazy preacher with whom he was associated? \\nhttp://www. examiner. com/article/obama-and-wright-throw-each-other-under-the-bus\\n\\nI can understand a career politician in the USA needing to feign belief to get elected, but for that purpose I'd imagine a more vanilla choice of church. \\n\\n\\nHe's not an atheist. He's not a liberal either.\",\n", " '-1'],\n", " ['It\\'s funny how the arguments the shills are making here are still so close to the racist remarks the GOP has already admitted to. Always attacking \"lazy minorities and young people. \" \\n\\n>\\xe2\\x80\\x9c[i]f it hurts a bunch of college kids that\\xe2\\x80\\x99s too lazy to get up off their bohunkus [sic] and get a photo ID, so be it,\\xe2\\x80\\x9d and \\xe2\\x80\\x9cif it hurts a bunch of lazy blacks that wants the government to give them everything, so be it. \\xe2\\x80\\x9d \\xe2\\x80\\x9cthe law is going to kick the Democrats in the butt. \\xe2\\x80\\x9d',\n", " '-1'],\n", " [\"We are truly following the patterns of how the mandarins took over empires, not because of the sword, but because control of the endless paper, and regulations, that do more to stagnate most people's lives, then to do anything productive. \\n\\nBut then because they don't see what else they can do, they write up even more laws and regulations, that either do nothing, or hinder more freedom and production.\",\n", " '-1']]" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# delete the first element (header) in the data list\n", "del data[0]\n", "data[:3]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preprocessing" ] }, { "cell_type": "code", "execution_count": 145, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[\"I suspect atheists are projecting their desires when they imagine Obama is one of their number. Does anyone remember the crazy preacher with whom he was associated? \\nI can understand a career politician in the USA needing to feign belief to get elected, but for that purpose I'd imagine a more vanilla choice of church. \\n\\n\\nHe's not an atheist. He's not a liberal either.\", '-1'], ['It\\'s funny how the arguments the shills are making here are still so close to the racist remarks the GOP has already admitted to. Always attacking \"lazy minorities and young people. \" \\n\\n>\\xe2\\x80\\x9c[i]f it hurts a bunch of college kids that\\xe2\\x80\\x99s too lazy to get up off their bohunkus [sic] and get a photo ID, so be it,\\xe2\\x80\\x9d and \\xe2\\x80\\x9cif it hurts a bunch of lazy blacks that wants the government to give them everything, so be it. \\xe2\\x80\\x9d \\xe2\\x80\\x9cthe law is going to kick the Democrats in the butt. \\xe2\\x80\\x9d', '-1'], [\"We are truly following the patterns of how the mandarins took over empires, not because of the sword, but because control of the endless paper, and regulations, that do more to stagnate most people's lives, then to do anything productive. \\n\\nBut then because they don't see what else they can do, they write up even more laws and regulations, that either do nothing, or hinder more freedom and production.\", '-1']]\n" ] } ], "source": [ "# remove url from texts\n", "import re\n", "for row in data:\n", " row[0] = re.sub(r'^https?:\\/\\/.*[\\r\\n]*', '', row[0], flags=re.MULTILINE)\n", "print data[:3]" ] }, { "cell_type": "code", "execution_count": 146, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\"I suspect atheists are projecting their desires when they imagine Obama is one of their number. Does anyone remember the crazy preacher with whom he was associated? \\nI can understand a career politician in the USA needing to feign belief to get elected, but for that purpose I'd imagine a more vanilla choice of church. \\n\\n\\nHe's not an atheist. He's not a liberal either.\", 'It\\'s funny how the arguments the shills are making here are still so close to the racist remarks the GOP has already admitted to. Always attacking \"lazy minorities and young people. \" \\n\\n>\\xe2\\x80\\x9c[i]f it hurts a bunch of college kids that\\xe2\\x80\\x99s too lazy to get up off their bohunkus [sic] and get a photo ID, so be it,\\xe2\\x80\\x9d and \\xe2\\x80\\x9cif it hurts a bunch of lazy blacks that wants the government to give them everything, so be it. \\xe2\\x80\\x9d \\xe2\\x80\\x9cthe law is going to kick the Democrats in the butt. \\xe2\\x80\\x9d', \"We are truly following the patterns of how the mandarins took over empires, not because of the sword, but because control of the endless paper, and regulations, that do more to stagnate most people's lives, then to do anything productive. \\n\\nBut then because they don't see what else they can do, they write up even more laws and regulations, that either do nothing, or hinder more freedom and production.\"]\n", "['-1', '-1', '-1']\n" ] } ], "source": [ "data_texts = [] # build a list to store texts\n", "data_labels = [] # build a list to store labels\n", "for row in data:\n", " data_texts.append(row[0])\n", " data_labels.append(row[1])\n", "print data_texts[:3]\n", "print data_labels[:3]" ] }, { "cell_type": "code", "execution_count": 147, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "537\n", "1412\n" ] } ], "source": [ "# check the counts of ironic (labeled as -1) and non-ironic texts (labeled as 1)\n", "print data_labels.count('1')\n", "print data_labels.count('-1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, in this dataset we have 1412 non-ironic texts and 537 ironic texts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vectorization" ] }, { "cell_type": "code", "execution_count": 148, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "vectorizer = TfidfVectorizer(ngram_range = (1,2),min_df=5,max_df=0.8, sublinear_tf=True,use_idf=True)\n", "\n", "features = vectorizer.fit_transform(data_texts)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split training and testing dataset" ] }, { "cell_type": "code", "execution_count": 149, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(features, data_labels, test_size=0.2, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import Classifier - SVM" ] }, { "cell_type": "code", "execution_count": 150, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n", " decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',\n", " max_iter=-1, probability=False, random_state=None, shrinking=True,\n", " tol=0.001, verbose=False)" ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import svm\n", "clf = svm.SVC()\n", "clf.fit(X_train,y_train)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Evaluation - SVM" ] }, { "cell_type": "code", "execution_count": 151, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy score of SVM model:\n", "0.725641025641\n", " precision recall f1-score support\n", "\n", " -1 0.73 1.00 0.84 283\n", " 1 0.00 0.00 0.00 107\n", "\n", "avg / total 0.53 0.73 0.61 390\n", "\n" ] } ], "source": [ "from sklearn.metrics import accuracy_score\n", "\n", "predicted = clf.predict(X_test)\n", "# print the accuracy score\n", "from sklearn.metrics import accuracy_score\n", "print(\"Accuracy score of SVM model:\\n\"+ str(accuracy_score(y_test,predicted)))\n", "\n", "# print evaluation report showing precision, recall, f1, support\n", "from sklearn.metrics import classification_report\n", "print(classification_report(y_test, predicted))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import Classifier - Naive Bayes\n", "The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work." ] }, { "cell_type": "code", "execution_count": 152, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)" ] }, "execution_count": 152, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.naive_bayes import MultinomialNB\n", "\n", "mnb = MultinomialNB()\n", "mnb.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluation-Naive Bayes" ] }, { "cell_type": "code", "execution_count": 153, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy score of Naive Bayes model:\n", "0.728205128205\n", " precision recall f1-score support\n", "\n", " -1 0.73 1.00 0.84 283\n", " 1 1.00 0.01 0.02 107\n", "\n", "avg / total 0.80 0.73 0.62 390\n", "\n" ] } ], "source": [ "from sklearn.metrics import classification_report\n", "mnb_predict = mnb.predict(X_test)\n", "print(\"Accuracy score of Naive Bayes model:\\n\"+ str(accuracy_score(y_test,mnb_predict)))\n", "\n", "print(classification_report(y_test, mnb_predict))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accuracy VS Precision\n", "- Accuracy - Accuracy is the most intuitive performance measure and it is simply a ratio of correctly predicted observation to the total observations. One may think that, if we have high accuracy then our model is best. Yes, accuracy is a great measure but only when you have symmetric datasets where values of false positive and false negatives are almost same. Therefore, you have to look at other parameters to evaluate the performance of your model.\n", "\n", "\n", "- Precision - Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question that this metric answer is of all passengers that labeled as survived, how many actually survived? High precision relates to the low false positive rate. \n", "\n", "Since my dataset is not a symmetric dataset, so I consider to take a look at the precision score.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "In my experiments, the 'Precision' score changes when I adjust the parameter 'ngram_range' in the Vectorization procedure.\n", "\n", "##### What is 'ngram_range'?\n", "The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.\n", "\n", "e.g.\n", "\n", "text = \"I do not know what you mean\"\n", "\n", "If we set ngram_range = (1,1), the vectorizer only vectorize the vocabulary to include 1-gram.\n", "\n", "So the vocabulary includes \"I\"\"do\"\"not\"\"know\"\"what\"\"you\"\"mean\"\n", "\n", "If we set ngram_range = (2,2), the vectorizer only vectorize the vocabulary to include 2-grams.\n", "\n", "So the vocabulary includes \"I do\"\"do not\"\"not know\"\"know what\"\"what you\"\"you\"\"mean\"\n", "\n", "If we set ngram_range = (1,2), \n", "\n", "we will get:\n", "\"I\"\"do\"\"not\"\"know\"\"what\"\"you\"\"mean\"\"I do\"\"do not\"\"not know\"\"know what\"\"what you\"\"you\"\"mean\"\n", "\n", "##### Here I record the precision scores:\n", " \n", "- ngram_range = (1,1):\n", "\n", "SVM:0.54 NB: 0.54\n", "\n", "- ngram_range = (2,2):\n", "\n", "SVM:0.54 NB: 0.68\n", "\n", "- ngram_range = (3,3):\n", "\n", "SVM:0.54 NB: 0.61 \n", "\n", "- ngram_range = (4,4):\n", "\n", "SVM:0.54 NB: 0.54\n", "\n", "- ngram_range = (1,2):\n", "\n", "SVM: 0.54 NB: 0.81\n", "\n", "- ngram_range = (1,3):\n", "\n", "SVM: 0.54 NB: 0.81\n", "\n", "- ngram_range = (2,3):\n", "\n", "SVM: 0.54 NB: 0.68\n", "\n", "Obviously, the precision score of Naive Bayes model reaches to higher value when set the ngram_range as (1,2) and (1,3), which means when the model identify an ironic text, the combination of unigram and bigram will help the model to perform better.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Next\n", "- Apart from the ngrams, what features can help to detect irony ?\n", " - Features used in previous study:\n", " - ngram\n", " - sentiments (ironic text maybe more negtive than non-ironic?)\n", " - topics\n", " - written-spoken style (We de- signed this set of features to explore the unexpect- edness created by using spoken style words in a mainly written style tweet or vice versa (formal words usually adopted in written text employed in a spoken style context). )\n", " - Hyperbole (indicates the occurrence of a sequence of three positive or negative words in a row)\n", " - Punctuation (presence of an ellipses as well as multiple question or excla- mation marks or a combination of the latter two)\n", " \n", "- How to construct mutiple features?\n", "- How to use multiple features in a model?" ] }, { "cell_type": "code", "execution_count": 156, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "1\n", "107\n" ] } ], "source": [ "print list(mnb_predict).count('1')\n", "print list(y_test).count('1')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }