### Data source
This [dataset](https://www.kaggle.com/rtatman/ironic-corpus) contains 1950 comments, which have been labeled as ironic (1) or not ironic (-1) by human annotators. The text was taken from Reddit comments.
 

In [143]:
import csv
data = []
with open('irony-labeled.csv') as datafile:
    csvReader = csv.reader(datafile)
    for row in csvReader:
        data.append(row)


In [144]:
# delete the first element (header) in the data list
del data[0]
data[:3]


[["I suspect atheists are projecting their desires when they imagine Obama is one of their number. Does anyone remember the crazy preacher with whom he was associated? \nhttp://www. examiner. com/article/obama-and-wright-throw-each-other-under-the-bus\n\nI can understand a career politician in the USA needing to feign belief to get elected, but for that purpose I'd imagine a more vanilla choice of church. \n\n\nHe's not an atheist. He's not a liberal either.",
 '-1'],
 ['It\'s funny how the arguments the shills are making here are still so close to the racist remarks the GOP has already admitted to. Always attacking "lazy minorities and young people. " \n\n>\xe2\x80\x9c[i]f it hurts a bunch of college kids that\xe2\x80\x99s too lazy to get up off their bohunkus [sic] and get a photo ID, so be it,\xe2\x80\x9d and \xe2\x80\x9cif it hurts a bunch of lazy blacks that wants the government to give them everything, so be it. \xe2\x80\x9d \xe2\x80\x9cthe law is going to kick the Democrats in t

### Preprocessing

In [145]:
# remove URLs that sit at the start of a line (^ matches at every line start with re.MULTILINE)
import re
for row in data:
    row[0] = re.sub(r'^https?:\/\/.*[\r\n]*', '', row[0], flags=re.MULTILINE)
print(data[:3])

[["I suspect atheists are projecting their desires when they imagine Obama is one of their number. Does anyone remember the crazy preacher with whom he was associated? \nI can understand a career politician in the USA needing to feign belief to get elected, but for that purpose I'd imagine a more vanilla choice of church. \n\n\nHe's not an atheist. He's not a liberal either.", '-1'], ['It\'s funny how the arguments the shills are making here are still so close to the racist remarks the GOP has already admitted to. Always attacking "lazy minorities and young people. " \n\n>\xe2\x80\x9c[i]f it hurts a bunch of college kids that\xe2\x80\x99s too lazy to get up off their bohunkus [sic] and get a photo ID, so be it,\xe2\x80\x9d and \xe2\x80\x9cif it hurts a bunch of lazy blacks that wants the government to give them everything, so be it. \xe2\x80\x9d \xe2\x80\x9cthe law is going to kick the Democrats in the butt. \xe2\x80\x9d', '-1'], ["We are truly following the patterns of how the mandari
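The pattern above is anchored with `^`, so it only removes URLs that begin a line. Here is a small sketch of that behaviour on a made-up string. `ANYWHERE_URL` is a hypothetical alternative pattern of mine, not something the notebook uses, and the sample text is invented for illustration.

```python
import re

# Pattern copied from the cell above; with re.MULTILINE, ^ matches at the start of each line.
LINE_START_URL = r'^https?:\/\/.*[\r\n]*'
# Hypothetical alternative (an assumption, not from the notebook): match a URL anywhere.
ANYWHERE_URL = r'https?://\S+'

sample = "see this\nhttp://example.com/a\nand also http://example.com/b here"

# Removes the second line entirely, but leaves the URL in the middle of the third line.
line_start_only = re.sub(LINE_START_URL, '', sample, flags=re.MULTILINE)
# Removes both URLs, but keeps the surrounding whitespace and line breaks.
anywhere = re.sub(ANYWHERE_URL, '', sample)

print(repr(line_start_only))
print(repr(anywhere))
```

The two patterns trade off differently. The first comment in this dataset has a URL with spaces inside it (`http://www. examiner. com/...`), which the line-based pattern removes completely, while a `\S+` pattern would stop at the first space and leave fragments behind.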

In [146]:
data_texts = [] # build a list to store texts
data_labels = [] # build a list to store labels
for row in data:
 data_texts.append(row[0])
 data_labels.append(row[1])
print(data_texts[:3])
print(data_labels[:3])

["I suspect atheists are projecting their desires when they imagine Obama is one of their number. Does anyone remember the crazy preacher with whom he was associated? \nI can understand a career politician in the USA needing to feign belief to get elected, but for that purpose I'd imagine a more vanilla choice of church. \n\n\nHe's not an atheist. He's not a liberal either.", 'It\'s funny how the arguments the shills are making here are still so close to the racist remarks the GOP has already admitted to. Always attacking "lazy minorities and young people. " \n\n>\xe2\x80\x9c[i]f it hurts a bunch of college kids that\xe2\x80\x99s too lazy to get up off their bohunkus [sic] and get a photo ID, so be it,\xe2\x80\x9d and \xe2\x80\x9cif it hurts a bunch of lazy blacks that wants the government to give them everything, so be it. \xe2\x80\x9d \xe2\x80\x9cthe law is going to kick the Democrats in the butt. \xe2\x80\x9d', "We are truly following the patterns of how the mandarins took over empi

In [147]:
# check the counts of ironic (labeled as 1) and non-ironic texts (labeled as -1)
print(data_labels.count('1'))
print(data_labels.count('-1'))

537
1412


So this dataset has 1412 non-ironic texts and 537 ironic texts (1949 labeled rows in total). The classes are imbalanced: about 72% of the comments are not ironic.
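With counts this uneven, it helps to know what accuracy a model would get by always predicting the larger class. The sketch below works that out from the two counts printed above. The variable names are mine and purely illustrative.

```python
from collections import Counter

# Class counts printed by the cell above ('1' = ironic, '-1' = not ironic).
label_counts = Counter({'-1': 1412, '1': 537})

total = sum(label_counts.values())
# most_common(1) returns the single (label, count) pair with the highest count.
majority_label, majority_count = label_counts.most_common(1)[0]
# Accuracy of a "model" that always answers with the majority label.
baseline_accuracy = majority_count / float(total)

print("always predict %s -> accuracy %.3f" % (majority_label, baseline_accuracy))
```

Any classifier on this data should be compared against that baseline of roughly 0.72 rather than against 0.5.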

### Vectorization

In [148]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5, max_df=0.8, sublinear_tf=True, use_idf=True)

features = vectorizer.fit_transform(data_texts)


### Split training and testing dataset

In [149]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, data_labels, test_size=0.2, random_state=42)

### Import Classifier - SVM

In [150]:
from sklearn import svm
clf = svm.SVC()
clf.fit(X_train,y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
 decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
 max_iter=-1, probability=False, random_state=None, shrinking=True,
 tol=0.001, verbose=False)

### Evaluation - SVM

In [151]:
from sklearn.metrics import accuracy_score

predicted = clf.predict(X_test)
# print the accuracy score
print("Accuracy score of SVM model:\n"+ str(accuracy_score(y_test,predicted)))

# print evaluation report showing precision, recall, f1, support
from sklearn.metrics import classification_report
print(classification_report(y_test, predicted))


Accuracy score of SVM model:
0.725641025641
             precision    recall  f1-score   support

         -1       0.73      1.00      0.84       283
          1       0.00      0.00      0.00       107

avg / total       0.53      0.73      0.61       390



### Import Classifier - Naive Bayes
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

In [152]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### Evaluation - Naive Bayes

In [153]:
from sklearn.metrics import classification_report
mnb_predict = mnb.predict(X_test)
print("Accuracy score of Naive Bayes model:\n"+ str(accuracy_score(y_test,mnb_predict)))

print(classification_report(y_test, mnb_predict))

Accuracy score of Naive Bayes model:
0.728205128205
             precision    recall  f1-score   support

         -1       0.73      1.00      0.84       283
          1       1.00      0.01      0.02       107

avg / total       0.80      0.73      0.62       390



### Accuracy vs. Precision
- Accuracy - Accuracy is the most intuitive performance measure: the ratio of correctly predicted observations to all observations. It is tempting to conclude that a model with high accuracy is a good model, but accuracy is only a reliable guide when the classes are roughly balanced and false positives and false negatives occur at similar rates. Otherwise, you have to look at other metrics to evaluate the performance of your model.

- Precision - Precision is the ratio of correctly predicted positive observations to all predicted positive observations. The question this metric answers is: of all the comments the model labeled as ironic, how many are actually ironic? High precision corresponds to a low false positive rate.

Since my dataset is imbalanced, I also look at the precision score.
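As a quick check of these two definitions, here is a minimal sketch in plain Python. The confusion counts are my reading of the two classification reports above (for example, Naive Bayes recall of 0.01 on 107 ironic comments means 1 true positive), not values printed by scikit-learn. The function and variable names are only for illustration.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / float(tp + fp + fn + tn)

def precision(tp, fp):
    """Share of predicted positives that are truly positive.

    Returns None when nothing was predicted positive, because the ratio is 0/0.
    """
    predicted_positive = tp + fp
    if predicted_positive == 0:
        return None
    return tp / float(predicted_positive)

# Assumed counts, inferred from the reports, with '1' (ironic) as the positive class.
svm_counts = dict(tp=0, fp=0, fn=107, tn=283)  # SVM never predicts '1'
nb_counts = dict(tp=1, fp=0, fn=106, tn=283)   # Naive Bayes predicts '1' once, correctly

print(accuracy(**svm_counts), precision(svm_counts['tp'], svm_counts['fp']))
print(accuracy(**nb_counts), precision(nb_counts['tp'], nb_counts['fp']))
```

Under these assumed counts both models reach about 0.73 accuracy while finding at most one ironic comment. The SVM's precision on the ironic class is undefined, which the report shows as 0.00, and the Naive Bayes precision of 1.00 rests on a single prediction.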




In my experiments, the 'Precision' score changes when I adjust the parameter 'ngram_range' in the Vectorization procedure.

##### What is 'ngram_range'?
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
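To make the definition concrete, here is a plain-Python sketch that lists the word n-grams for a given `ngram_range`. It is not how `TfidfVectorizer` is implemented. `word_ngrams` is a hypothetical helper of mine that splits on whitespace only, and the sample sentence is made up.

```python
def word_ngrams(text, ngram_range):
    """Return all word n-grams with min_n <= n <= max_n, in order of n."""
    tokens = text.split()  # naive whitespace tokenization (an assumption)
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # slide a window of n tokens across the sentence
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

sample = "not bad at all"
print(word_ngrams(sample, (1, 1)))  # unigrams only
print(word_ngrams(sample, (2, 2)))  # bigrams only
print(word_ngrams(sample, (1, 2)))  # unigrams followed by bigrams
```

The real vectorizer can produce a different vocabulary from this sketch. With its default settings, `TfidfVectorizer` lowercases the text and keeps only tokens of two or more characters, so a one-letter word such as "I" is dropped, and `min_df` and `max_df` filter the vocabulary further.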

e.g.

text = "I do not know what you mean"

If we set ngram_range = (1,1), the vectorizer builds its vocabulary from 1-grams (single words) only.

So the vocabulary includes "I", "do", "not", "know", "what", "you", "mean".

If we set ngram_range = (2,2), the vectorizer builds its vocabulary from 2-grams only.

So the vocabulary includes "I do", "do not", "not know", "know what", "what you", "you mean".

If we set ngram_range = (1,2), the vocabulary combines both, and we will get:

"I", "do", "not", "know", "what", "you", "mean", "I do", "do not", "not know", "know what", "what you", "you mean".

##### Here I record the precision scores:
 
- ngram_range = (1,1):

SVM:0.54 NB: 0.54

- ngram_range = (2,2):

SVM:0.54 NB: 0.68

- ngram_range = (3,3):

SVM:0.54 NB: 0.61 

- ngram_range = (4,4):

SVM:0.54 NB: 0.54

- ngram_range = (1,2):

SVM: 0.54 NB: 0.81

- ngram_range = (1,3):

SVM: 0.54 NB: 0.81

- ngram_range = (2,3):

SVM: 0.54 NB: 0.68

The precision score of the Naive Bayes model is highest (0.81) when ngram_range is set to (1,2) or (1,3), while the SVM score stays at 0.54 for every setting. This suggests that combining unigrams and bigrams helps the Naive Bayes model when it identifies ironic text.
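The scores recorded above appear to be close to the support-weighted "avg / total" precision in the classification reports (0.53 for SVM and 0.80 for Naive Bayes at ngram_range = (1,2)), rather than precision on the ironic class alone. That is my inference from the reports, not something the notebook states. The sketch below recomputes that average from the per-class rows. `weighted_precision` is a hypothetical helper, not a scikit-learn function.

```python
def weighted_precision(per_class):
    """Average per-class precision, weighting each class by its support.

    per_class: list of (precision, support) pairs, one per label.
    """
    total_support = sum(support for _, support in per_class)
    return sum(p * support for p, support in per_class) / float(total_support)

# (precision, support) rows copied from the reports for ngram_range = (1,2).
svm_rows = [(0.73, 283), (0.00, 107)]  # labels -1 and 1
nb_rows = [(0.73, 283), (1.00, 107)]   # labels -1 and 1

svm_avg = weighted_precision(svm_rows)
nb_avg = weighted_precision(nb_rows)
print("SVM %.2f, NB %.2f" % (svm_avg, nb_avg))
```

Read this way, the higher Naive Bayes average comes from the ironic class moving from 0.00 to 1.00 precision, which with a recall of 0.01 corresponds to roughly one ironic prediction. It may be worth checking per-class precision and recall before concluding that bigrams help.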


### Next
- Apart from n-grams, what features can help to detect irony?
  - Features used in previous studies:
    - n-grams
    - sentiment (ironic text may be more negative than non-ironic text?)
    - topics
    - written-spoken style (We designed this set of features to explore the unexpectedness created by using spoken style words in a mainly written style tweet or vice versa (formal words usually adopted in written text employed in a spoken style context).)
    - hyperbole (indicates the occurrence of a sequence of three positive or negative words in a row)
    - punctuation (presence of an ellipsis as well as multiple question or exclamation marks or a combination of the latter two)

- How to construct multiple features?
- How to use multiple features in a model?
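As a first step toward the punctuation feature in the list above, here is a minimal sketch that turns a comment into a few binary cues. The feature names and the exact regular expressions are my own assumptions about how to read that description, not the definitions used in the earlier studies.

```python
import re

def punctuation_features(text):
    """Binary punctuation cues for one comment (hypothetical feature names)."""
    return {
        # three or more consecutive dots count as an ellipsis
        "has_ellipsis": bool(re.search(r"\.{3,}", text)),
        # two or more question marks in a row
        "multi_question": bool(re.search(r"\?{2,}", text)),
        # two or more exclamation marks in a row
        "multi_exclaim": bool(re.search(r"!{2,}", text)),
        # a question mark directly next to an exclamation mark, in either order
        "mixed_question_exclaim": bool(re.search(r"\?!|!\?", text)),
    }

example = punctuation_features("Oh great... another Monday?!")
print(example)
```

Each comment then yields a small dictionary of flags. These could later be placed alongside the tf-idf columns, which is the open question about combining multiple features.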

In [156]:
print(list(mnb_predict).count('1'))
print(list(y_test).count('1'))

0
1
107
