Social Media Sentiment Analysis

by Deepak Das

Problem Statement

The dataset contains several tweets, each labelled with the positive or negative sentiment associated with it.

  • Cyberbullying and hate speech have been a menace for a long time, so our objective for this task is to detect tweets associated with negative sentiment. For this dataset, we classify a tweet as hate speech if it contains racist or sexist content.

  • Our task, therefore, is to separate racist and sexist tweets from the rest and filter them out.

Dataset Description

  • The data is in CSV format. In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text: each line of the file is a data record, and each record consists of one or more fields separated by commas (an illustrative example follows below).
  • Formally, given a training sample of tweets and labels, where label ‘1’ denotes that a tweet is racist/sexist and label ‘0’ denotes that it is not, our objective is to predict the labels on the given test dataset.
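
For illustration, the first few records of such a file might look like the lines below (these rows are made up for illustration and are not actual rows from the dataset):

id,label,tweet
1,0,what a lovely sunny morning #goodvibes
2,1,some hateful racist remark #example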

Attribute Information

  • id : The id associated with each tweet in the given dataset
  • tweet : The tweet collected from various sources, with either a positive or a negative sentiment associated with it
  • label : A tweet with label '0' is of positive sentiment, while a tweet with label '1' is of negative sentiment

Importing the necessary packages

In [1]:
import re
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
import string
import nltk
import warnings 
warnings.filterwarnings("ignore", category=DeprecationWarning)

%matplotlib inline

Train dataset used for our analysis

In [2]:
train = pd.read_csv('https://raw.githubusercontent.com/dD2405/Twitter_Sentiment_Analysis/master/train.csv')

We make a copy of the training data so that even if we make changes to this dataset, we do not lose the original data.

In [3]:
train_original=train.copy()

Here we see that there are a total of 31962 tweets in the training dataset

In [4]:
train.shape
Out[4]:
(31962, 3)
In [5]:
train_original
Out[5]:
id label tweet
0 1 0 @user when a father is dysfunctional and is s...
1 2 0 @user @user thanks for #lyft credit i can't us...
2 3 0 bihday your majesty
3 4 0 #model i love u take with u all the time in ...
4 5 0 factsguide: society now #motivation
5 6 0 [2/2] huge fan fare and big talking before the...
6 7 0 @user camping tomorrow @user @user @user @use...
7 8 0 the next school year is the year for exams.ðŸ˜...
8 9 0 we won!!! love the land!!! #allin #cavs #champ...
9 10 0 @user @user welcome here ! i'm it's so #gr...
10 11 0 ↝ #ireland consumer price index (mom) climb...
11 12 0 we are so selfish. #orlando #standwithorlando ...
12 13 0 i get to see my daddy today!! #80days #getti...
13 14 1 @user #cnn calls #michigan middle school 'buil...
14 15 1 no comment! in #australia #opkillingbay #se...
15 16 0 ouch...junior is angry😐#got7 #junior #yugyo...
16 17 0 i am thankful for having a paner. #thankful #p...
17 18 1 retweet if you agree!
18 19 0 its #friday! 😀 smiles all around via ig use...
19 20 0 as we all know, essential oils are not made of...
20 21 0 #euro2016 people blaming ha for conceded goal ...
21 22 0 sad little dude.. #badday #coneofshame #cats...
22 23 0 product of the day: happy man #wine tool who'...
23 24 1 @user @user lumpy says i am a . prove it lumpy.
24 25 0 @user #tgif #ff to my #gamedev #indiedev #i...
25 26 0 beautiful sign by vendor 80 for $45.00!! #upsi...
26 27 0 @user all #smiles when #media is !! 😜ðŸ˜...
27 28 0 we had a great panel on the mediatization of t...
28 29 0 happy father's day @user 💓💓💓💓
29 30 0 50 people went to nightclub to have a good nig...
... ... ... ...
31932 31933 0 @user thanks gemma
31933 31934 1 @user judd is a & #homophobic #freemilo #...
31934 31935 1 lady banned from kentucky mall. @user #jcpenn...
31935 31936 0 ugh i'm trying to enjoy my happy hour drink &a...
31936 31937 0 want to know how to live a life? do more thi...
31937 31938 0 love island 💔
31938 31939 0 my fav actor #vijaysethupathi ! my fav actress...
31939 31940 0 whew 😅 it's a productive and #friday!!!
31940 31941 0 @user she's finally here! @user
31941 31942 0 passed first year of uni #yay #love #pass #uni...
31942 31943 0 this week is flying by #humpday - #wednesday...
31943 31944 0 @user modeling photoshoot this friday yay #mo...
31944 31945 0 you're surrounded by people who love you (even...
31945 31946 0 feel like... 😝🐶😎 #dog #summer #hot #h...
31946 31947 1 @user omfg i'm offended! i'm a mailbox and i'...
31947 31948 1 @user @user you don't have the balls to hashta...
31948 31949 1 makes you ask yourself, who am i? then am i a...
31949 31950 0 hear one of my new songs! don't go - katie ell...
31950 31951 0 @user you can try to 'tail' us to stop, 'butt...
31951 31952 0 i've just posted a new blog: #secondlife #lone...
31952 31953 0 @user you went too far with @user
31953 31954 0 good morning #instagram #shower #water #berlin...
31954 31955 0 #holiday bull up: you will dominate your bul...
31955 31956 0 less than 2 weeks 😅🙏🏼🍹😎🎵 @us...
31956 31957 0 off fishing tomorrow @user carnt wait first ti...
31957 31958 0 ate @user isz that youuu?😍😍😍😍😍ð...
31958 31959 0 to see nina turner on the airwaves trying to...
31959 31960 0 listening to sad songs on a monday morning otw...
31960 31961 1 @user #sikh #temple vandalised in in #calgary,...
31961 31962 0 thank you @user for you follow

31962 rows × 3 columns

Test dataset used for our analysis

In [6]:
test = pd.read_csv('https://raw.githubusercontent.com/dD2405/Twitter_Sentiment_Analysis/master/test.csv')

We make a copy of the test data so that even if we make changes to this dataset, we do not lose the original data.

In [7]:
test_original=test.copy()

Here we see that there are a total of 17197 tweets in the test dataset

In [8]:
test.shape
Out[8]:
(17197, 2)
In [9]:
test_original
Out[9]:
id tweet
0 31963 #studiolife #aislife #requires #passion #dedic...
1 31964 @user #white #supremacists want everyone to s...
2 31965 safe ways to heal your #acne!! #altwaystohe...
3 31966 is the hp and the cursed child book up for res...
4 31967 3rd #bihday to my amazing, hilarious #nephew...
5 31968 choose to be :) #momtips
6 31969 something inside me dies 💦💿✨ eyes nes...
7 31970 #finished#tattoo#inked#ink#loveit❤️ #❤ï¸...
8 31971 @user @user @user i will never understand why...
9 31972 #delicious #food #lovelife #capetown mannaep...
10 31973 1000dayswasted - narcosis infinite ep.. make m...
11 31974 one of the world's greatest spoing events #l...
12 31975 half way through the website now and #allgoing...
13 31976 good food, good life , #enjoy and 🙌🍕ðŸ...
14 31977 i'll stand behind this #guncontrolplease #se...
15 31978 i ate,i ate and i ate...😀😊 #jamaisasth...
16 31979 @user got my @user limited edition rain or sh...
17 31980 & #love & #hugs & #kisses too! how...
18 31981 👭🌞💖 #girls #sun #fave @ london, uni...
19 31982 thought factory: bbc neutrality on right wing ...
20 31983 hey guys tommorow is the last day of my exams ...
21 31984 @user @user @user #levyrroni #recuerdos mem...
22 31985 my mind is like 🎉💃🏽🏀 but my body l...
23 31986 never been this down on myself in my entire li...
24 31987 check twitterww - trends: "trending worldwide...
25 31988 i thought i saw a mermaid!!! #ceegee #smcr ...
26 31989 chick gets fucked hottest naked lady
27 31990 happy bday lucy✨✨🎈 xoxo #love #beautifu...
28 31991 haroldfriday have a weekend filled with sunbe...
29 31992 @user @user tried that! but nothing - will try...
... ... ...
17167 49130 people do anything for fucking attention nowad...
17168 49131 creative bubble got burst 😢 looking forward...
17169 49132 tomorrow is gonna be a big day! we are going t...
17170 49133 i am thankful for baby giggles. #thankful #pos...
17171 49134 #model i love u take with u all the time in ...
17172 49135 in life u will grow to learn some pple will wo...
17173 49136 💙i was the storm,you were the rain. togethe...
17174 49137 lovelgq - broken ep via #rnb #love #heabrok...
17175 49138 spread love not hate❤️💛💚💙💜 #pr...
17176 49139 @user @user are the most racist pay ever!!!!!
17177 49140 i am thankful for children. #thankful #positiv...
17178 49141 liverpool ❤️🇬🇧 #walk #liverpool #sta...
17179 49142 #bakersfield rooster simulation: i want to c...
17180 49143 por do sol 󾀋❤️#instagood #beautiful #...
17181 49144 @user hell yeah what a great surprise for your...
17182 49145 when ur the joke ur defensive towards everythi...
17183 49146 #enjoying the #evening #sun in my #bedroom ✨...
17184 49147 tonight on @user from 9pm gmt you can here a ...
17185 49148 today is a good day for excercise #imready #so...
17186 49149 good night with a tea and music ☕️👌🙌...
17187 49150 loving life🇺🇸☀️🏊 #createyourfutu...
17188 49151 black professor demonizes, proposes nazi style...
17189 49152 learn how to think positive. #positive #ins...
17190 49153 we love the pretty, happy and fresh you! #teen...
17191 49154 2_damn_tuff-ruff_muff__techno_city-(ng005)-web...
17192 49155 thought factory: left-right polarisation! #tru...
17193 49156 feeling like a mermaid 😘 #hairflip #neverre...
17194 49157 #hillary #campaigned today in #ohio((omg)) &am...
17195 49158 happy, at work conference: right mindset leads...
17196 49159 my song "so glad" free download! #shoegaze ...

17197 rows × 2 columns

We combine the Train and Test datasets for the pre-processing stage

In [10]:
# DataFrame.append is deprecated in recent pandas versions, so pd.concat is used here
combine = pd.concat([train, test], ignore_index=True, sort=True)
In [11]:
combine.head()
Out[11]:
id label tweet
0 1 0.0 @user when a father is dysfunctional and is s...
1 2 0.0 @user @user thanks for #lyft credit i can't us...
2 3 0.0 bihday your majesty
3 4 0.0 #model i love u take with u all the time in ...
4 5 0.0 factsguide: society now #motivation
In [12]:
combine.tail()
Out[12]:
id label tweet
49154 49155 NaN thought factory: left-right polarisation! #tru...
49155 49156 NaN feeling like a mermaid 😘 #hairflip #neverre...
49156 49157 NaN #hillary #campaigned today in #ohio((omg)) &am...
49157 49158 NaN happy, at work conference: right mindset leads...
49158 49159 NaN my song "so glad" free download! #shoegaze ...

Data Pre-Processing

Removing Twitter Handles (@user)

Given below is a user-defined function to remove unwanted text patterns from the tweets. It takes two arguments, one is the original string of text and the other is the pattern of text that we want to remove from the string. The function returns the same input string but without the given pattern. We will use this function to remove the pattern ‘@user’ from all the tweets in our data.

In [13]:
def remove_pattern(text,pattern):
    
    # re.findall() finds the pattern i.e @user and puts it in a list for further task
    r = re.findall(pattern,text)
    
    # re.sub() removes @user from the sentences in the dataset
    for i in r:
        text = re.sub(i,"",text)
    
    return text
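
# A quick sanity check of the function on a made-up tweet
# (the text below is hypothetical and not taken from the dataset):
print(remove_pattern("@user thanks @user for the follow", r"@\w*"))
# -> ' thanks  for the follow'  (both handles are stripped, the rest of the text is kept)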
        
In [14]:
combine['Tidy_Tweets'] = np.vectorize(remove_pattern)(combine['tweet'], r"@\w*")

combine.head()
Out[14]:
id label tweet Tidy_Tweets
0 1 0.0 @user when a father is dysfunctional and is s... when a father is dysfunctional and is so sel...
1 2 0.0 @user @user thanks for #lyft credit i can't us... thanks for #lyft credit i can't use cause th...
2 3 0.0 bihday your majesty bihday your majesty
3 4 0.0 #model i love u take with u all the time in ... #model i love u take with u all the time in ...
4 5 0.0 factsguide: society now #motivation factsguide: society now #motivation

Removing Punctuations, Numbers, and Special Characters

Punctuation, numbers and special characters do not help much. It is better to remove them from the text, just as we removed the Twitter handles. Here we will replace everything except letters and the '#' symbol with spaces.
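
As a quick illustration on a made-up string (using the re module imported at the top of the notebook), the same substitution shows what survives the cleaning:

sample = "good food, good life! 100% #enjoy :)"
print(re.sub("[^a-zA-Z#]", " ", sample))
# punctuation, digits and symbols are replaced by spaces; letters and '#' are kept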

In [15]:
combine['Tidy_Tweets'] = combine['Tidy_Tweets'].str.replace("[^a-zA-Z#]", " ", regex=True)
In [16]:
combine.head(10)
Out[16]:
id label tweet Tidy_Tweets
0 1 0.0 @user when a father is dysfunctional and is s... when a father is dysfunctional and is so sel...
1 2 0.0 @user @user thanks for #lyft credit i can't us... thanks for #lyft credit i can t use cause th...
2 3 0.0 bihday your majesty bihday your majesty
3 4 0.0 #model i love u take with u all the time in ... #model i love u take with u all the time in ...
4 5 0.0 factsguide: society now #motivation factsguide society now #motivation
5 6 0.0 [2/2] huge fan fare and big talking before the... huge fan fare and big talking before the...
6 7 0.0 @user camping tomorrow @user @user @user @use... camping tomorrow danny
7 8 0.0 the next school year is the year for exams.ðŸ˜... the next school year is the year for exams ...
8 9 0.0 we won!!! love the land!!! #allin #cavs #champ... we won love the land #allin #cavs #champ...
9 10 0.0 @user @user welcome here ! i'm it's so #gr... welcome here i m it s so #gr

Removing Short Words

We have to be a little careful here in selecting the length of the words which we want to remove. So, I have decided to remove all the words having length 3 or less. For example, terms like “hmm”, “oh” are of very little use. It is better to get rid of them.
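
As a quick illustration on a made-up sentence, the same list comprehension keeps only the words longer than three characters:

sample = "oh hmm this is so very disappointing"
print(' '.join([w for w in sample.split() if len(w) > 3]))
# 'this very disappointing'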

In [17]:
combine['Tidy_Tweets'] = combine['Tidy_Tweets'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))

combine.head(10)
Out[17]:
id label tweet Tidy_Tweets
0 1 0.0 @user when a father is dysfunctional and is s... when father dysfunctional selfish drags kids i...
1 2 0.0 @user @user thanks for #lyft credit i can't us... thanks #lyft credit cause they offer wheelchai...
2 3 0.0 bihday your majesty bihday your majesty
3 4 0.0 #model i love u take with u all the time in ... #model love take with time
4 5 0.0 factsguide: society now #motivation factsguide society #motivation
5 6 0.0 [2/2] huge fan fare and big talking before the... huge fare talking before they leave chaos disp...
6 7 0.0 @user camping tomorrow @user @user @user @use... camping tomorrow danny
7 8 0.0 the next school year is the year for exams.ðŸ˜... next school year year exams think about that #...
8 9 0.0 we won!!! love the land!!! #allin #cavs #champ... love land #allin #cavs #champions #cleveland #...
9 10 0.0 @user @user welcome here ! i'm it's so #gr... welcome here

Tokenization

Now we will tokenize all the cleaned tweets in our dataset. Tokens are individual terms or words, and tokenization is the process of splitting a string of text into tokens.

In [18]:
tokenized_tweet = combine['Tidy_Tweets'].apply(lambda x: x.split())
tokenized_tweet.head()
Out[18]:
0    [when, father, dysfunctional, selfish, drags, ...
1    [thanks, #lyft, credit, cause, they, offer, wh...
2                              [bihday, your, majesty]
3                     [#model, love, take, with, time]
4                   [factsguide, society, #motivation]
Name: Tidy_Tweets, dtype: object

Stemming

Stemming is a rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word. For example, “play”, “player”, “played”, “plays” and “playing” are different variations of the word “play”.
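
A minimal sketch of the Porter stemmer (introduced in the next cell) applied to these variations, printing each word alongside the stem it is reduced to:

from nltk import PorterStemmer

ps = PorterStemmer()
for word in ["play", "player", "played", "plays", "playing"]:
    # print each variant alongside the stem produced by the Porter algorithm
    print(word, "->", ps.stem(word))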

In [19]:
from nltk import PorterStemmer

ps = PorterStemmer()

tokenized_tweet = tokenized_tweet.apply(lambda x: [ps.stem(i) for i in x])

tokenized_tweet.head()
Out[19]:
0    [when, father, dysfunct, selfish, drag, kid, i...
1    [thank, #lyft, credit, caus, they, offer, whee...
2                              [bihday, your, majesti]
3                     [#model, love, take, with, time]
4                         [factsguid, societi, #motiv]
Name: Tidy_Tweets, dtype: object

Now let’s stitch these tokens back together.

In [20]:
for i in range(len(tokenized_tweet)):
    tokenized_tweet[i] = ' '.join(tokenized_tweet[i])

combine['Tidy_Tweets'] = tokenized_tweet
combine.head()
Out[20]:
id label tweet Tidy_Tweets
0 1 0.0 @user when a father is dysfunctional and is s... when father dysfunct selfish drag kid into dys...
1 2 0.0 @user @user thanks for #lyft credit i can't us... thank #lyft credit caus they offer wheelchair ...
2 3 0.0 bihday your majesty bihday your majesti
3 4 0.0 #model i love u take with u all the time in ... #model love take with time
4 5 0.0 factsguide: society now #motivation factsguid societi #motiv

Visualization from Tweets

WordCloud

A wordcloud is a visualization wherein the most frequent words appear in large size and the less frequent words appear in smaller sizes.

Importing Packages necessary for generating a WordCloud

In [21]:
from wordcloud import WordCloud,ImageColorGenerator
from PIL import Image
import urllib
import requests

Store all the words from the non-racist/sexist tweets (label 0) in the dataset

In [22]:
all_words_positive = ' '.join(text for text in combine['Tidy_Tweets'][combine['label']==0])

In the wordcloud generated below, we can see that most of the words are positive or neutral, with happy, smile, and love being the most frequent ones. Hence, most of the frequent words are consistent with the sentiment of the non-racist/sexist tweets.

In [23]:
# combining the image with the dataset
Mask = np.array(Image.open(requests.get('http://clipart-library.com/image_gallery2/Twitter-PNG-Image.png', stream=True).raw))

# We use the ImageColorGenerator library from Wordcloud 
# Here we take the color of the image and impose it over our wordcloud
image_colors = ImageColorGenerator(Mask)

# Now we use the WordCloud function from the wordcloud library 
wc = WordCloud(background_color='black', height=1500, width=4000,mask=Mask).generate(all_words_positive)

# Size of the image generated 
plt.figure(figsize=(10,20))

# Here we recolor the words from the dataset to the image's color
# recolor just recolors the default colors to the image's blue color
# interpolation is used to smooth the image generated 
plt.imshow(wc.recolor(color_func=image_colors),interpolation="hamming")

plt.axis('off')
plt.show()

Store all the words from the racist/sexist tweets (label 1) in the dataset

In [24]:
all_words_negative = ' '.join(text for text in combine['Tidy_Tweets'][combine['label']==1])

As we can clearly see in the wordcloud generated below, most of the words have negative connotations. So, it seems we have pretty good text data to work on.

In [25]:
# combining the image with the dataset
Mask = np.array(Image.open(requests.get('http://clipart-library.com/image_gallery2/Twitter-PNG-Image.png', stream=True).raw))

# We use the ImageColorGenerator library from Wordcloud 
# Here we take the color of the image and impose it over our wordcloud
image_colors = ImageColorGenerator(Mask)

# Now we use the WordCloud function from the wordcloud library 
wc = WordCloud(background_color='black', height=1500, width=4000,mask=Mask).generate(all_words_negative)

# Size of the image generated 
plt.figure(figsize=(10,20))

# Here we recolor the words from the dataset to the image's color
# recolor just recolors the default colors to the image's blue color
# interpolation is used to smooth the image generated 
plt.imshow(wc.recolor(color_func=image_colors),interpolation="gaussian")

plt.axis('off')
plt.show()

Understanding the impact of hashtags on tweet sentiment

Function to extract hashtags from tweets

In [26]:
def Hashtags_Extract(x):
    hashtags=[]
    
    # Loop over the words in the tweet
    for i in x:
        ht = re.findall(r'#(\w+)',i)
        hashtags.append(ht)
    
    return hashtags
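
# Quick check on a couple of made-up tweets: the function returns a nested list,
# which is flattened further below with sum(..., [])
sample_tweets = ["love this #summer #sun", "no hashtags here", "#friday mood"]
print(Hashtags_Extract(sample_tweets))           # [['summer', 'sun'], [], ['friday']]
print(sum(Hashtags_Extract(sample_tweets), []))  # ['summer', 'sun', 'friday']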

A nested list of all the hashtags from the positive tweets in the dataset

In [27]:
ht_positive = Hashtags_Extract(combine['Tidy_Tweets'][combine['label']==0])

Here we unnest the list

In [28]:
ht_positive_unnest = sum(ht_positive,[])

A nested list of all the hashtags from the negative tweets in the dataset

In [29]:
ht_negative = Hashtags_Extract(combine['Tidy_Tweets'][combine['label']==1])

Here we unnest the list

In [30]:
ht_negative_unnest = sum(ht_negative,[])

Plotting BarPlots

For Positive Tweets in the dataset

Counting the frequency of hashtags used in tweets with positive sentiment

In [31]:
word_freq_positive = nltk.FreqDist(ht_positive_unnest)

word_freq_positive
Out[31]:
FreqDist({'love': 1654, 'posit': 917, 'smile': 676, 'healthi': 573, 'thank': 534, 'fun': 463, 'life': 425, 'affirm': 423, 'summer': 390, 'model': 375, ...})

Creating a dataframe for the most frequently used words in hashtags

In [32]:
df_positive = pd.DataFrame({'Hashtags':list(word_freq_positive.keys()),'Count':list(word_freq_positive.values())})
In [33]:
df_positive.head(10)
Out[33]:
Hashtags Count
0 run 72
1 lyft 2
2 disapoint 1
3 getthank 2
4 model 375
5 motiv 202
6 allshowandnogo 1
7 school 30
8 exam 9
9 hate 27

Plotting the barplot for the 20 most frequent hashtags

In [34]:
df_positive_plot = df_positive.nlargest(20,columns='Count') 
In [35]:
sns.barplot(data=df_positive_plot,y='Hashtags',x='Count')
sns.despine()

For Negative Tweets in the dataset

Counting the frequency of hashtags used in tweets with negative sentiment

In [36]:
word_freq_negative = nltk.FreqDist(ht_negative_unnest)
In [37]:
word_freq_negative
Out[37]:
FreqDist({'trump': 136, 'polit': 95, 'allahsoil': 92, 'liber': 81, 'libtard': 77, 'sjw': 75, 'retweet': 63, 'black': 46, 'miami': 46, 'hate': 37, ...})

Creating a dataframe for the most frequently used words in hashtags

In [38]:
df_negative = pd.DataFrame({'Hashtags':list(word_freq_negative.keys()),'Count':list(word_freq_negative.values())})
In [39]:
df_negative.head(10)
Out[39]:
Hashtags Count
0 cnn 10
1 michigan 2
2 tcot 14
3 australia 6
4 opkillingbay 5
5 seashepherd 22
6 helpcovedolphin 3
7 thecov 4
8 neverump 8
9 xenophobia 12

Plotting the barplot for the 20 most frequent hashtags

In [40]:
df_negative_plot = df_negative.nlargest(20,columns='Count') 
In [41]:
sns.barplot(data=df_negative_plot,y='Hashtags',x='Count')
sns.despine()

Extracting Features from cleaned Tweets

Bag-of-Words Features

Bag of Words is a method to extract features from text documents. These features can be used for training machine learning algorithms. It creates a vocabulary of all the unique words occurring in all the documents in the training set.

Consider a corpus (a collection of texts) called C of D documents {d1, d2, ..., dD} and N unique tokens extracted out of the corpus C. The N tokens (words) form the vocabulary, and the size of the bag-of-words matrix M is D X N. Each row i of the matrix M contains the frequency of each token in document d(i).

For example, if you have 2 documents-

  • D1: He is a lazy boy. She is also lazy.

  • D2: Smith is a lazy person.

First, it creates a vocabulary using unique words from all the documents

[‘He’ , ’She’ , ’lazy’ , 'boy’ , 'Smith’ , ’person’]

  • Here, D=2, N=6
  • The matrix M of size 2 X 6 will be represented as:

       He   She   lazy   boy   Smith   person
  D1    1     1      2     1       0        0
  D2    0     0      1     0       1        1

The above table depicts the training features containing the term frequency of each word in each document. This is called the bag-of-words approach, since only the number of occurrences of each word matters, not the sequence or order of the words.
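
To make this concrete, here is a minimal sketch that runs scikit-learn's CountVectorizer on the two toy documents above. Note that the default settings lowercase the text and drop one-character tokens such as "a", so the learned vocabulary will differ slightly from the hand-built one:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["He is a lazy boy. She is also lazy.",
        "Smith is a lazy person."]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(docs)

# On older scikit-learn versions, use get_feature_names() instead
print(toy_vectorizer.get_feature_names_out())  # learned vocabulary
print(toy_matrix.toarray())                    # 2 x N term-frequency matrix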

In [42]:
from sklearn.feature_extraction.text import CountVectorizer

bow_vectorizer = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')

# bag-of-words feature matrix
bow = bow_vectorizer.fit_transform(combine['Tidy_Tweets'])

df_bow = pd.DataFrame(bow.todense())

df_bow
Out[42]:
0 1 2 3 4 5 6 7 8 9 ... 990 991 992 993 994 995 996 997 998 999
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 2 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
18 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
20 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
21 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
22 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
23 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
24 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
25 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
26 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
27 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
28 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
29 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49129 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49130 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49131 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49132 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49133 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49134 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49135 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49136 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49137 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49138 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49139 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49140 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49141 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49142 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49143 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
49144 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49145 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49146 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49147 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49148 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49149 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49150 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49151 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49152 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49153 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49154 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49155 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49156 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49157 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49158 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

49159 rows × 1000 columns

TF-IDF Features

Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

Typically, the tf-idf weight is composed of two terms: the first computes the normalized Term Frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second term is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.

  • TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in a long document than in a short one. Thus, the term frequency is often divided by the document length (the total number of terms in the document) as a way of normalization: TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

  • IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing: IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

Example:

Consider a document containing 100 words in which the word cat appears 3 times. The term frequency (tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then the inverse document frequency (idf) is calculated as log(10,000,000 / 1,000) = 4 (using a base-10 logarithm; with the natural logarithm it would be about 9.21). Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
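
As a quick sanity check on this arithmetic (the numbers are the hypothetical ones from the example above, not values from our dataset):

import math

tf = 3 / 100                            # term frequency of "cat" in the document
idf = math.log10(10_000_000 / 1_000)    # base-10 logarithm, as in the example above
print(tf, idf, tf * idf)                # 0.03 4.0 0.12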

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf=TfidfVectorizer(max_df=0.90, min_df=2,max_features=1000,stop_words='english')

tfidf_matrix=tfidf.fit_transform(combine['Tidy_Tweets'])

df_tfidf = pd.DataFrame(tfidf_matrix.todense())

df_tfidf
Out[43]:
0 1 2 3 4 5 6 7 8 9 ... 990 991 992 993 994 995 996 997 998 999
0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.403826 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
11 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
13 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
15 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
16 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
17 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
18 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
19 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
20 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
21 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
22 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
23 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
24 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
25 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
26 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
27 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
28 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
29 0.0 0.0 0.0 0.0 0.0 0.342695 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49129 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49130 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49131 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49132 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49133 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49134 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49135 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49136 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49137 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49138 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49139 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49140 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49141 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49142 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49143 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.351966 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49144 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49145 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49146 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49147 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49148 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49149 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49150 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49151 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49152 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49153 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49154 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49155 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49156 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49157 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
49158 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0

49159 rows × 1000 columns

Applying Machine Learning Models

Using the features from Bag-of-Words Model for training set

In [44]:
train_bow = bow[:31962]

train_bow.todense()
Out[44]:
matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

Using features from TF-IDF for training set

In [45]:
train_tfidf_matrix = tfidf_matrix[:31962]

train_tfidf_matrix.todense()
Out[45]:
matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

Splitting the data into training and validation set

In [46]:
from sklearn.model_selection import train_test_split

Bag-of-Words Features

In [47]:
x_train_bow,x_valid_bow,y_train_bow,y_valid_bow = train_test_split(train_bow,train['label'],test_size=0.3,random_state=2)

Using TF-IDF features

In [48]:
x_train_tfidf,x_valid_tfidf,y_train_tfidf,y_valid_tfidf = train_test_split(train_tfidf_matrix,train['label'],test_size=0.3,random_state=17)

Logistic Regression

In [49]:
from sklearn.linear_model import LogisticRegression
In [50]:
Log_Reg = LogisticRegression(random_state=0,solver='lbfgs')

Using Bag-of-Words Features

In [51]:
# Fitting the Logistic Regression Model

Log_Reg.fit(x_train_bow,y_train_bow)
Out[51]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
In [52]:
# The first column gives the predicted probability of label 0 (positive sentiment)
# and the second column gives the predicted probability of label 1 (negative sentiment)
prediction_bow = Log_Reg.predict_proba(x_valid_bow)

prediction_bow
Out[52]:
array([[9.86501156e-01, 1.34988440e-02],
       [9.99599096e-01, 4.00904144e-04],
       [9.13577383e-01, 8.64226167e-02],
       ...,
       [8.95457155e-01, 1.04542845e-01],
       [9.59736065e-01, 4.02639345e-02],
       [9.67541420e-01, 3.24585797e-02]])

Calculating the F1 score

In [53]:
from sklearn.metrics import f1_score
In [54]:
# if the predicted probability of label 1 is greater than or equal to 0.3, then 1, else 0
# where 0 stands for positive-sentiment tweets and 1 for negative-sentiment tweets
prediction_int = prediction_bow[:,1]>=0.3

# np.int is deprecated in recent NumPy versions, so the built-in int is used here
prediction_int = prediction_int.astype(int)
prediction_int

# calculating f1 score
log_bow = f1_score(y_valid_bow, prediction_int)

log_bow
Out[54]:
0.5721352019785655

Using TF-IDF Features

In [55]:
Log_Reg.fit(x_train_tfidf,y_train_tfidf)
Out[55]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
In [56]:
prediction_tfidf = Log_Reg.predict_proba(x_valid_tfidf)

prediction_tfidf
Out[56]:
array([[0.98487907, 0.01512093],
       [0.97949889, 0.02050111],
       [0.9419737 , 0.0580263 ],
       ...,
       [0.98630906, 0.01369094],
       [0.96746188, 0.03253812],
       [0.99055287, 0.00944713]])

Calculating the F1 score

In [57]:
prediction_int = prediction_tfidf[:,1]>=0.3

prediction_int = prediction_int.astype(int)
prediction_int

# calculating f1 score
log_tfidf = f1_score(y_valid_tfidf, prediction_int)

log_tfidf
Out[57]:
0.5862068965517241

XGBoost

In [58]:
from xgboost import XGBClassifier

Using Bag-of-Words Features

In [59]:
model_bow = XGBClassifier(random_state=22,learning_rate=0.9)
In [60]:
model_bow.fit(x_train_bow, y_train_bow)
Out[60]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.9, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic',
       random_state=22, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1)
In [61]:
# The first column gives the predicted probability of label 0 (positive sentiment)
# and the second column gives the predicted probability of label 1 (negative sentiment)
xgb=model_bow.predict_proba(x_valid_bow)

xgb
Out[61]:
array([[0.9717447 , 0.02825526],
       [0.99767685, 0.00232312],
       [0.9436968 , 0.05630319],
       ...,
       [0.9660848 , 0.03391522],
       [0.9436968 , 0.05630319],
       [0.9436968 , 0.05630319]], dtype=float32)

Calculating the F1 score

In [62]:
# if the predicted probability of label 1 is greater than or equal to 0.3, then 1, else 0
# where 0 stands for positive-sentiment tweets and 1 for negative-sentiment tweets
xgb=xgb[:,1]>=0.3

# converting the results to integer type (np.int is deprecated, so the built-in int is used)
xgb_int=xgb.astype(int)

# calculating f1 score
xgb_bow=f1_score(y_valid_bow,xgb_int)

xgb_bow
Out[62]:
0.5712012728719172

Using TF-IDF Features

In [63]:
model_tfidf=XGBClassifier(random_state=29,learning_rate=0.7)
In [64]:
model_tfidf.fit(x_train_tfidf, y_train_tfidf)
Out[64]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.7, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic',
       random_state=29, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1)
In [65]:
# The first column gives the predicted probability of label 0 (positive sentiment)
# and the second column gives the predicted probability of label 1 (negative sentiment)
xgb_tfidf=model_tfidf.predict_proba(x_valid_tfidf)

xgb_tfidf
Out[65]:
array([[0.9905173 , 0.00948265],
       [0.9902541 , 0.00974591],
       [0.9579129 , 0.0420871 ],
       ...,
       [0.9883729 , 0.0116271 ],
       [0.9878232 , 0.0121768 ],
       [0.9807036 , 0.01929642]], dtype=float32)

Calculating the F1 score

In [66]:
# if the predicted probability of label 1 is greater than or equal to 0.3, then 1, else 0
# where 0 stands for positive-sentiment tweets and 1 for negative-sentiment tweets
xgb_tfidf=xgb_tfidf[:,1]>=0.3

# converting the results to integer type (np.int is deprecated, so the built-in int is used)
xgb_int_tfidf=xgb_tfidf.astype(int)

# calculating f1 score
score=f1_score(y_valid_tfidf,xgb_int_tfidf)

score
Out[66]:
0.5657051282051281

Decision Tree

In [67]:
from sklearn.tree import DecisionTreeClassifier
In [68]:
dct = DecisionTreeClassifier(criterion='entropy', random_state=1)

Using Bag-of-Words Features

In [69]:
dct.fit(x_train_bow,y_train_bow)
Out[69]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1,
            splitter='best')
In [70]:
dct_bow = dct.predict_proba(x_valid_bow)

dct_bow
Out[70]:
array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [1., 0.],
       [1., 0.]])
In [71]:
# if the predicted probability of label 1 is greater than or equal to 0.3, then 1, else 0
# where 0 stands for positive-sentiment tweets and 1 for negative-sentiment tweets
dct_bow=dct_bow[:,1]>=0.3

# converting the results to integer type (np.int is deprecated, so the built-in int is used)
dct_int_bow=dct_bow.astype(int)

# calculating f1 score
dct_score_bow=f1_score(y_valid_bow,dct_int_bow)

dct_score_bow
Out[71]:
0.5141776937618148

Using TF-IDF Features

In [72]:
dct.fit(x_train_tfidf,y_train_tfidf)
Out[72]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1,
            splitter='best')
In [73]:
dct_tfidf = dct.predict_proba(x_valid_tfidf)

dct_tfidf
Out[73]:
array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [1., 0.],
       [1., 0.]])

Calculating F1 Score

In [74]:
# if the predicted probability of label 1 is greater than or equal to 0.3, then 1, else 0
# where 0 stands for positive-sentiment tweets and 1 for negative-sentiment tweets
dct_tfidf=dct_tfidf[:,1]>=0.3

# converting the results to integer type (np.int is deprecated, so the built-in int is used)
dct_int_tfidf=dct_tfidf.astype(int)

# calculating f1 score
dct_score_tfidf=f1_score(y_valid_tfidf,dct_int_tfidf)

dct_score_tfidf
Out[74]:
0.5498821681068342

Model Comparison

In [75]:
Algo=['LogisticRegression(Bag-of-Words)','XGBoost(Bag-of-Words)','DecisionTree(Bag-of-Words)','LogisticRegression(TF-IDF)','XGBoost(TF-IDF)','DecisionTree(TF-IDF)']
In [76]:
score = [log_bow,xgb_bow,dct_score_bow,log_tfidf,score,dct_score_tfidf]

compare=pd.DataFrame({'Model':Algo,'F1_Score':score},index=[i for i in range(1,7)])
In [77]:
compare.T
Out[77]:
1 2 3 4 5 6
Model LogisticRegression(Bag-of-Words) XGBoost(Bag-of-Words) DecisionTree(Bag-of-Words) LogisticRegression(TF-IDF) XGBoost(TF-IDF) DecisionTree(TF-IDF)
F1_Score 0.572135 0.571201 0.514178 0.586207 0.565705 0.549882
In [78]:
plt.figure(figsize=(18,5))

sns.pointplot(x='Model',y='F1_Score',data=compare)

plt.title('Model Vs Score')
plt.xlabel('MODEL')
plt.ylabel('SCORE')

plt.show()

Using the best possible model to predict for the test data

From the comparison graph above, we can see that Logistic Regression trained using TF-IDF features gives the best performance.

In [79]:
test_tfidf = tfidf_matrix[31962:]
In [80]:
test_pred = Log_Reg.predict_proba(test_tfidf)

test_pred_int = test_pred[:,1] >= 0.3

test_pred_int = test_pred_int.astype(int)

test['label'] = test_pred_int

submission = test[['id','label']]

submission.to_csv('result.csv', index=False)

Test dataset after prediction

In [81]:
res = pd.read_csv('result.csv')
In [82]:
res
Out[82]:
id label
0 31963 0
1 31964 0
2 31965 0
3 31966 0
4 31967 0
5 31968 0
6 31969 0
7 31970 0
8 31971 0
9 31972 0
10 31973 0
11 31974 0
12 31975 0
13 31976 0
14 31977 0
15 31978 0
16 31979 0
17 31980 0
18 31981 0
19 31982 1
20 31983 0
21 31984 0
22 31985 0
23 31986 0
24 31987 0
25 31988 0
26 31989 1
27 31990 0
28 31991 0
29 31992 0
... ... ...
17167 49130 0
17168 49131 0
17169 49132 0
17170 49133 0
17171 49134 0
17172 49135 0
17173 49136 0
17174 49137 0
17175 49138 0
17176 49139 1
17177 49140 0
17178 49141 0
17179 49142 0
17180 49143 0
17181 49144 0
17182 49145 0
17183 49146 0
17184 49147 0
17185 49148 0
17186 49149 0
17187 49150 0
17188 49151 1
17189 49152 0
17190 49153 0
17191 49154 0
17192 49155 1
17193 49156 0
17194 49157 0
17195 49158 0
17196 49159 0

17197 rows × 2 columns

Summary

  • From the given dataset we were able to predict which class, i.e. positive or negative, a given tweet falls into. The data was collected from Analytics Vidhya's site.

Pre-processing

  1. Removing Twitter handles (@user)
  2. Removing punctuation, numbers and special characters
  3. Removing short words, i.e. words of length 3 or less
  4. Tokenization
  5. Stemming

Data Visualisation

  1. Wordclouds
  2. Barplots

Feature extraction techniques used to convert text into features for our Machine Learning Models

  1. Bag-of-Words
  2. TF-IDF

Machine Learning Models used

  1. Logistic Regression
  2. XGBoost
  3. Decision Trees

Evaluation Metrics

  • F1 score
In [84]:
sns.countplot(train_original['label'])
sns.despine()

Why use F1-Score instead of Accuracy ?

  • From the countplot above we see how imbalanced our dataset is: tweets with label 0 (positive sentiment) far outnumber tweets with label 1 (negative sentiment).
  • So if we kept accuracy as our evaluation metric, a model that predicts almost every tweet as positive would still score very high accuracy while missing most of the negative tweets (a small toy example follows).
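
A minimal sketch of this effect on toy arrays (hypothetical values, not our validation data): a model that always predicts the majority class reaches 90% accuracy on a 9:1 imbalanced sample, yet its F1 score for the minority class is 0.

from sklearn.metrics import accuracy_score, f1_score

y_true = [0]*9 + [1]      # 9 positive-sentiment tweets (label 0), 1 negative (label 1)
y_pred = [0]*10           # a lazy model that always predicts label 0

print(accuracy_score(y_true, y_pred))   # 0.9 -- looks good despite the model being useless
print(f1_score(y_true, y_pred))         # 0.0 -- reveals that no negative tweet was caught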

Precision & Recall :-

  • Precision is the percentage of your results which are relevant.
  • Recall is the percentage of the total relevant results that are correctly classified by your algorithm.

  • We usually face a trade-off between precision and recall: pushing precision higher tends to lower recall, and vice versa.

  • In most problems, you could give a higher priority to maximizing either precision or recall, depending on the problem you are trying to solve. But there is a simpler metric which takes both precision and recall into account, so you can aim to maximize this one number to make your model better. This metric is the F1-score, which is simply the harmonic mean of precision and recall.

F1-score = 2 * (Precision * Recall) / (Precision + Recall)
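
A minimal sketch checking this relation on toy predictions (hypothetical arrays, not our validation data):

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

p = precision_score(y_true, y_pred)    # 3 of 4 predicted positives are correct -> 0.75
r = recall_score(y_true, y_pred)       # 3 of 4 actual positives are found -> 0.75
print(2 * p * r / (p + r))             # harmonic mean computed by hand
print(f1_score(y_true, y_pred))        # the same value from scikit-learn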

  • So this metric seems much easier and more convenient to work with, as you only have to maximize one score rather than balancing two separate scores.