Cyberbullying and hate speech have been a menace for quite a long time, so our objective for this task is to detect tweets associated with negative sentiment. In this dataset a tweet is labelled as hate speech if it contains racist or sexist content.
So our task here is to classify racist and sexist tweets apart from the other tweets and filter them out.
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import nltk
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
%matplotlib inline
train = pd.read_csv('https://raw.githubusercontent.com/dD2405/Twitter_Sentiment_Analysis/master/train.csv')
train_original=train.copy()
train.shape
train_original
test = pd.read_csv('https://raw.githubusercontent.com/dD2405/Twitter_Sentiment_Analysis/master/test.csv')
test_original=test.copy()
test.shape
test_original
# DataFrame.append has been removed in recent pandas releases, so we use pd.concat instead
combine = pd.concat([train, test], ignore_index=True, sort=True)
combine.head()
combine.tail()
Given below is a user-defined function to remove unwanted text patterns from the tweets. It takes two arguments: the original string of text and the pattern of text that we want to remove from it. The function returns the input string without the given pattern. We will use this function to remove the '@user' handles from all the tweets in our data.
def remove_pattern(text, pattern):
    # re.findall() finds every match of the pattern (e.g. @user) and returns them as a list
    r = re.findall(pattern, text)
    # re.sub() then removes each matched handle from the tweet
    for i in r:
        text = re.sub(i, "", text)
    return text
combine['Tidy_Tweets'] = np.vectorize(remove_pattern)(combine['tweet'], r"@[\w]*")
combine.head()
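As a quick sanity check (not part of the original notebook), we can apply remove_pattern to a made-up tweet; the sample text below is purely illustrative:
# hypothetical tweet, only used to illustrate the pattern removal
sample = "@user thanks for the #followfriday shout-out @another_user"
# prints the tweet with both handles stripped out while the hashtag is kept
print(remove_pattern(sample, r"@[\w]*"))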
Punctuation, numbers and special characters do not help much. It is better to remove them from the text, just as we removed the Twitter handles. Here we will replace everything except letters and hashtags with spaces.
combine['Tidy_Tweets'] = combine['Tidy_Tweets'].str.replace(r"[^a-zA-Z#]", " ", regex=True)
combine.head(10)
We have to be a little careful here in selecting the length of the words we want to remove, so I have decided to remove all words of length 3 or less. For example, terms like "hmm" and "oh" are of very little use; it is better to get rid of them.
combine['Tidy_Tweets'] = combine['Tidy_Tweets'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))
combine.head(10)
Now we will tokenize all the cleaned tweets in our dataset. Tokens are individual terms or words, and tokenization is the process of splitting a string of text into tokens.
tokenized_tweet = combine['Tidy_Tweets'].apply(lambda x: x.split())
tokenized_tweet.head()
Stemming is a rule-based process of stripping suffixes ("ing", "ly", "es", "s", etc.) from a word. For example, "play", "player", "played", "plays" and "playing" are different variations of the word "play".
from nltk import PorterStemmer
ps = PorterStemmer()
tokenized_tweet = tokenized_tweet.apply(lambda x: [ps.stem(i) for i in x])
tokenized_tweet.head()
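To make the stemming step concrete, here is a small standalone check (not part of the original pipeline) that prints the Porter stem of each variant of "play" mentioned above:
from nltk.stem import PorterStemmer
ps_demo = PorterStemmer()
# print the stem produced for each surface form; note that rule-based stemming
# does not necessarily map every variant to exactly the same root
for word in ["play", "player", "played", "plays", "playing"]:
    print(word, "->", ps_demo.stem(word))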
for i in range(len(tokenized_tweet)):
    tokenized_tweet[i] = ' '.join(tokenized_tweet[i])
combine['Tidy_Tweets'] = tokenized_tweet
combine.head()
from wordcloud import WordCloud,ImageColorGenerator
from PIL import Image
import urllib
import requests
all_words_positive = ' '.join(text for text in combine['Tidy_Tweets'][combine['label']==0])
# combining the image with the dataset
Mask = np.array(Image.open(requests.get('http://clipart-library.com/image_gallery2/Twitter-PNG-Image.png', stream=True).raw))
# ImageColorGenerator from the wordcloud package extracts the colours of the image
# so that we can impose them over our word cloud
image_colors = ImageColorGenerator(Mask)
# Now we use the WordCloud function from the wordcloud library
wc = WordCloud(background_color='black', height=1500, width=4000,mask=Mask).generate(all_words_positive)
# Size of the image generated
plt.figure(figsize=(10,20))
# Here we recolor the words from the dataset to the image's color
# recolor just recolors the default colors to the image's blue color
# interpolation is used to smooth the image generated
plt.imshow(wc.recolor(color_func=image_colors),interpolation="hamming")
plt.axis('off')
plt.show()
all_words_negative = ' '.join(text for text in combine['Tidy_Tweets'][combine['label']==1])
# combining the image with the dataset
Mask = np.array(Image.open(requests.get('http://clipart-library.com/image_gallery2/Twitter-PNG-Image.png', stream=True).raw))
# ImageColorGenerator from the wordcloud package extracts the colours of the image
# so that we can impose them over our word cloud
image_colors = ImageColorGenerator(Mask)
# Now we use the WordCloud function from the wordcloud library
wc = WordCloud(background_color='black', height=1500, width=4000,mask=Mask).generate(all_words_negative)
# Size of the image generated
plt.figure(figsize=(10,20))
# Here we recolor the words from the dataset to the image's color
# recolor just recolors the default colors to the image's blue color
# interpolation is used to smooth the image generated
plt.imshow(wc.recolor(color_func=image_colors),interpolation="gaussian")
plt.axis('off')
plt.show()
def Hashtags_Extract(x):
    hashtags = []
    # Loop over the tweets passed in
    for i in x:
        # Collect every word that follows a '#'
        ht = re.findall(r'#(\w+)', i)
        hashtags.append(ht)
    return hashtags
ht_positive = Hashtags_Extract(combine['Tidy_Tweets'][combine['label']==0])
ht_positive_unnest = sum(ht_positive,[])
ht_negative = Hashtags_Extract(combine['Tidy_Tweets'][combine['label']==1])
ht_negative_unnest = sum(ht_negative,[])
word_freq_positive = nltk.FreqDist(ht_positive_unnest)
word_freq_positive
df_positive = pd.DataFrame({'Hashtags':list(word_freq_positive.keys()),'Count':list(word_freq_positive.values())})
df_positive.head(10)
df_positive_plot = df_positive.nlargest(20,columns='Count')
sns.barplot(data=df_positive_plot,y='Hashtags',x='Count')
sns.despine()
word_freq_negative = nltk.FreqDist(ht_negative_unnest)
word_freq_negative
df_negative = pd.DataFrame({'Hashtags':list(word_freq_negative.keys()),'Count':list(word_freq_negative.values())})
df_negative.head(10)
df_negative_plot = df_negative.nlargest(20,columns='Count')
sns.barplot(data=df_negative_plot,y='Hashtags',x='Count')
sns.despine()
Bag of Words is a method to extract features from text documents. These features can be used for training machine learning algorithms. It creates a vocabulary of all the unique words occurring in all the documents in the training set.
Consider a corpus (a collection of texts) called C of D documents {d1, d2, ..., dD} and N unique tokens extracted out of the corpus C. The N tokens (words) form the vocabulary, and the size of the bag-of-words matrix M is D × N. Each row of the matrix M contains the frequency of each token in document d(i).
For example, if you have 2 documents-
D1: He is a lazy boy. She is also lazy.
D2: Smith is a lazy person.
First, it creates a vocabulary using the unique words from all the documents:

['He', 'is', 'a', 'lazy', 'boy', 'She', 'also', 'Smith', 'person']

Then each document is represented by the term frequency of every word in that vocabulary:

| | He | is | a | lazy | boy | She | also | Smith | person |
|---|---|---|---|---|---|---|---|---|---|
| D1 | 1 | 2 | 1 | 2 | 1 | 1 | 1 | 0 | 0 |
| D2 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |

The table above depicts the training features containing the term frequency of each word in each document. This is called the bag-of-words approach, since only the number of occurrences of words matters, not their sequence or order.
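A minimal sketch of this idea, applying scikit-learn's CountVectorizer to the two example documents above (the variable names here are my own, and get_feature_names_out assumes a reasonably recent scikit-learn):
from sklearn.feature_extraction.text import CountVectorizer
docs = ["He is a lazy boy. She is also lazy.",
        "Smith is a lazy person."]
# keep single-character tokens such as "a"; CountVectorizer lowercases by default
demo_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
demo_matrix = demo_vectorizer.fit_transform(docs)
print(demo_vectorizer.get_feature_names_out())  # the vocabulary (N unique tokens)
print(demo_matrix.toarray())                    # the D x N term-frequency matrix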
from sklearn.feature_extraction.text import CountVectorizer
bow_vectorizer = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')
# bag-of-words feature matrix
bow = bow_vectorizer.fit_transform(combine['Tidy_Tweets'])
df_bow = pd.DataFrame(bow.todense())
df_bow
Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
Typically, the tf-idf weight is composed by two terms: the first computes the normalized Term Frequency (TF), aka. the number of times a word appears in a document, divided by the total number of words in that document; the second term is the Inverse Document Frequency (IDF), computed as the logarithm of the number of the documents in the corpus divided by the number of documents where the specific term appears.
TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in a long document than in a short one, so the term frequency is often divided by the document length (i.e., the total number of terms in the document) as a way of normalisation:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important, but certain terms, such as "is", "of", and "that", may appear many times yet carry little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing:

IDF(t) = log(Total number of documents / Number of documents with term t in it)
Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then the inverse document frequency (i.e., idf) is calculated as log10(10,000,000 / 1,000) = 4 (using a base-10 logarithm). Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
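The arithmetic in this example can be verified in a couple of lines (a standalone check, not part of the pipeline; note the example uses a base-10 logarithm):
import math
tf = 3 / 100                           # 0.03
idf = math.log10(10_000_000 / 1_000)   # 4.0
print(tf * idf)                        # 0.12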
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer(max_df=0.90, min_df=2,max_features=1000,stop_words='english')
tfidf_matrix=tfidf.fit_transform(combine['Tidy_Tweets'])
df_tfidf = pd.DataFrame(tfidf_matrix.todense())
df_tfidf
train_bow = bow[:31962]
train_bow.todense()
train_tfidf_matrix = tfidf_matrix[:31962]
train_tfidf_matrix.todense()
from sklearn.model_selection import train_test_split
x_train_bow,x_valid_bow,y_train_bow,y_valid_bow = train_test_split(train_bow,train['label'],test_size=0.3,random_state=2)
x_train_tfidf,x_valid_tfidf,y_train_tfidf,y_valid_tfidf = train_test_split(train_tfidf_matrix,train['label'],test_size=0.3,random_state=17)
from sklearn.linear_model import LogisticRegression
Log_Reg = LogisticRegression(random_state=0,solver='lbfgs')
# Fitting the Logistic Regression Model
Log_Reg.fit(x_train_bow,y_train_bow)
# The first column of the returned array holds the predicted probabilities for label 0,
# and the second column holds the predicted probabilities for label 1
prediction_bow = Log_Reg.predict_proba(x_valid_bow)
prediction_bow
from sklearn.metrics import f1_score
# if the predicted probability for label 1 is greater than or equal to 0.3 then 1, else 0
# where 0 is for positive sentiment tweets and 1 for negative sentiment tweets
prediction_int = prediction_bow[:,1]>=0.3
# np.int has been removed in recent NumPy releases, so we use the built-in int
prediction_int = prediction_int.astype(int)
prediction_int
# calculating f1 score
log_bow = f1_score(y_valid_bow, prediction_int)
log_bow
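The 0.3 cut-off above is a manual choice; an optional sketch like the one below (reusing prediction_bow and y_valid_bow from the cells above) could be used to compare the F1 score at a few candidate thresholds:
# optional: compare F1 at several probability thresholds
for threshold in [0.2, 0.3, 0.4, 0.5]:
    preds = (prediction_bow[:, 1] >= threshold).astype(int)
    print(threshold, f1_score(y_valid_bow, preds))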
Log_Reg.fit(x_train_tfidf,y_train_tfidf)
prediction_tfidf = Log_Reg.predict_proba(x_valid_tfidf)
prediction_tfidf
prediction_int = prediction_tfidf[:,1]>=0.3
prediction_int = prediction_int.astype(int)
prediction_int
# calculating f1 score
log_tfidf = f1_score(y_valid_tfidf, prediction_int)
log_tfidf
from xgboost import XGBClassifier
model_bow = XGBClassifier(random_state=22,learning_rate=0.9)
model_bow.fit(x_train_bow, y_train_bow)
# The first column of the returned array holds the predicted probabilities for label 0,
# and the second column holds the predicted probabilities for label 1
xgb=model_bow.predict_proba(x_valid_bow)
xgb
# if the predicted probability for label 1 is greater than or equal to 0.3 then 1, else 0
# where 0 is for positive sentiment tweets and 1 for negative sentiment tweets
xgb=xgb[:,1]>=0.3
# converting the boolean results to integer type
xgb_int=xgb.astype(int)
# calculating f1 score
xgb_bow=f1_score(y_valid_bow,xgb_int)
xgb_bow
model_tfidf=XGBClassifier(random_state=29,learning_rate=0.7)
model_tfidf.fit(x_train_tfidf, y_train_tfidf)
# The first column of the returned array holds the predicted probabilities for label 0,
# and the second column holds the predicted probabilities for label 1
xgb_tfidf=model_tfidf.predict_proba(x_valid_tfidf)
xgb_tfidf
# if the predicted probability for label 1 is greater than or equal to 0.3 then 1, else 0
# where 0 is for positive sentiment tweets and 1 for negative sentiment tweets
xgb_tfidf=xgb_tfidf[:,1]>=0.3
# converting the boolean results to integer type
xgb_int_tfidf=xgb_tfidf.astype(int)
# calculating f1 score
score=f1_score(y_valid_tfidf,xgb_int_tfidf)
score
from sklearn.tree import DecisionTreeClassifier
dct = DecisionTreeClassifier(criterion='entropy', random_state=1)
dct.fit(x_train_bow,y_train_bow)
dct_bow = dct.predict_proba(x_valid_bow)
dct_bow
# if the predicted probability for label 1 is greater than or equal to 0.3 then 1, else 0
# where 0 is for positive sentiment tweets and 1 for negative sentiment tweets
dct_bow=dct_bow[:,1]>=0.3
# converting the boolean results to integer type
dct_int_bow=dct_bow.astype(int)
# calculating f1 score
dct_score_bow=f1_score(y_valid_bow,dct_int_bow)
dct_score_bow
dct.fit(x_train_tfidf,y_train_tfidf)
dct_tfidf = dct.predict_proba(x_valid_tfidf)
dct_tfidf
# if the predicted probability for label 1 is greater than or equal to 0.3 then 1, else 0
# where 0 is for positive sentiment tweets and 1 for negative sentiment tweets
dct_tfidf=dct_tfidf[:,1]>=0.3
# converting the boolean results to integer type
dct_int_tfidf=dct_tfidf.astype(int)
# calculating f1 score
dct_score_tfidf=f1_score(y_valid_tfidf,dct_int_tfidf)
dct_score_tfidf
Algo=['LogisticRegression(Bag-of-Words)','XGBoost(Bag-of-Words)','DecisionTree(Bag-of-Words)','LogisticRegression(TF-IDF)','XGBoost(TF-IDF)','DecisionTree(TF-IDF)']
score = [log_bow,xgb_bow,dct_score_bow,log_tfidf,score,dct_score_tfidf]
compare=pd.DataFrame({'Model':Algo,'F1_Score':score},index=[i for i in range(1,7)])
compare.T
plt.figure(figsize=(18,5))
sns.pointplot(x='Model',y='F1_Score',data=compare)
plt.title('Model Vs Score')
plt.xlabel('MODEL')
plt.ylabel('SCORE')
plt.show()
test_tfidf = tfidf_matrix[31962:]
test_pred = Log_Reg.predict_proba(test_tfidf)
test_pred_int = test_pred[:,1] >= 0.3
test_pred_int = test_pred_int.astype(int)
test['label'] = test_pred_int
submission = test[['id','label']]
submission.to_csv('result.csv', index=False)
res = pd.read_csv('result.csv')
res
sns.countplot(x=train_original['label'])
sns.despine()
Recall is the percentage of the truly relevant results (here, the negative tweets) that the algorithm correctly identifies, while precision is the percentage of the results flagged by the algorithm that are actually relevant.
We usually face a trade-off between precision and recall: raising the decision threshold tends to increase precision but lower recall, and vice versa.
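As a hedged sketch of how this trade-off could be inspected on the validation set used above (assuming prediction_tfidf and y_valid_tfidf from the earlier cells are still in memory):
from sklearn.metrics import precision_score, recall_score
# precision and recall at a few probability thresholds: a higher threshold
# generally raises precision and lowers recall, and vice versa
for threshold in [0.2, 0.3, 0.5, 0.7]:
    preds = (prediction_tfidf[:, 1] >= threshold).astype(int)
    print(threshold,
          precision_score(y_valid_tfidf, preds),
          recall_score(y_valid_tfidf, preds))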