Data analysis on Movies Dataset

by Deepak Das

The main aim of the project is to find insights about a dataset containing information about particular movies. This project uses the movie dataset available from MovieLens, which contains just 1000 movies for analysis. We have used Python libraries such as Matplotlib, Seaborn, and Pandas for reading and visualizing the dataset.

We import the packages required for data handling and visualization

In [81]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Data description

The data is in CSV format. In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text: each line of the file is a data record, and each record consists of one or more fields separated by commas. Data are collected on 12 different attributes of each movie, with the rating on a scale from 1 (worst) to 10 (best) and the metascore on a scale from 1 (worst) to 100 (best).
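
For illustration, a simplified and purely hypothetical excerpt of such a file, using four of the fields, could look like this (note the Genre field is quoted because it contains commas of its own):

Rank,Title,Genre,Year
1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",2014
2,Prometheus,"Adventure,Mystery,Sci-Fi",2012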

In [82]:
df = pd.read_csv('Documents/movie.csv')

Attributes

  • Rank
  • Title
  • Genre
  • Description
  • Director
  • Actors
  • Year
  • Runtime
  • Rating
  • Votes
  • Revenue
  • Metascore

Here we call the head function to print the first 5 rows of the data

In [28]:
df.head()
Out[28]:
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76.0
1 2 Prometheus Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0
2 3 Split Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62.0
3 4 Sing Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 270.32 59.0
4 5 Suicide Squad Action,Adventure,Fantasy A secret government agency recruits some of th... David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123 6.2 393727 325.02 40.0

Here we see the total size of the dataset (the number of cells, i.e. rows × columns)

In [29]:
df.size
Out[29]:
12000

Here we see the shape of the dataset

In [30]:
df.shape
Out[30]:
(1000, 12)

Here we see the summary statistics of the data

In [31]:
df.describe(include = 'all')
Out[31]:
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
count 1000.000000 1000 1000 1000 1000 1000 1000.000000 1000.000000 1000.000000 1.000000e+03 872.000000 936.000000
unique NaN 999 207 1000 644 996 NaN NaN NaN NaN NaN NaN
top NaN The Host Action,Adventure,Sci-Fi Col. Katherine Powell, a military officer in c... Ridley Scott Shia LaBeouf, Megan Fox, Josh Duhamel, Tyrese ... NaN NaN NaN NaN NaN NaN
freq NaN 2 50 1 8 2 NaN NaN NaN NaN NaN NaN
mean 500.500000 NaN NaN NaN NaN NaN 2012.783000 113.172000 6.723200 1.698083e+05 82.956376 58.985043
std 288.819436 NaN NaN NaN NaN NaN 3.205962 18.810908 0.945429 1.887626e+05 103.253540 17.194757
min 1.000000 NaN NaN NaN NaN NaN 2006.000000 66.000000 1.900000 6.100000e+01 0.000000 11.000000
25% 250.750000 NaN NaN NaN NaN NaN 2010.000000 100.000000 6.200000 3.630900e+04 13.270000 47.000000
50% 500.500000 NaN NaN NaN NaN NaN 2014.000000 111.000000 6.800000 1.107990e+05 47.985000 59.500000
75% 750.250000 NaN NaN NaN NaN NaN 2016.000000 123.000000 7.400000 2.399098e+05 113.715000 72.000000
max 1000.000000 NaN NaN NaN NaN NaN 2016.000000 191.000000 9.000000 1.791916e+06 936.630000 100.000000

We check for the missing values in the dataset

We find that there are missing values present in the dataset, so we will handle them by imputing the column means

In [32]:
df.isnull().sum()
Out[32]:
Rank                    0
Title                   0
Genre                   0
Description             0
Director                0
Actors                  0
Year                    0
Runtime (Minutes)       0
Rating                  0
Votes                   0
Revenue (Millions)    128
Metascore              64
dtype: int64

We first compute the column means, which we will use to fill the null values in the dataset

In [33]:
df.mean()
Out[33]:
Rank                     500.500000
Year                    2012.783000
Runtime (Minutes)        113.172000
Rating                     6.723200
Votes                 169808.255000
Revenue (Millions)        82.956376
Metascore                 58.985043
dtype: float64

Filling the null values of the Revenue column with the mean value

In [293]:
df['Revenue (Millions)'] = df['Revenue (Millions)'].fillna(df['Revenue (Millions)'].mean())

Filling the null values of the Metascore column with the mean value

In [234]:
df['Metascore'] = df['Metascore'].fillna(df['Metascore'].mean())
In [235]:
df.isnull().sum()
Out[235]:
Rank                  0
Title                 0
Genre                 0
Description           0
Director              0
Actors                0
Year                  0
Runtime (Minutes)     0
Rating                0
Votes                 0
Revenue (Millions)    0
Metascore             0
dtype: int64

We drop the columns that are not required for the analysis or for the recommendation engine

In [236]:
df1 = df.drop(columns = ['Description' ,'Director','Actors'])
df1.head()
Out[236]:
Rank Title Genre Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi 2014 121 8.1 757074 333.13 76.0
1 2 Prometheus Adventure,Mystery,Sci-Fi 2012 124 7.0 485820 126.46 65.0
2 3 Split Horror,Thriller 2016 117 7.3 157606 138.12 62.0
3 4 Sing Animation,Comedy,Family 2016 108 7.2 60545 270.32 59.0
4 5 Suicide Squad Action,Adventure,Fantasy 2016 123 6.2 393727 325.02 40.0

Exploratory Data Analysis

Univariate Analysis

Univariate analysis is the simplest form of analyzing data. “Uni” means “one”; in other words, the data has only one variable. It doesn't deal with causes or relationships (unlike regression), and its major purpose is to describe: it takes the data, summarizes it, and finds patterns in it. The key aims of our univariate analysis are to find the outliers present in the data (see the sketch below) and to understand the distribution of each variable, which further helps us with the bivariate/multivariate analysis.
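
As a concrete complement to the boxplots below, outliers can also be counted with the common 1.5 × IQR rule of thumb; a minimal sketch for the Rating column (an illustrative check, not part of the original analysis):

q1, q3 = df1['Rating'].quantile([0.25, 0.75])
iqr = q3 - q1
# Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers
outliers = df1[(df1['Rating'] < q1 - 1.5 * iqr) | (df1['Rating'] > q3 + 1.5 * iqr)]
print(len(outliers))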

Rating

In [193]:
x = df1['Rating']
sns.distplot(x)
sns.despine()
In [182]:
sns.boxplot(df1['Rating'])
sns.despine()

Inferences :-

  • We observe that the plot is left-skewed
  • There are almost no outliers present in the plot
  • The ratings given are mostly between 6 and 8 out of 10
  • So the users have been generous with their ratings

Runtime (Minutes)

In [186]:
x = df1['Runtime (Minutes)']
sns.distplot(x)
sns.despine()
In [190]:
sns.boxplot(df1['Runtime (Minutes)'])
sns.despine()

Inferences :-

  • The plot is slightly right-skewed
  • Here too there are only a few outliers
  • The average runtime of movies is somewhere between 100 and 120 minutes
  • Movies with a runtime of more than 140 minutes are relatively rare

Metascore

In [165]:
x = df1['Metascore']
sns.distplot(x)
In [189]:
sns.boxplot(df1['Metascore'])
sns.despine()

Inferences :-

  • The plot roughly follows a normal distribution
  • The outliers in the plot are almost negligible
  • The average Metascore here is about 60

Bivariate Analysis

Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the concept of relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences.
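
The strength of a single pairwise linear association can also be quantified directly with Pearson's r; for example, between Rating and Metascore (a quick illustrative check, using pandas' default Pearson correlation):

# Pearson correlation coefficient between Rating and Metascore
print(df1['Rating'].corr(df1['Metascore']))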

Correlation

In [232]:
plt.figure(figsize =(15,15))
sns.heatmap(df.corr(),annot=True)
plt.show()

Year vs Rating

In [123]:
sns.regplot(x = 'Year',y = 'Rating',data = df1 , x_jitter=0.2, scatter_kws={'alpha':0.1})
sns.despine()
In [213]:
sns.jointplot(x='Year', y='Rating', data=df1, kind="kde")
sns.despine()

Inferences :-

  • We observe a slightly decreasing trend in ratings over the years
  • We have the most data for the year 2016
  • One possible reading is that viewers have become more critical over the years, which would explain the decline in ratings

Year vs Metascore

In [209]:
sns.regplot(x = 'Year',y = 'Metascore',data = df1 , x_jitter=0.2, scatter_kws={'alpha':0.1})
sns.despine()
In [230]:
sns.jointplot(x='Year', y='Metascore', data=df1, kind="kde")
sns.despine()

Inferences :-

  • Here too we can see a slight decreasing trend
  • Again, the year 2016 contributes the most data
  • Critics' scores have declined only slightly over the years, so they appear to have stayed fairly consistent

Year vs Runtime (Minutes)

In [215]:
sns.regplot(x = 'Year',y = 'Runtime (Minutes)',data = df1 , x_jitter=0.2, scatter_kws={'alpha':0.1})
sns.despine()
In [214]:
sns.jointplot(x="Year", y='Runtime (Minutes)', data=df1, kind="kde")
sns.despine()

Inferences :-

  • We observe a slight decreasing trend in the plot
  • The average runtime has stayed around 110-120 minutes over the years

Multivariate Analysis

Multivariate analysis (MVA) is based on the statistical principle of multivariate statistics, which involves observation and analysis of more than one statistical outcome variable at a time.

Inferences from Runtime vs Rating in terms of Year

  • From the plots below we can say that the dataset contains the most movies from 2016, and over the years the runtime of movies has stayed around 100-125 minutes
In [219]:
grid = sns.FacetGrid(df1, col='Year',col_wrap = 4)
grid.map(plt.scatter,'Runtime (Minutes)','Rating',alpha = 0.5)
sns.despine()

WordCloud

What are Word Clouds?

Word clouds (also known as text clouds or tag clouds) work in a simple way: the more a specific word appears in a source of textual data (such as a speech, blog post, or database), the bigger and bolder it appears in the word cloud.

We get to see the most common words used in movie titles in the dataset

In [221]:
import wordcloud
from wordcloud import WordCloud, STOPWORDS

# Create a wordcloud of the movie titles
df1['Title'] = df1['Title'].fillna("").astype('str')
title_corpus = ' '.join(df1['Title'])
title_wordcloud = WordCloud(stopwords=STOPWORDS,background_color='black', height=1500, width=4000).generate(title_corpus)

# Plot the wordcloud
plt.figure(figsize=(16,8))
plt.imshow(title_wordcloud)
plt.axis('off')
plt.show()

We get to see the most common genres in the movie dataset

In [291]:
# Create a wordcloud of the movie Genres
df1['Genre'] = df1['Genre'].fillna("").astype('str')
title_corpus = ' '.join(df1['Genre'])
title_wordcloud = WordCloud(stopwords=STOPWORDS,background_color='black', height=1500, width=4000).generate(title_corpus)

# Plot the wordcloud
plt.figure(figsize=(16,8))
plt.imshow(title_wordcloud)
plt.axis('off')
plt.show()

Recommendation Systems

Content Based Recommendation

In a content-based recommender, each item is represented by features derived from its own attributes (here, a movie's Genre string), and similar items are found by comparing those feature vectors.

In [279]:
from sklearn.feature_extraction.text import TfidfVectorizer

TF-IDF (term frequency-inverse document frequency) weights each genre term by how often it occurs in a movie's genre list, discounted by how common that term is across all movies, so rarer genres contribute more to similarity.

In [280]:
tf = TfidfVectorizer()
tfidf_matrix = tf.fit_transform(df1['Genre'])
In [281]:
print(tfidf_matrix.todense().shape)
(1000, 21)
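
To see which genre terms the 21 columns correspond to, we can list the vectorizer's learned vocabulary (a small sketch; in scikit-learn 1.0+ the method is get_feature_names_out(), while older versions expose it as get_feature_names()):

# The 21 columns are the individual genre terms learned from the Genre strings
print(tf.get_feature_names_out())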
In [295]:
a = pd.DataFrame(tfidf_matrix.todense())
a
Out[295]:
0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 20
0 0.402181 0.430870 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.571227 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.571227 0.000000 0.000000 0.0 0.0
1 0.000000 0.394839 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.523459 ... 0.000000 0.000000 0.000000 0.544136 0.000000 0.523459 0.000000 0.000000 0.0 0.0
2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.764645 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.644452 0.0 0.0
3 0.000000 0.000000 0.658782 0.000000 0.374818 0.000000 0.000000 0.652317 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
4 0.477136 0.511172 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.714874 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
5 0.477136 0.511172 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.714874 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
6 0.000000 0.000000 0.000000 0.000000 0.391659 0.000000 0.287037 0.000000 0.000000 0.000000 ... 0.000000 0.874193 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
7 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
8 0.461223 0.494125 0.000000 0.736963 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
9 0.000000 0.569315 0.000000 0.000000 0.000000 0.000000 0.404068 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.715968 0.000000 0.000000 0.000000 0.0 0.0
10 0.000000 0.415354 0.000000 0.000000 0.000000 0.000000 0.000000 0.700049 0.580872 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
11 0.000000 0.000000 0.000000 0.588934 0.000000 0.000000 0.280259 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
12 0.402181 0.430870 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.571227 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.571227 0.000000 0.000000 0.0 0.0
13 0.000000 0.454774 0.774086 0.000000 0.440421 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
14 0.613760 0.000000 0.000000 0.000000 0.636790 0.000000 0.466687 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
15 0.000000 0.454774 0.774086 0.000000 0.440421 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
16 0.000000 0.000000 0.000000 0.588934 0.000000 0.000000 0.280259 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
17 0.640103 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.768289 0.0 0.0
18 0.000000 0.000000 0.000000 0.902971 0.000000 0.000000 0.429701 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
19 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.291748 0.000000 0.000000 0.544964 ... 0.000000 0.000000 0.000000 0.566490 0.000000 0.544964 0.000000 0.000000 0.0 0.0
20 0.000000 0.602049 0.000000 0.000000 0.000000 0.000000 0.427301 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.674500 0.0 0.0
21 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
22 0.000000 0.000000 0.000000 0.000000 0.000000 0.632779 0.364708 0.000000 0.000000 0.000000 ... 0.683066 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
23 0.000000 0.454774 0.774086 0.000000 0.440421 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
24 0.402181 0.430870 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.571227 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.571227 0.000000 0.000000 0.0 0.0
25 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
26 0.605680 0.648886 0.000000 0.000000 0.000000 0.000000 0.460543 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
27 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.764645 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.644452 0.0 0.0
28 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
29 0.605680 0.648886 0.000000 0.000000 0.000000 0.000000 0.460543 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
970 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.764645 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.644452 0.0 0.0
971 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.371085 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.720541 0.000000 0.000000 0.000000 0.585762 0.0 0.0
972 0.000000 0.000000 0.000000 0.000000 0.337613 0.000000 0.247428 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.908183 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
973 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.694240 0.000000 0.000000 0.719744 0.000000 0.000000 0.000000 0.000000 0.0 0.0
974 0.000000 0.000000 0.000000 0.555907 0.000000 0.000000 0.264542 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.788026 0.000000 0.0 0.0
975 0.000000 0.000000 0.000000 0.000000 0.418298 0.000000 0.000000 0.727988 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.543194 0.000000 0.000000 0.000000 0.0 0.0
976 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.371085 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.720541 0.000000 0.000000 0.000000 0.585762 0.0 0.0
977 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
978 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
979 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.535157 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.844752 0.0 0.0
980 0.000000 0.000000 0.000000 0.632014 0.000000 0.000000 0.300760 0.714214 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
981 0.000000 0.000000 0.000000 0.000000 0.467988 0.000000 0.342976 0.814466 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
982 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.233411 0.000000 0.459920 0.000000 ... 0.000000 0.000000 0.856734 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
983 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
984 0.000000 0.510266 0.000000 0.000000 0.000000 0.000000 0.000000 0.860017 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
985 0.000000 0.506781 0.000000 0.000000 0.490786 0.000000 0.000000 0.000000 0.708733 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
986 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.764645 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.644452 0.0 0.0
987 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.491495 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.870880 0.000000 0.000000 0.000000 0.0 0.0
988 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
989 0.000000 0.000000 0.000000 0.588934 0.000000 0.000000 0.280259 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
990 0.477136 0.511172 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.714874 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
991 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.250669 0.595264 0.000000 0.000000 ... 0.000000 0.763431 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
992 0.000000 0.000000 0.000000 0.000000 0.556983 0.000000 0.408199 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.723287 0.000000 0.000000 0.000000 0.0 0.0
993 0.489359 0.524267 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.696902 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
994 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
995 0.000000 0.000000 0.000000 0.000000 0.000000 0.622014 0.358504 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.696113 0.000000 0.000000 0.000000 0.000000 0.0 0.0
996 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
997 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.273025 0.000000 0.000000 0.000000 ... 0.000000 0.831517 0.000000 0.000000 0.483773 0.000000 0.000000 0.000000 0.0 0.0
998 0.000000 0.718352 0.000000 0.000000 0.695680 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0
999 0.000000 0.000000 0.000000 0.000000 0.404418 0.000000 0.000000 0.703831 0.584010 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0

1000 rows × 21 columns

In [296]:
from sklearn.metrics.pairwise import cosine_similarity

Cosine similarity measures the angle between two vectors: a score of 1 means two movies have identical genre profiles, while 0 means they share no genre terms at all.

Building a matrix with cosine similarity scores

In [297]:
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)
[[1.         0.76815261 0.         ... 0.         0.30951647 0.        ]
 [0.76815261 1.         0.         ... 0.         0.28363342 0.        ]
 [0.         0.         1.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.30951647 0.28363342 0.         ... 0.         1.         0.28134523]
 [0.         0.         0.         ... 0.         0.28134523 1.        ]]
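
As a quick sanity check, any single entry of this matrix can be reproduced by hand from the TF-IDF rows; a minimal sketch with NumPy (the result should match cosine_sim[0, 1] printed above, roughly 0.768):

import numpy as np

# Cosine similarity between movies 0 and 1, computed manually
v0 = tfidf_matrix[0].toarray().ravel()
v1 = tfidf_matrix[1].toarray().ravel()
print(np.dot(v0, v1) / (np.linalg.norm(v0) * np.linalg.norm(v1)))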

Build a 1-dimensional array of movie titles and a reverse lookup from title to row index

In [298]:
titles = df1['Title']
indices = pd.Series(df1.index, index=df1['Title'])
print(indices.head(10))
Title
Guardians of the Galaxy    0
Prometheus                 1
Split                      2
Sing                       3
Suicide Squad              4
The Great Wall             5
La La Land                 6
Mindhorn                   7
The Lost City of Z         8
Passengers                 9
dtype: int64

Function that gets movie recommendations based on the cosine similarity scores of movie genres

In [299]:
def genre_recommendations(title):
    sim_scores = list(enumerate(cosine_sim[indices[title]]))
    
    # Sorting the recommendation list in descending order of cosine similarity
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keeping the top 29 matches (entry 0 is the queried movie itself)
    sim_scores = sim_scores[1:30]
    
    # Indexing with Integer Location and returning the values
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

Recommendations based on the cosine similarity, sorted according to similarity scores

In [300]:
genre_recommendations('Guardians of the Galaxy')
Out[300]:
12                               Rogue One
24            Independence Day: Resurgence
32                       X-Men: Apocalypse
35              Captain America: Civil War
48                        Star Trek Beyond
60      Batman v Superman: Dawn of Justice
67                      Mad Max: Fury Road
80                               Inception
85                          Jurassic World
94                 Avengers: Age of Ultron
126        Transformers: Age of Extinction
140                              Star Trek
156                            Pacific Rim
162             X-Men: Days of Future Past
195     Captain America: The First Avenger
200                       Edge of Tomorrow
203                               Iron Man
205                         X: First Class
212                           Transformers
216    Captain America: The Winter Soldier
220                         Hardcore Henry
227                              Predators
243                     Terminator Genisys
253               The Amazing Spider-Man 2
256                             Battleship
268               X-Men Origins: Wolverine
279                         Iron Man Three
287                      Jupiter Ascending
316                           The 5th Wave
Name: Title, dtype: object

Summary

From the given dataset we built a content-based (personalized) recommendation system. First we had to clean the dataset, as there were some missing values, and then we dropped the columns which didn't seem necessary for our analysis. After that we did some exploratory data analysis on the movie dataset and noted trends and observations from the plots.

Then we used the concept of a TF-IDF vector to convert the genre text into matrix form and computed the cosine similarity so that we could find similar movies in the dataset. In the end we were successful in building a very small version of a personalized recommendation system.