The main aim of this project is to draw insights from a dataset containing information about movies. It uses the movie dataset available from MovieLens, which contains just 1,000 movies. We use the Python libraries Pandas, Matplotlib, and Seaborn to read and visualize the data.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
The data is in CSV format. A comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record, and each record consists of one or more fields separated by commas. The dataset records 12 attributes for each movie; Rating runs from 1 (worst) to 10 (best) and Metascore from 1 (worst) to 100 (best).
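To make the record-and-field structure concrete, here is a minimal sketch that parses a tiny CSV snippet with Python's standard `csv` module. The header and the two records are made up for illustration and are not taken from the movie dataset; only the column names echo it.

```python
import csv
import io

# Hypothetical sample: a header line followed by two records.
# A field that itself contains a comma must be wrapped in double quotes.
sample_csv = io.StringIO(
    'Title,Rating,Metascore\n'
    '"Example Movie, Part 2",7.5,68\n'
    'Another Example,6.1,45\n'
)

# DictReader uses the first line as field names and yields one dict per record
records = list(csv.DictReader(sample_csv))

for record in records:
    # Every field is read as text, so numeric columns need an explicit conversion
    print(record['Title'], float(record['Rating']), int(record['Metascore']))
```

`pd.read_csv` performs the same kind of parsing and additionally infers numeric column types.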
df = pd.read_csv('Documents//movie.csv')
df.head()
df.size
df.shape
df.describe(include = 'all')
df.isnull().sum()
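# Column means for the numeric columns only; text columns such as Title and Genre cannot be averaged
df.mean(numeric_only=True)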
df['Revenue (Millions)'] = df['Revenue (Millions)'].fillna(df['Revenue (Millions)'].mean())
df['Metascore'] = df['Metascore'].fillna(df['Metascore'].mean())
df.isnull().sum()
df1 = df.drop(columns = ['Description' ,'Director','Actors'])
df1.head()
Univariate analysis is the simplest form of data analysis. “Uni” means “one”, so the analysis looks at a single variable at a time. It does not deal with causes or relationships (unlike regression); its main purpose is to describe: it takes the data, summarizes it, and finds patterns in it. The key goals of univariate analysis here are to find the outliers and to understand the distribution of each variable, which in turn guides the bivariate and multivariate analysis.
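# Rating: histogram with a kernel density estimate (histplot replaces the deprecated distplot)
x = df1['Rating']
sns.histplot(x=x, kde=True, stat='density')
sns.despine()
# Rating: box plot to show the spread and any outliers
sns.boxplot(x=df1['Rating'])
sns.despine()
# Runtime (Minutes): distribution and box plot
x = df1['Runtime (Minutes)']
sns.histplot(x=x, kde=True, stat='density')
sns.despine()
sns.boxplot(x=df1['Runtime (Minutes)'])
sns.despine()
# Metascore: distribution and box plot
x = df1['Metascore']
sns.histplot(x=x, kde=True, stat='density')
sns.despine()
sns.boxplot(x=df1['Metascore'])
sns.despine()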
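The box plots above mark points beyond the whiskers as outliers. By default the whiskers extend 1.5 times the interquartile range (IQR) past the quartiles, and the same rule can be applied numerically. The sketch below is a minimal illustration using the standard library; the function name `iqr_outliers` and the sample runtimes are hypothetical, not part of the notebook.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # method='inclusive' interpolates linearly between the sorted data points
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up runtimes in minutes; 190 sits far above the rest
sample_runtimes = [90, 95, 100, 105, 110, 115, 120, 190]
print(iqr_outliers(sample_runtimes))
```

On the notebook's data, a call such as `iqr_outliers(df1['Runtime (Minutes)'].tolist())` should list the runtimes that the box plot draws as individual points.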
Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the concept of relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences.
plt.figure(figsize =(15,15))
# Correlate the numeric columns only; text columns cannot be correlated
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
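Each cell of the heatmap is a pairwise correlation coefficient; `DataFrame.corr` computes the Pearson coefficient by default. As a minimal sketch of what that number measures, the hypothetical helper `pearson` below computes it from scratch on made-up values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # The 1/n factors cancel between numerator and denominator, so plain sums are used
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: the second list rises with the first, the third falls
ratings = [5.0, 6.0, 7.0, 8.0]
rising = [40.0, 55.0, 60.0, 80.0]
falling = [80.0, 60.0, 55.0, 40.0]

print(round(pearson(ratings, rising), 3))   # close to +1: strong positive association
print(round(pearson(ratings, falling), 3))  # close to -1: strong negative association
```

A value near 0 would indicate little or no linear association between the two variables.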
sns.regplot(x = 'Year',y = 'Rating',data = df1 , x_jitter=0.2, scatter_kws={'alpha':0.1})
sns.despine()
sns.jointplot(x='Year', y='Rating', data=df1, kind="kde")
sns.despine()
sns.regplot(x = 'Year',y = 'Metascore',data = df1 , x_jitter=0.2, scatter_kws={'alpha':0.1})
sns.despine()
sns.jointplot(x='Year', y='Metascore', data=df1, kind="kde")
sns.despine()
sns.regplot(x = 'Year',y = 'Runtime (Minutes)',data = df1 , x_jitter=0.2, scatter_kws={'alpha':0.1})
sns.despine()
sns.jointplot(x="Year", y='Runtime (Minutes)', data=df1, kind="kde")
sns.despine()
Multivariate analysis (MVA) is based on the statistical principle of multivariate statistics, which involves observation and analysis of more than one statistical outcome variable at a time.
grid = sns.FacetGrid(df1, col='Year',col_wrap = 4)
grid.map(plt.scatter,'Runtime (Minutes)','Rating',alpha = 0.5)
sns.despine()
Word clouds (also known as text clouds or tag clouds) work in a simple way: the more a specific word appears in a source of textual data (such as a speech, blog post, or database), the bigger and bolder it appears in the word cloud.
import wordcloud
from wordcloud import WordCloud, STOPWORDS
# Create a wordcloud of the movie titles
df1['Title'] = df1['Title'].fillna("").astype('str')
title_corpus = ' '.join(df1['Title'])
title_wordcloud = WordCloud(stopwords=STOPWORDS,background_color='black', height=1500, width=4000).generate(title_corpus)
# Plot the wordcloud
plt.figure(figsize=(16,8))
plt.imshow(title_wordcloud)
plt.axis('off')
plt.show()
# Create a wordcloud of the movie genres
df1['Genre'] = df1['Genre'].fillna("").astype('str')
genre_corpus = ' '.join(df1['Genre'])
genre_wordcloud = WordCloud(stopwords=STOPWORDS,background_color='black', height=1500, width=4000).generate(genre_corpus)
# Plot the wordcloud
plt.figure(figsize=(16,8))
plt.imshow(genre_wordcloud)
plt.axis('off')
plt.show()
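The sizing in a word cloud is driven by word counts. A minimal sketch of that counting step is shown below on a few made-up titles with a tiny, hypothetical stop-word set; the actual `WordCloud` class applies its own tokenization and scaling on top of this.

```python
from collections import Counter

# Hypothetical titles and a tiny stop-word set, for illustration only
sample_titles = ['The Dark Knight', 'The Dark Tower', 'Knight and Day']
stop_words = {'the', 'and'}

# Lower-case, split on whitespace, and drop stop words before counting
words = [
    word
    for title in sample_titles
    for word in title.lower().split()
    if word not in stop_words
]
word_counts = Counter(words)

# The most frequent words are the ones a word cloud would draw largest
print(word_counts.most_common(2))
```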
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
tfidf_matrix = tf.fit_transform(df1['Genre'])
print(tfidf_matrix.todense().shape)
a = pd.DataFrame(tfidf_matrix.todense())
a
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)
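The two steps above, TF-IDF vectorization followed by cosine similarity, can be illustrated by hand. The sketch below is a simplified pure-Python version that assumes the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, and the L2 normalization that scikit-learn documents as `TfidfVectorizer` defaults; it is not the library's implementation, and the helper names and sample genre strings are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lower-case and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_vectors(docs):
    """Return the vocabulary and one L2-normalized TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up genre strings in the same comma-separated style as the Genre column
sample_genres = [
    'Action,Adventure,Sci-Fi',
    'Action,Adventure,Sci-Fi',
    'Comedy,Drama',
    'Action,Comedy',
]
vocab, vectors = tfidf_vectors(sample_genres)

print(vocab)
print(round(cosine(vectors[0], vectors[1]), 3))  # identical genres
print(round(cosine(vectors[0], vectors[2]), 3))  # no shared genre
print(round(cosine(vectors[0], vectors[3]), 3))  # one shared genre
```

Movies with identical genre strings score 1, movies with no genre in common score 0, and partial overlaps fall in between, which is the ordering the recommender relies on.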
titles = df1['Title']
# Map each title to its row position in df1
indices = pd.Series(df1.index, index=df1['Title'])
print(indices.head(10))
def genre_recommendations(title):
    idx = indices[title]
    # Pair every movie's row position with its cosine similarity to the requested title
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Exclude the requested movie itself so it is never recommended back
    sim_scores = [s for s in sim_scores if s[0] != idx]
    # Sort the candidates by cosine similarity, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Keep the top 30 recommendations in descending order of similarity score
    sim_scores = sim_scores[:30]
    # Look up the titles by integer location and return them
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]
genre_recommendations('Guardians of the Galaxy')
From the dataset above we built a small content-based (personalized) recommendation system. We first cleaned the dataset by filling in the missing values, then dropped the columns that were not needed for our analysis. After that we performed exploratory data analysis on the movie dataset and looked for trends in the plots.
We then used TF-IDF vectors to convert the genre text into matrix form and computed the cosine similarity between movies, so that we can find the movies most similar to a given title. In the end we succeeded in building a very small version of a personalized recommendation system.