<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Case Study: Movies Dataset </p>
<br>This notebook uses a dataset from Kaggle. We will describe the dataset further as we explore it using *pandas*.

## Download the Dataset

Please note that **you will need to download the dataset** from the course website. 

You can find the data at https://junyounglim.github.io/. Unzip the file to a location of your choice.

Here are instructions on how to unzip a file in Windows: https://support.microsoft.com/en-us/help/14200/windows-compress-uncompress-zip-files. 
For Macs, simply double-click on the file. 




<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Use Pandas to Read the Dataset<br>
</p>
<br>
In this notebook, we will be using a CSV file:
* **tmdb_5000_movies.csv**

The dataset contains about 5000 movies. 

The following are the features: 
    budget, genres, homepage, id, keywords, original_language, original_title, overview, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, vote_average, vote_count


Using the *read_csv* function in pandas, we will load this file into a DataFrame.

In [95]:
# import pandas and load data
import pandas as pd

filepath = 'tmdb_5000_movies.csv'
movies = pd.read_csv(filepath)

# or, equivalently, pass the file name directly
movies = pd.read_csv('tmdb_5000_movies.csv')

In [74]:
# Now that we have the dataset we will start to get a feeling for its layout
movies.head(5)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",12/10/09,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",5/19/07,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",10/26/15,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",7/16/12,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",3/7/12,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


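Our dataset is loaded and looks OK, but some cleaning is needed. Notice that several columns (genres, keywords, production_companies, production_countries, and spoken_languages) hold JSON-like strings describing lists of objects rather than plain values.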
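
One quick way to confirm this is to look at the Python type of a single cell. The sketch below uses a tiny hand-made frame rather than the real file; the `sample` frame and its values are assumptions made for illustration, with the `genres` value copied from the shape shown in the output above.

```python
import pandas as pd

# Toy stand-in for one row of the real data (illustrative values only)
sample = pd.DataFrame({
    "title": ["Avatar"],
    "genres": ['[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'],
})

cell = sample.loc[0, "genres"]

# The cell is one long string, not a Python list of dictionaries
print(type(cell).__name__)
print(cell[:20])
```

Because the value is a plain string, list operations such as membership tests on genre names will not work until it is converted.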

In [96]:
import re

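# Each column converted below stores a JSON-like string. Splitting on '"' puts the
# wanted value at a fixed position within every object, so we pick pieces by index.

def to_list_4(strng):
    # Objects shaped like {"id": ..., "name": ...}: keep each "name" value
    return [item for index, item in enumerate(strng.split('"')) if (index + 1) % 6 == 0]

def to_list_2(strng):
    # Objects shaped like {"iso_...": ..., "name": ...}: keep each ISO code
    return [item for index, item in enumerate(strng.split('"')) if (index + 5) % 8 == 0]

def to_list_2_mod(strng):
    # Objects shaped like {"name": ..., "id": ...}: keep each "name" value
    return [item for index, item in enumerate(strng.split('"')) if (index + 3) % 6 == 0]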

movies.genres = movies.genres.apply(to_list_4)
movies.keywords = movies.keywords.apply(to_list_4)
movies.production_companies = movies.production_companies.apply(to_list_2_mod)
movies.production_countries = movies.production_countries.apply(to_list_2)
movies.spoken_languages = movies.spoken_languages.apply(to_list_2)
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[Action, Adventure, Fantasy, Science Fiction]",http://www.avatarmovie.com/,19995,"[culture clash, future, space war, space colon...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[Ingenious Film Partners, Twentieth Century Fo...","[US, GB]",12/10/09,2787965087,162.0,"[en, es]",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[Adventure, Fantasy, Action]",http://disney.go.com/disneypictures/pirates/,285,"[ocean, drug abuse, exotic island, east india ...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[Walt Disney Pictures, Jerry Bruckheimer Films...",[US],5/19/07,961000000,169.0,[en],Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[Action, Adventure, Crime]",http://www.sonypictures.com/movies/spectre/,206647,"[spy, based on novel, secret agent, sequel, mi...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[Columbia Pictures, Danjaq, B24]","[GB, US]",10/26/15,880674609,148.0,"[fr, en, es, it, de]",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[Action, Crime, Drama, Thriller]",http://www.thedarkknightrises.com/,49026,"[dc comics, crime fighter, terrorist, secret i...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[Legendary Pictures, Warner Bros., DC Entertai...",[US],7/16/12,1084939099,165.0,[en],Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[Action, Adventure, Science Fiction]",http://movies.disney.com/john-carter,49529,"[based on novel, mars, medallion, space travel...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,[Walt Disney Pictures],[US],3/7/12,284139100,132.0,[en],Released,"Lost in our world, found in another.",John Carter,6.1,2124
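
The quote-splitting helpers above rely on every object having its keys in a fixed order and on no value containing a double quote. A possibly more robust alternative, sketched here as an assumption rather than as the notebook's method, is to parse each raw string with the standard-library `json` module. The helper name `extract_field` and the two sample strings are hypothetical.

```python
import json

def extract_field(raw, field="name"):
    """Parse a JSON-encoded list of objects and return one field from each."""
    return [entry[field] for entry in json.loads(raw)]

# Sample strings in the same shape as the raw (unconverted) columns
genres_raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
countries_raw = '[{"iso_3166_1": "US", "name": "United States of America"}]'

print(extract_field(genres_raw))                   # ['Action', 'Adventure']
print(extract_field(countries_raw, "iso_3166_1"))  # ['US']
```

On the real data this would have to run on the freshly loaded columns, before the conversion cell above, for example with `movies['genres'].apply(extract_field)`.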


<h1 style="font-size:2em;color:#2467C0">Descriptive Statistics</h1>

Pandas also provides some basic quantitative functions to understand our data. 

In [76]:
movies['budget'].describe()

count    4.803000e+03
mean     2.904504e+07
std      4.072239e+07
min      0.000000e+00
25%      7.900000e+05
50%      1.500000e+07
75%      4.000000e+07
max      3.800000e+08
Name: budget, dtype: float64

In [77]:
movies.describe()

Unnamed: 0,budget,id,popularity,revenue,runtime,vote_average,vote_count
count,4803.0,4803.0,4803.0,4803.0,4753.0,4803.0,4803.0
mean,29045040.0,57165.484281,21.492301,82260640.0,106.853777,6.092172,690.217989
std,40722390.0,88694.614033,31.81665,162857100.0,22.614586,1.194612,1234.585891
min,0.0,5.0,0.0,0.0,0.0,0.0,0.0
25%,790000.0,9014.5,4.66807,0.0,94.0,5.6,54.0
50%,15000000.0,14629.0,12.921594,19170000.0,104.0,6.2,235.0
75%,40000000.0,58610.5,28.313505,92917190.0,117.0,6.8,737.0
max,380000000.0,459488.0,875.581305,2787965000.0,338.0,10.0,13752.0


In [78]:
movies['vote_average'].mean()

6.092171559442011

In [79]:
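# numeric_only=True skips the text and list columns (needed in pandas 2.0 and later)
movies.mean(numeric_only=True)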

budget          2.904504e+07
id              5.716548e+04
popularity      2.149230e+01
revenue         8.226064e+07
runtime         1.068538e+02
vote_average    6.092172e+00
vote_count      6.902180e+02
dtype: float64

In [80]:
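# numeric_only=True skips the text and list columns (needed in pandas 2.0 and later)
movies.corr(numeric_only=True)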

Unnamed: 0,budget,id,popularity,revenue,runtime,vote_average,vote_count
budget,1.0,-0.089377,0.505414,0.730823,0.267915,0.093146,0.59318
id,-0.089377,1.0,0.031202,-0.050425,-0.152379,-0.270595,-0.004128
popularity,0.505414,0.031202,1.0,0.644724,0.224205,0.273952,0.77813
revenue,0.730823,-0.050425,0.644724,1.0,0.249986,0.19715,0.781487
runtime,0.267915,-0.152379,0.224205,0.249986,1.0,0.377139,0.272182
vote_average,0.093146,-0.270595,0.273952,0.19715,0.377139,1.0,0.312997
vote_count,0.59318,-0.004128,0.77813,0.781487,0.272182,0.312997,1.0
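
A full correlation matrix can be hard to scan. One common way to read it is to pull out a single column and sort it. The sketch below does this on a small made-up frame; the `toy` values are invented for illustration and are not taken from the movies dataset.

```python
import pandas as pd

# Made-up numbers, only to show the mechanics
toy = pd.DataFrame({
    "budget":  [10, 20, 30, 40, 50],
    "revenue": [12, 25, 31, 48, 55],
    "runtime": [120, 95, 130, 90, 110],
})

corr = toy.corr()

# Correlation of every other column with revenue, strongest first
revenue_corr = corr["revenue"].drop("revenue").sort_values(ascending=False)
print(revenue_corr)
```

Applied to the matrix above, the same pattern would list vote_count and budget as the columns most strongly correlated with revenue.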


We can also filter information conditionally. 

In [81]:
# Boolean mask: True for every movie rated above 8.0
filter_1 = movies['vote_average'] > 8.0
# any() is True if at least one movie passes the filter
print(filter_1.any())
movies[filter_1]

True


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
65,185000000,"[Drama, Action, Crime, Thriller]",http://thedarkknight.warnerbros.com/dvdsite/,155,"[dc comics, crime fighter, secret identity, sc...",en,The Dark Knight,Batman raises the stakes in his war on crime. ...,187.322927,"[DC Comics, Legendary Pictures, Warner Bros., ...","[GB, US]",7/16/08,1004558444,152.0,"[en, zh]",Released,Why So Serious?,The Dark Knight,8.2,12002
95,165000000,"[Adventure, Drama, Science Fiction]",http://www.interstellarmovie.net/,157336,"[saving the world, artificial intelligence, fa...",en,Interstellar,Interstellar chronicles the adventures of a gr...,724.247784,"[Paramount Pictures, Legendary Pictures, Warne...","[CA, US, GB]",11/5/14,675120017,169.0,[en],Released,Mankind was born on Earth. It was never meant ...,Interstellar,8.1,10867
96,160000000,"[Action, Thriller, Science Fiction, Mystery, A...",http://inceptionmovie.warnerbros.com/,27205,"[loss of lover, dream, kidnapping, sleep, subc...",en,Inception,"Cobb, a skilled thief who commits corporate es...",167.58371,"[Legendary Pictures, Warner Bros., Syncopy]","[GB, US]",7/14/10,825532764,148.0,"[en, ja, fr]",Released,Your mind is the scene of the crime.,Inception,8.1,13752
329,94000000,"[Adventure, Fantasy, Action]",http://www.lordoftherings.net,122,"[elves, orcs, middle-earth (tolkien), based on...",en,The Lord of the Rings: The Return of the King,Aragorn is revealed as the heir to the ancient...,123.630332,"[WingNut Films, New Line Cinema]","[NZ, US]",12/1/03,1118888979,201.0,[en],Released,The eye of the enemy is moving.,The Lord of the Rings: The Return of the King,8.1,8064
662,63000000,[Drama],http://www.foxmovies.com/movies/fight-club,550,"[support group, dual identity, nihilism, rage ...",en,Fight Club,A ticking-time-bomb insomniac and a slippery s...,146.757391,"[Regency Enterprises, Fox 2000 Pictures, Tauru...","[DE, US]",10/15/99,100853753,,[en],Released,Mischief. Mayhem. Soap.,Fight Club,8.3,9413
690,60000000,"[Fantasy, Drama, Crime]",http://thegreenmile.warnerbros.com/,497,"[southern usa, black people, mentally disabled...",en,The Green Mile,A supernatural tale set on death row in a Sout...,103.698022,"[Castle Rock Entertainment, Darkwoods Producti...",[US],12/10/99,284600000,189.0,"[fr, en]",Released,Miracles do happen.,The Green Mile,8.2,4048
809,55000000,"[Comedy, Drama, Romance]",,13,"[vietnam veteran, hippie, mentally disabled, r...",en,Forrest Gump,A man with a low IQ has accomplished great thi...,138.133331,[Paramount Pictures],[US],7/6/94,677945399,142.0,[en],Released,"The world will never be the same, once you've ...",Forrest Gump,8.2,7927
1553,33000000,"[Crime, Mystery, Thriller]",http://www.sevenmovie.com/,807,"[self-fulfilling prophecy, detective, s.w.a.t....",en,Se7en,Two homicide detectives are on a desperate hun...,79.579532,"[New Line Cinema, Juno Pix, Cecchi Gori Pictures]",[US],9/22/95,327311859,127.0,[en],Released,Seven deadly sins. Seven ways to die.,Se7en,8.1,5765
1663,30000000,"[Drama, Crime]",,311,"[life and death, corruption, street gang, rape...",en,Once Upon a Time in America,A former Prohibition-era Jewish gangster retur...,49.336397,"[Warner Bros., The Ladd Company]","[US, IT]",2/16/84,0,229.0,"[en, fr, it]",Released,"Crime, passion and lust for power - Sergio Leo...",Once Upon a Time in America,8.2,1069
1818,22000000,"[Drama, History, War]",http://www.schindlerslist.com/,424,"[factory, concentration camp, hero, holocaust,...",en,Schindler's List,The true story of how businessman Oskar Schind...,104.469351,"[Universal Pictures, Amblin Entertainment]",[US],11/29/93,321365567,195.0,"[de, pl, he, en]",Released,"Whoever saves one life, saves the world entire.",Schindler's List,8.3,4329


In [82]:
filter_2 = movies['vote_average'] > 8.0
filter_2.all()

False
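
Boolean masks like these can also be combined. The following sketch, on a small made-up frame (the `toy` data and the 1000-vote threshold are assumptions for illustration), shows `&` and `|` together with `any()` and `all()`.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "vote_average": [8.3, 7.1, 8.1, 5.9],
    "vote_count": [9000, 150, 40, 2000],
})

high_rating = toy["vote_average"] > 8.0
well_voted = toy["vote_count"] >= 1000

# & keeps rows where both masks are True; | keeps rows where either is True
both = toy[high_rating & well_voted]
either = toy[high_rating | well_voted]

print(both["title"].tolist())                 # ['A']
print(either["title"].tolist())               # ['A', 'C', 'D']
print(high_rating.any(), high_rating.all())   # True False
```

When the comparisons are written inline instead of stored in variables, each one needs its own parentheses, because `&` and `|` bind more tightly than `>` and `>=`.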

<h1 style="font-size:2em;color:#2467C0">Handling Missing Data</h1>

In [83]:
movies.shape

(4803, 20)

In [97]:
# Check whether each column contains any null values
movies.isnull().any()

budget                  False
genres                  False
homepage                 True
id                      False
keywords                False
original_language       False
original_title          False
overview                 True
popularity              False
production_companies    False
production_countries    False
release_date             True
revenue                 False
runtime                  True
spoken_languages        False
status                  False
tagline                  True
title                   False
vote_average            False
vote_count              False
dtype: bool

Let's start by filling the missing values in the text columns.

In [98]:
# Work on a copy so the original `movies` DataFrame is left unchanged
movies_filled = movies.copy()

movies_filled['homepage'] = movies_filled['homepage'].fillna(value='None')
movies_filled['overview'] = movies_filled['overview'].fillna(value='')
movies_filled['tagline'] = movies_filled['tagline'].fillna(value='')

Now let's handle the remaining columns with missing values, *runtime* and *release_date*.

In [99]:
print(movies_filled['runtime'].isnull().sum())
print(movies_filled['release_date'].isnull().sum())

50
1


In [102]:
# Fill missing runtimes with the median runtime
movies_filled['runtime'] = movies_filled['runtime'].fillna(value=movies_filled['runtime'].median())

In [103]:
# Drop any row that still has a missing value (here, the one missing release_date)
movies_filled = movies_filled.dropna()

In [104]:
movies_filled.isnull().any()

budget                  False
genres                  False
homepage                False
id                      False
keywords                False
original_language       False
original_title          False
overview                False
popularity              False
production_companies    False
production_countries    False
release_date            False
revenue                 False
runtime                 False
spoken_languages        False
status                  False
tagline                 False
title                   False
vote_average            False
vote_count              False
dtype: bool

That's nice! No null values remain.
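
The same three steps (fill text gaps, fill numeric gaps with the median, drop what is left) can be tried on a small hand-made frame. This is a sketch with made-up values. The column names mirror the real ones, and `dropna(subset=...)` is used here as an assumption about which column should trigger a drop, in place of the blanket `dropna()` used above.

```python
import pandas as pd

nan = float("nan")
toy = pd.DataFrame({
    "runtime": [100.0, nan, 120.0, 90.0],
    "release_date": ["1/1/00", "2/2/01", None, "3/3/02"],
    "tagline": ["x", None, "y", "z"],
})

cleaned = toy.copy()  # keep the original frame untouched
cleaned["tagline"] = cleaned["tagline"].fillna("")
cleaned["runtime"] = cleaned["runtime"].fillna(cleaned["runtime"].median())
# Drop only rows whose release_date is missing
cleaned = cleaned.dropna(subset=["release_date"])

print(len(toy) - len(cleaned))        # rows dropped
print(cleaned.isnull().sum().sum())   # remaining null values
```

Restricting `dropna` to a subset makes it explicit which gaps cost a row, which can help when a frame has other columns where missing values are acceptable.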