{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reproducing the results of \"The Three Types Of Adam Sandler Movies\" on FiveThirtyEight: http://fivethirtyeight.com/datalab/the-three-types-of-adam-sandler-movies/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Versions"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"sys.version_info(major=3, minor=4, micro=3, releaselevel='final', serial=0)"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sys\n",
"sys.version_info"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'2.6.2'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import requests\n",
"requests.__version__"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'4.3.2'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import bs4\n",
"from bs4 import BeautifulSoup\n",
"bs4.__version__"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'1.9.2'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"np.__version__"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'0.16.0'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"pd.__version__"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'0.16.1'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.cluster import KMeans\n",
"\n",
"import sklearn\n",
"sklearn.__version__"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"BokehJS successfully loaded.\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'0.8.2'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import bokeh.plotting as plt\n",
"from bokeh.models import HoverTool\n",
"plt.output_notebook()\n",
"\n",
"import bokeh\n",
"bokeh.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The original article uses [Rotten Tomatoes](http://www.rottentomatoes.com/celebrity/adam_sandler/) for ratings and [Opus Data](http://www.opusdata.com) for the box office gross. Opus Data is behind a paywall, so here it is replaced with the box office figures listed on Rotten Tomatoes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We fetch the HTML from Rotten Tomatoes and pass it to Beautiful Soup so we can extract the content we want.\n",
"We then use [SelectorGadget](http://selectorgadget.com/) to find the CSS selector of the filmography table, and pass that table to pandas, which returns a DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def get_soup(url):\n",
"    # Download the page and parse it with the html5lib parser\n",
"    r = requests.get(url)\n",
"    return BeautifulSoup(r.text, 'html5lib')"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"rotten_sandler_url = 'http://www.rottentomatoes.com/celebrity/adam_sandler/'"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"soup = get_soup(rotten_sandler_url)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"films_table = str(soup.select('#filmography_box table:first-child')[0])"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"rotten = pd.read_html(films_table)[0]"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" RATING TITLE CREDIT \\\n",
"0 NaN Hello Ghost Actor Producer \n",
"1 9% The Cobbler Max Simkin \n",
"2 NaN Pixels Producer Screenwriter Sam Brenner \n",
"3 NaN Candy Land Actor \n",
"4 6% Paul Blart: Mall Cop 2 Producer \n",
"\n",
" BOX OFFICE YEAR \n",
"0 -- 2015 \n",
"1 -- 2015 \n",
"2 -- 2015 \n",
"3 -- 2015 \n",
"4 $43.2M 2015 "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rotten.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We convert the \"RATING\" and \"BOX OFFICE\" columns to numeric values by stripping the `%`, `$` and `M` characters. Missing box office values (shown as `--`) are first mapped to zero and then replaced with `numpy.nan`."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"rotten.RATING = rotten.RATING.str.replace('%', '').astype(float)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [],
"source": [
"rotten['BOX OFFICE'] = rotten['BOX OFFICE'].str.replace('$', '').str.replace('M', '').str.replace('-', '0')\n",
"rotten['BOX OFFICE'] = rotten['BOX OFFICE'].astype(float)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"rotten.loc[rotten['BOX OFFICE'] == 0, ['BOX OFFICE']] = np.nan"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" RATING TITLE CREDIT \\\n",
"0 NaN Hello Ghost Actor Producer \n",
"1 9 The Cobbler Max Simkin \n",
"2 NaN Pixels Producer Screenwriter Sam Brenner \n",
"3 NaN Candy Land Actor \n",
"4 6 Paul Blart: Mall Cop 2 Producer \n",
"\n",
" BOX OFFICE YEAR \n",
"0 NaN 2015 \n",
"1 NaN 2015 \n",
"2 NaN 2015 \n",
"3 NaN 2015 \n",
"4 43.2 2015 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rotten.head()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"rotten = rotten.set_index('TITLE')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we save the dataset."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"rotten.to_csv('rotten.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Chart"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the original chart, for comparison:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from IPython.display import Image"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Image(url='https://espnfivethirtyeight.files.wordpress.com/2015/04/hickey-datalab-sandler.png?w=610&h=634')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We load the saved data into a DataFrame and plot it with Bokeh, which provides interactive features (such as hover tooltips) that the original chart does not have. This makes it easier to explore the individual movies."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"rotten = pd.read_csv('rotten.csv', index_col=0)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"rotten = rotten.dropna()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"37"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(rotten)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Paul Blart: Mall Cop 2', 'Blended', 'Top Five', 'Grown Ups 2', 'That's My Boy', 'Hotel Transylvania', 'Here Comes the Boom', 'Jack and Jill', 'Zookeeper', 'Just Go with It', 'Bucky Larson: Born to Be a Star', 'Grown Ups', 'Funny People', 'Paul Blart: Mall Cop', 'You Don't Mess With the Zohan', 'The House Bunny', 'Bedtime Stories', 'Strange Wilderness', 'I Now Pronounce You Chuck & Larry', 'Reign Over Me', 'Grandma's Boy', 'Click', 'The Benchwarmers', 'Deuce Bigalow: European Gigolo', 'The Longest Yard', 'Spanglish', '50 First Dates', 'Dickie Roberts: Former Child Star', 'Anger Management', 'The Hot Chick', 'Mr. Deeds', 'Adam Sandler's Eight Crazy Nights', 'The Master of Disguise', 'Punch-Drunk Love', 'Joe Dirt', 'The Animal', 'Little Nicky'], dtype='object')"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rotten.index"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"source = plt.ColumnDataSource(\n",
"    data=dict(\n",
"        rating=rotten.RATING,\n",
"        gross=rotten['BOX OFFICE'],\n",
"        movie=rotten.index,\n",
"    )\n",
")\n",
"\n",
"p = plt.figure(tools='reset,save,hover', x_range=[0, 100], title='',\n",
"               x_axis_label=\"Rotten Tomatoes rating\", y_axis_label=\"Box Office Gross\")\n",
"p.scatter(rotten.RATING, rotten['BOX OFFICE'], size=10, source=source)\n",
"\n",
"hover = p.select(dict(type=HoverTool))\n",
"\n",
"hover.tooltips = [\n",
" (\"Movie\", \"@movie\"),\n",
" (\"Rating\", \"@rating\"),\n",
" (\"Box Office Gross\", \"@gross\"),\n",
"]\n",
"\n",
"plt.show(p)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clusters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The article also mentions some simple clustering of the dataset. We can reproduce it with scikit-learn's `KMeans`."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"X = rotten[['RATING', 'BOX OFFICE']].values"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"clf = KMeans(n_clusters=3)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,\n",
" n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,\n",
" verbose=0)"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clf.fit(X)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 0, 2, 1, 0, 1, 0, 0, 0, 1, 0, 1, 2, 1, 1, 0, 1, 0, 1, 2, 0, 1, 0,\n",
" 0, 1, 2, 1, 0, 1, 0, 1, 0, 0, 2, 0, 0, 0], dtype=int32)"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clusters = clf.predict(X)\n",
"clusters"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"colors = clusters.astype(str)\n",
"colors[clusters == 0] = 'green'\n",
"colors[clusters == 1] = 'red'\n",
"colors[clusters == 2] = 'gold'"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"source = plt.ColumnDataSource(\n",
"    data=dict(\n",
"        rating=rotten.RATING,\n",
"        gross=rotten['BOX OFFICE'],\n",
"        movie=rotten.index,\n",
"    )\n",
")\n",
"\n",
"p = plt.figure(tools='reset,save,hover', x_range=[0, 100], title='',\n",
"               x_axis_label=\"Rotten Tomatoes rating\", y_axis_label=\"Box Office Gross\")\n",
"p.scatter(rotten.RATING, rotten['BOX OFFICE'], size=10, source=source, color=colors)\n",
"\n",
"hover = p.select(dict(type=HoverTool))\n",
"\n",
"hover.tooltips = [\n",
" (\"Movie\", \"@movie\"),\n",
" (\"Rating\", \"@rating\"),\n",
" (\"Box Office Gross\", \"@gross\"),\n",
"]\n",
"\n",
"plt.show(p)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is similar to the one in the original article. As mentioned earlier, the box office figures come from a different source, so the clusters are not exactly the same."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## IMDB"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What happens if we use IMDB ratings instead of Rotten Tomatoes ratings?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### IMDB: Ratings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We apply a similar procedure to get the data from IMDB, using a basic crawler that visits each movie page."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"imdb_sandler_url = 'http://www.imdb.com/name/nm0001191/'"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"soup = get_soup(imdb_sandler_url)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"a_tags = soup.select('div#filmo-head-actor + div b a')"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[The Ridiculous 6,\n",
" Hotel Transylvania 2,\n",
" Pixels,\n",
" The Cobbler,\n",
" Men, Women & Children]"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a_tags[:5]"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"movies = {}\n",
"for a_tag in a_tags:\n",
" movie_name = a_tag.text\n",
" movie_url = 'http://www.imdb.com' + a_tag['href']\n",
" soup = get_soup(movie_url)\n",
" rating = soup.select('.star-box-giga-star')\n",
" if len(rating) == 1:\n",
" movies[movie_name] = float(rating[0].text)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ratings = pd.DataFrame.from_dict(movies, orient='index')\n",
"ratings.columns = ['rating']"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" rating\n",
"The Hot Chick 5.5\n",
"The Animal 4.8\n",
"Deuce Bigalow: Male Gigolo 5.7\n",
"Happy Gilmore 7.0\n",
"Eight Crazy Nights 5.4"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ratings.head()"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"53"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(ratings)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ratings.index.name = 'Title'"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"ratings.to_csv('imdb-ratings.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### IMDB: Box Office Mojo"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"IMDB also provides the Box Office Gross information from [Box Office Mojo](http://www.boxofficemojo.com)."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"box_sandler_url = 'http://www.boxofficemojo.com/people/chart/?view=Actor&id=adamsandler.htm'"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"soup = get_soup(box_sandler_url)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"box_gross_table = str(soup.select('br + table')[0])"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"gross = pd.read_html(box_gross_table, header=0)[0]"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" Date Title (click to view) Studio Lifetime Gross / Theaters \\\n",
"0 10/1/14 Men, Women & Children Par. $705,908 \n",
"1 5/23/14 Blended WB $46,294,610 \n",
"2 7/12/13 Grown Ups 2 Sony $133,668,525 \n",
"3 9/28/12 Hotel Transylvania(Voice) Sony $148,313,048 \n",
"4 6/15/12 That's My Boy Sony $36,931,089 \n",
"\n",
" Opening / Theaters Rank Unnamed: 6 Unnamed: 7 \n",
"0 608 $48,024 17 30 \n",
"1 3555 $14,284,031 3555 18 \n",
"2 3491 $41,508,572 3491 8 \n",
"3 3375 $42,522,194 3349 5 \n",
"4 3030 $13,453,714 3030 22 "
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gross.head()"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# The parsed column names are shifted relative to the data (see the output above):\n",
"# the gross ends up under 'Lifetime Gross / Theaters'. Keep only date, title and gross.\n",
"gross.drop(['Unnamed: 6', 'Unnamed: 7', 'Opening / Theaters', 'Rank', 'Studio'],\n",
"           axis=1, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"gross.columns = ['Date', 'Title', 'Gross']"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"gross.set_index('Title', inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"gross.Gross = gross.Gross.str.replace(r'[$,]', '').astype(int)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" Date Gross\n",
"Title \n",
"Men, Women & Children 10/1/14 705908\n",
"Blended 5/23/14 46294610\n",
"Grown Ups 2 7/12/13 133668525\n",
"Hotel Transylvania(Voice) 9/28/12 148313048\n",
"That's My Boy 6/15/12 36931089"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gross.head()"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"gross.to_csv('imdb-gross.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### IMDB: Analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We load both datasets and plot the same quantities as before: rating against box office gross."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ratings = pd.read_csv('imdb-ratings.csv', index_col=0)"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"gross = pd.read_csv('imdb-gross.csv', index_col=0)"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"gross.Gross = gross.Gross / 1e6"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"53"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(ratings)"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"37"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(gross)"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Rename the title before joining with the ratings (the join matches on the title)\n",
"gross = gross.rename(index={'Just Go With It': 'Just Go with It'})"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"gross.ix['I Now Pronounce You Chuck & Larry'] = gross.ix['I Now Pronounce You Chuck and Larry']\n",
"gross = gross.drop('I Now Pronounce You Chuck and Larry')"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"imdb = gross.join(ratings)"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(37, 33)"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(imdb), len(imdb.dropna())"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"imdb = imdb.dropna()"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"source = plt.ColumnDataSource(\n",
"    data=dict(\n",
"        rating=imdb.rating,\n",
"        gross=imdb.Gross,\n",
"        movie=imdb.index,\n",
"    )\n",
")\n",
"\n",
"p = plt.figure(tools='reset,save,hover', x_range=[0, 10], title='',\n",
"               x_axis_label=\"IMDB rating\", y_axis_label=\"Box Office Gross\")\n",
"p.scatter(imdb.rating, imdb.Gross, size=10, source=source)\n",
"\n",
"hover = p.select(dict(type=HoverTool))\n",
"hover.tooltips = [\n",
" (\"Movie\", \"@movie\"),\n",
" (\"Rating\", \"@rating\"),\n",
" (\"Box Office Gross\", \"@gross\"),\n",
"]\n",
"\n",
"plt.show(p)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Interestingly, the result is very different: in this case you can see two clusters, the movies that grossed more than $100M and those that grossed less."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"X = imdb[['rating', 'Gross']].values"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"clf = KMeans(n_clusters=2)"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,\n",
" n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,\n",
" verbose=0)"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clf.fit(X)"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1,\n",
" 1, 1, 1, 1, 1, 1, 1, 1, 0, 0], dtype=int32)"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clusters = clf.predict(X)\n",
"clusters"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"colors = clusters.astype(str)\n",
"colors[clusters == 0] = 'green'\n",
"colors[clusters == 1] = 'red'"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"source = plt.ColumnDataSource(\n",
"    data=dict(\n",
"        rating=imdb.rating,\n",
"        gross=imdb.Gross,\n",
"        movie=imdb.index,\n",
"    )\n",
")\n",
"\n",
"p = plt.figure(tools='reset,save,hover', x_range=[0, 10], title='',\n",
"               x_axis_label=\"IMDB rating\", y_axis_label=\"Box Office Gross\")\n",
"p.scatter(imdb.rating, imdb.Gross, size=10, source=source, color=colors)\n",
"\n",
"hover = p.select(dict(type=HoverTool))\n",
"hover.tooltips = [\n",
" (\"Movie\", \"@movie\"),\n",
" (\"Rating\", \"@rating\"),\n",
" (\"Box Office Gross\", \"@gross\"),\n",
"]\n",
"\n",
"plt.show(p)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}