{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Reproducing the results of \"The Three Types Of Adam Sandler Movies\" on FiveThirtyEight: http://fivethirtyeight.com/datalab/the-three-types-of-adam-sandler-movies/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Versions" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "sys.version_info(major=3, minor=4, micro=3, releaselevel='final', serial=0)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sys\n", "sys.version_info" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'2.6.2'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import requests\n", "requests.__version__" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'4.3.2'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import bs4\n", "from bs4 import BeautifulSoup\n", "bs4.__version__" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'1.9.2'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "np.__version__" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'0.16.0'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "pd.__version__" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'0.16.1'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from 
sklearn.cluster import KMeans\n", "\n", "import sklearn\n", "sklearn.__version__" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ " \n", " \n", " \n", " \n", "
\n", " \n", " BokehJS successfully loaded.\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'0.8.2'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import bokeh.plotting as plt\n", "from bokeh.models import HoverTool\n", "plt.output_notebook()\n", "\n", "import bokeh\n", "bokeh.__version__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Original article is based on [Rotten Tomatoes](http://www.rottentomatoes.com/celebrity/adam_sandler/) for ratings and [Opus Data](http://www.opusdata.com) for the Box Office Gross. The second one is behind a paywall so it was replaced it with the same the Box Office data on Rotten Tomatoes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get the html from rottern tomatoes and pass it to beautiful soup so we can extract the content we want.\n", "Then we used [selector gadget](http://selectorgadget.com/) to get the CSS selector of the table that we can pass to pandas and they will return a nice DataFrame." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def get_soup(url):\n", " r = requests.get(url)\n", " return BeautifulSoup(r.text, 'html5lib')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "rotten_sandler_url = 'http://www.rottentomatoes.com/celebrity/adam_sandler/'" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "soup = get_soup(rotten_sandler_url)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "films_table = str(soup.select('#filmography_box table:first-child')[0])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rotten = pd.read_html(films_table)[0]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>RATING</th><th>TITLE</th><th>CREDIT</th><th>BOX OFFICE</th><th>YEAR</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>NaN</td><td>Hello Ghost</td><td>Actor Producer</td><td>--</td><td>2015</td></tr>\n",
"<tr><th>1</th><td>9%</td><td>The Cobbler</td><td>Max Simkin</td><td>--</td><td>2015</td></tr>\n",
"<tr><th>2</th><td>NaN</td><td>Pixels</td><td>Producer Screenwriter Sam Brenner</td><td>--</td><td>2015</td></tr>\n",
"<tr><th>3</th><td>NaN</td><td>Candy Land</td><td>Actor</td><td>--</td><td>2015</td></tr>\n",
"<tr><th>4</th><td>6%</td><td>Paul Blart: Mall Cop 2</td><td>Producer</td><td>$43.2M</td><td>2015</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " RATING TITLE CREDIT \\\n", "0 NaN Hello Ghost Actor Producer \n", "1 9% The Cobbler Max Simkin \n", "2 NaN Pixels Producer Screenwriter Sam Brenner \n", "3 NaN Candy Land Actor \n", "4 6% Paul Blart: Mall Cop 2 Producer \n", "\n", " BOX OFFICE YEAR \n", "0 -- 2015 \n", "1 -- 2015 \n", "2 -- 2015 \n", "3 -- 2015 \n", "4 $43.2M 2015 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rotten.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We convert the \"Rating\" and \"Box Office\" columns to numeric values with some simple transformations removing some text characters and also replacing empty values with `numpy.nan`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rotten.RATING = rotten.RATING.str.replace('%', '').astype(float)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "rotten['BOX OFFICE'] = rotten['BOX OFFICE'].str.replace('$', '').str.replace('M', '').str.replace('-', '0')\n", "rotten['BOX OFFICE'] = rotten['BOX OFFICE'].astype(float)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rotten.loc[rotten['BOX OFFICE'] == 0, ['BOX OFFICE']] = np.nan" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>RATING</th><th>TITLE</th><th>CREDIT</th><th>BOX OFFICE</th><th>YEAR</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>NaN</td><td>Hello Ghost</td><td>Actor Producer</td><td>NaN</td><td>2015</td></tr>\n",
"<tr><th>1</th><td>9</td><td>The Cobbler</td><td>Max Simkin</td><td>NaN</td><td>2015</td></tr>\n",
"<tr><th>2</th><td>NaN</td><td>Pixels</td><td>Producer Screenwriter Sam Brenner</td><td>NaN</td><td>2015</td></tr>\n",
"<tr><th>3</th><td>NaN</td><td>Candy Land</td><td>Actor</td><td>NaN</td><td>2015</td></tr>\n",
"<tr><th>4</th><td>6</td><td>Paul Blart: Mall Cop 2</td><td>Producer</td><td>43.2</td><td>2015</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " RATING TITLE CREDIT \\\n", "0 NaN Hello Ghost Actor Producer \n", "1 9 The Cobbler Max Simkin \n", "2 NaN Pixels Producer Screenwriter Sam Brenner \n", "3 NaN Candy Land Actor \n", "4 6 Paul Blart: Mall Cop 2 Producer \n", "\n", " BOX OFFICE YEAR \n", "0 NaN 2015 \n", "1 NaN 2015 \n", "2 NaN 2015 \n", "3 NaN 2015 \n", "4 43.2 2015 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rotten.head()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rotten = rotten.set_index('TITLE')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We finaly save the dataset." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "rotten.to_csv('rotten.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chart" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the original chart for comparison" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from IPython.display import Image" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Image(url='https://espnfivethirtyeight.files.wordpress.com/2015/04/hickey-datalab-sandler.png?w=610&h=634')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We load the saved data into a DataFrame and we just plot using bokeh that gives some nice interactive features that the original chart do not have, this makes it easier to explore the different movies." 
] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "rotten = pd.read_csv('rotten.csv', index_col=0)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rotten = rotten.dropna()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "37" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(rotten)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['Paul Blart: Mall Cop 2', 'Blended', 'Top Five', 'Grown Ups 2', 'That's My Boy', 'Hotel Transylvania', 'Here Comes the Boom', 'Jack and Jill', 'Zookeeper', 'Just Go with It', 'Bucky Larson: Born to Be a Star', 'Grown Ups', 'Funny People', 'Paul Blart: Mall Cop', 'You Don't Mess With the Zohan', 'The House Bunny', 'Bedtime Stories', 'Strange Wilderness', 'I Now Pronounce You Chuck & Larry', 'Reign Over Me', 'Grandma's Boy', 'Click', 'The Benchwarmers', 'Deuce Bigalow: European Gigolo', 'The Longest Yard', 'Spanglish', '50 First Dates', 'Dickie Roberts: Former Child Star', 'Anger Management', 'The Hot Chick', 'Mr. Deeds', 'Adam Sandler's Eight Crazy Nights', 'The Master of Disguise', 'Punch-Drunk Love', 'Joe Dirt', 'The Animal', 'Little Nicky'], dtype='object')" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rotten.index" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "source = plt.ColumnDataSource(\n", " data=dict(\n", " rating=rotten.RATING,\n", " gross=rotten['BOX OFFICE'],\n", " movie=rotten.index,\n", " )\n", ")\n", "\n", "p = plt.figure(tools='reset,save,hover', x_range=[0, 100], title='',\n", " x_axis_label=\"Rotten Tomatoes rating\", y_axis_label=\"Box Office Gross\")\n", "p.scatter(rotten.RATING, rotten['BOX OFFICE'], size=10, source=source)\n", "\n", "hover = p.select(dict(type=HoverTool))\n", "\n", "hover.tooltips = [\n", " (\"Movie\", \"@movie\"),\n", " (\"Rating\", \"@rating\"),\n", " (\"Box Office Gross\", \"@gross\"),\n", "]\n", "\n", "plt.show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clusters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The article also mentions some simple clustering of the dataset into three groups of movies. We can reproduce that with k-means from scikit-learn." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X = rotten[['RATING', 'BOX OFFICE']].values" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": true }, "outputs": [], "source": [ "clf = KMeans(n_clusters=3)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,\n", " n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,\n", " verbose=0)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.fit(X)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 2, 1, 0, 1, 0, 0, 0, 1, 0, 1, 2, 1, 1, 0, 1, 0, 1, 2, 0, 1, 0,\n", " 0, 1, 2, 1, 0, 1, 0, 1, 0, 0, 2, 0, 0, 0], dtype=int32)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"clusters = clf.predict(X)\n", "clusters" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [], "source": [ "colors = clusters.astype(str)\n", "colors[clusters == 0] = 'green'\n", "colors[clusters == 1] = 'red'\n", "colors[clusters == 2] = 'gold'" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "source = plt.ColumnDataSource(\n", " data=dict(\n", " rating=rotten.RATING,\n", " gross=rotten['BOX OFFICE'],\n", " movie=rotten.index,\n", " )\n", ")\n", "\n", "p = plt.figure(tools='reset,save,hover', x_range=[0, 100], title='',\n", " x_axis_label=\"Rotten Tomatoes rating\", y_axis_label=\"Box Office Gross\")\n", "p.scatter(rotten.RATING, rotten['BOX OFFICE'], size=10, source=source, color=colors)\n", "\n", "hover = p.select(dict(type=HoverTool))\n", "\n", "hover.tooltips = [\n", " (\"Movie\", \"@movie\"),\n", " (\"Rating\", \"@rating\"),\n", " (\"Box Office Gross\", \"@gross\"),\n", "]\n", "\n", "plt.show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is similar to the one in the original article. As mentioned before, the Box Office Gross figures come from a different source, so the result is not exactly the same." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## IMDB" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happens if we use IMDB ratings instead of Rotten Tomatoes?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### IMDB: Ratings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We apply a similar procedure to get the data from IMDB, using a basic crawler." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "imdb_sandler_url = 'http://www.imdb.com/name/nm0001191/'" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": true }, "outputs": [], "source": [ "soup = get_soup(imdb_sandler_url)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [], "source": [ "a_tags = soup.select('div#filmo-head-actor + div b a')" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[The Ridiculous 6,\n", " Hotel Transylvania 2,\n", " Pixels,\n", " The Cobbler,\n", " Men, Women & Children]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a_tags[:5]" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [], "source": [ "movies = {}\n", "for a_tag in a_tags:\n", " movie_name = a_tag.text\n", " movie_url = 'http://www.imdb.com' + a_tag['href']\n", " soup = get_soup(movie_url)\n", " rating = soup.select('.star-box-giga-star')\n", " if len(rating) == 1:\n", " movies[movie_name] = float(rating[0].text)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ratings = pd.DataFrame.from_dict(movies, orient='index')\n", "ratings.columns = ['rating']" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>rating</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>The Hot Chick</th><td>5.5</td></tr>\n",
"<tr><th>The Animal</th><td>4.8</td></tr>\n",
"<tr><th>Deuce Bigalow: Male Gigolo</th><td>5.7</td></tr>\n",
"<tr><th>Happy Gilmore</th><td>7.0</td></tr>\n",
"<tr><th>Eight Crazy Nights</th><td>5.4</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " rating\n", "The Hot Chick 5.5\n", "The Animal 4.8\n", "Deuce Bigalow: Male Gigolo 5.7\n", "Happy Gilmore 7.0\n", "Eight Crazy Nights 5.4" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ratings.head()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "53" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(ratings)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ratings.index.name = 'Title'" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ratings.to_csv('imdb-ratings.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### IMDB: Box Office Mojo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "IMDB also provides the Box Office Gross information from [Box Office Mojo](http://www.boxofficemojo.com)." ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": true }, "outputs": [], "source": [ "box_sandler_url = 'http://www.boxofficemojo.com/people/chart/?view=Actor&id=adamsandler.htm'" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": true }, "outputs": [], "source": [ "soup = get_soup(box_sandler_url)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [], "source": [ "box_gross_table = str(soup.select('br + table')[0])" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gross = pd.read_html(box_gross_table, header=0)[0]" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>Date</th><th>Title (click to view)</th><th>Studio</th><th>Lifetime Gross / Theaters</th><th>Opening / Theaters</th><th>Rank</th><th>Unnamed: 6</th><th>Unnamed: 7</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>10/1/14</td><td>Men, Women &amp; Children</td><td>Par.</td><td>$705,908</td><td>608</td><td>$48,024</td><td>17</td><td>30</td></tr>\n",
"<tr><th>1</th><td>5/23/14</td><td>Blended</td><td>WB</td><td>$46,294,610</td><td>3555</td><td>$14,284,031</td><td>3555</td><td>18</td></tr>\n",
"<tr><th>2</th><td>7/12/13</td><td>Grown Ups 2</td><td>Sony</td><td>$133,668,525</td><td>3491</td><td>$41,508,572</td><td>3491</td><td>8</td></tr>\n",
"<tr><th>3</th><td>9/28/12</td><td>Hotel Transylvania(Voice)</td><td>Sony</td><td>$148,313,048</td><td>3375</td><td>$42,522,194</td><td>3349</td><td>5</td></tr>\n",
"<tr><th>4</th><td>6/15/12</td><td>That's My Boy</td><td>Sony</td><td>$36,931,089</td><td>3030</td><td>$13,453,714</td><td>3030</td><td>22</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Date Title (click to view) Studio Lifetime Gross / Theaters \\\n", "0 10/1/14 Men, Women & Children Par. $705,908 \n", "1 5/23/14 Blended WB $46,294,610 \n", "2 7/12/13 Grown Ups 2 Sony $133,668,525 \n", "3 9/28/12 Hotel Transylvania(Voice) Sony $148,313,048 \n", "4 6/15/12 That's My Boy Sony $36,931,089 \n", "\n", " Opening / Theaters Rank Unnamed: 6 Unnamed: 7 \n", "0 608 $48,024 17 30 \n", "1 3555 $14,284,031 3555 18 \n", "2 3491 $41,508,572 3491 8 \n", "3 3375 $42,522,194 3349 5 \n", "4 3030 $13,453,714 3030 22 " ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gross.head()" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gross.drop('Unnamed: 6', axis=1, inplace=True)\n", "gross.drop('Unnamed: 7', axis=1, inplace=True)\n", "gross.drop('Opening / Theaters', axis=1, inplace=True)\n", "gross.drop('Rank', axis=1, inplace=True)\n", "gross.drop('Studio', axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gross.columns = ['Date', 'Title', 'Gross']" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gross.set_index('Title', inplace=True)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gross.Gross = gross.Gross.str.replace(r'[$,]', '').astype(int)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>Date</th><th>Gross</th></tr>\n",
"<tr><th>Title</th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>Men, Women &amp; Children</th><td>10/1/14</td><td>705908</td></tr>\n",
"<tr><th>Blended</th><td>5/23/14</td><td>46294610</td></tr>\n",
"<tr><th>Grown Ups 2</th><td>7/12/13</td><td>133668525</td></tr>\n",
"<tr><th>Hotel Transylvania(Voice)</th><td>9/28/12</td><td>148313048</td></tr>\n",
"<tr><th>That's My Boy</th><td>6/15/12</td><td>36931089</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Date Gross\n", "Title \n", "Men, Women & Children 10/1/14 705908\n", "Blended 5/23/14 46294610\n", "Grown Ups 2 7/12/13 133668525\n", "Hotel Transylvania(Voice) 9/28/12 148313048\n", "That's My Boy 6/15/12 36931089" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gross.head()" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": true }, "outputs": [], "source": [ "gross.to_csv('imdb-gross.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### IMDB: Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load both datasets and plot the same values" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ratings = pd.read_csv('imdb-ratings.csv', index_col=0)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": true }, "outputs": [], "source": [ "gross = pd.read_csv('imdb-gross.csv', index_col=0)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gross.Gross = gross.Gross / 1e6" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "53" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(ratings)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "37" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(gross)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gross.ix['Just Go with It'] = gross.ix['Just Go With It']\n", "gross = gross.drop('Just Go With It')" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gross.ix['I Now 
Pronounce You Chuck & Larry'] = gross.ix['I Now Pronounce You Chuck and Larry']\n", "gross = gross.drop('I Now Pronounce You Chuck and Larry')" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [], "source": [ "imdb = gross.join(ratings)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(37, 33)" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(imdb), len(imdb.dropna())" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": true }, "outputs": [], "source": [ "imdb = imdb.dropna()" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "source = plt.ColumnDataSource(\n", " data=dict(\n", " rating=imdb.rating,\n", " gross=imdb.Gross,\n", " movie=imdb.index,\n", " )\n", ")\n", "\n", "p = plt.figure(tools='reset,save,hover', x_range=[0, 10], title='',\n", " x_axis_label=\"IMDB rating\", y_axis_label=\"Box Office Gross\")\n", "p.scatter(imdb.rating, imdb.Gross, size=10, source=source)\n", "\n", "hover = p.select(dict(type=HoverTool))\n", "hover.tooltips = [\n", " (\"Movie\", \"@movie\"),\n", " (\"Rating\", \"@rating\"),\n", " (\"Box Office Gross\", \"@gross\"),\n", "]\n", "\n", "plt.show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interesting: the result is very different. In this case you can see two clusters: movies that grossed more than $100M and movies that grossed less than $100M." ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X = imdb[['rating', 'Gross']].values" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": true }, "outputs": [], "source": [ "clf = KMeans(n_clusters=2)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,\n", " n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,\n", " verbose=0)" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.fit(X)" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1,\n", " 1, 1, 1, 1, 1, 1, 1, 1, 0, 0], dtype=int32)" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters = clf.predict(X)\n", "clusters" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { 
"collapsed": true }, "outputs": [], "source": [ "colors = clusters.astype(str)\n", "colors[clusters == 0] = 'green'\n", "colors[clusters == 1] = 'red'" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "source = plt.ColumnDataSource(\n", " data=dict(\n", " rating=imdb.rating,\n", " gross=imdb.Gross,\n", " movie=imdb.index,\n", " )\n", ")\n", "\n", "p = plt.figure(tools='reset,save,hover', x_range=[0, 10], title='',\n", " x_axis_label=\"IMDB rating\", y_axis_label=\"Box Office Gross\")\n", "p.scatter(imdb.rating, imdb.Gross, size=10, source=source, color=colors)\n", "\n", "hover = p.select(dict(type=HoverTool))\n", "hover.tooltips = [\n", " (\"Movie\", \"@movie\"),\n", " (\"Rating\", \"@rating\"),\n", " (\"Box Office Gross\", \"@gross\"),\n", "]\n", "\n", "plt.show(p)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }