{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# When did mixtapes become so popular?\n", "\n", "First, load up the data:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# core libraries\n", "import sqlite3\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# plotting libraries\n", "from plotly.offline import init_notebook_mode, iplot\n", "from plotly.graph_objs import Scatter, Figure, Layout\n", "init_notebook_mode()\n", "\n", "# get the data...\n", "con = sqlite3.connect('data.db')\n", "torrents = pd.read_sql_query('SELECT * FROM torrents;', con)\n", "con.close()\n", "\n", "# define mixtape and album subsets\n", "mixtapes = torrents.loc[torrents.releaseType == 'mixtape']\n", "albums = torrents.loc[torrents.releaseType == 'album']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The 'popularity' of a release type is not an objective measure. The best metric we have is the number of *snatches* for each release: the number of times it has been downloaded by users of What.CD. The popularity of a release *type* (mixtape vs. album) can then be measured as the average number of snatches per release.\n", "\n", "The problem is that the distribution of snatches per release is *very* skewed: most releases have fewer than 5 snatches, but several have over 20,000. To make the data a little more normally distributed, I use a log transform."
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# define year range\n", "years = np.arange(1991, 2017)\n", "\n", "# get average and standard error of log snatches per release\n", "# (filled with 0.0 so the columns are float, not integer)\n", "snatches = pd.DataFrame(0.0, index = years, columns = ['Mixtapes','Albums'])\n", "stderror = pd.DataFrame(0.0, index = years, columns = ['Mixtapes','Albums'])\n", "\n", "# compute data for each year\n", "for i in years:\n", "\n", "    # index releases from current year\n", "    year_mixtapes = mixtapes.loc[mixtapes.groupYear == i]\n", "    year_albums = albums.loc[albums.groupYear == i]\n", "\n", "    # take log transform -- add one to prevent log(0) error\n", "    year_mixtapes = np.log(year_mixtapes.totalSnatched + 1)\n", "    year_albums = np.log(year_albums.totalSnatched + 1)\n", "\n", "    # get average log snatches per release\n", "    snatches.loc[i,'Mixtapes'] = year_mixtapes.mean()\n", "    snatches.loc[i,'Albums'] = year_albums.mean()\n", "\n", "    # get standard error of the mean\n", "    stderror.loc[i,'Mixtapes'] = year_mixtapes.std() / np.sqrt(len(year_mixtapes))\n", "    stderror.loc[i,'Albums'] = year_albums.std() / np.sqrt(len(year_albums))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "linespecs = {\n", " 'Mixtapes': dict(color = 'blue', width = 2),\n", " 'Albums': dict(color = 'red', width = 2),\n", " }\n", "\n", "\n", "handles = []\n", "for k in linespecs.keys():\n", " handles.append( Scatter(\n", " x = years, \n", " y = snatches[k],\n", " name = k, \n", " hoverinfo = 'x+name',\n", " line = linespecs[k],\n", " error_y = dict(type='data', \n", " array = stderror[k], \n", " color = linespecs[k]['color'])\n", " )\n", " )\n", "\n", " \n", "layout = Layout(\n", " xaxis = dict(\n", " tickmode = 'auto',\n", " nticks = 20, \n", " tickangle = -60, \n", " showgrid = False\n", " ),\n", " yaxis = dict(title = 'Log Snatches Per Release'),\n", " hovermode = 'closest',\n", " legend = dict(x = 0.55, y = 0.15),\n", ")\n", "\n", "\n", "fh = Figure(data=handles, layout=layout)\n", "iplot(fh)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Results\n", "\n", "Whereas albums are the dominant format throughout the '90s, mixtapes rise in popularity from 1997 to 2001 and are competitive with albums thereafter.\n", "\n", "Another unique pattern here is the sharp decline in snatches per release starting in 2009. Since the popularity score (log snatches) is an *average*, the decline could be due to:\n", "\n", "1. An increase in the number of releases without an increase in the number of snatches.\n", "2. A decline in the number of snatches without a decline in the overall number of releases.\n", "\n", "These possibilities can be straightforwardly evaluated:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# aggregate over mixtapes and albums\n", "releases = torrents.loc[torrents.releaseType.isin(['mixtape', 'album'])]\n", "data = pd.DataFrame(index = years, columns = ['Snatches', 'Releases'])\n", "\n", "# compute data for each year\n", "for i in years:\n", " year_releases = releases.loc[releases.groupYear == i]\n", " data.loc[i,'Snatches'] = np.sum(np.log(year_releases.totalSnatched + 1))\n", " data.loc[i,'Releases'] = year_releases.shape[0] \n", "\n", " \n", "# plot as scatter\n", "labels = [\"'\" + str(i)[2:] for i in years]\n", "sh = Scatter(\n", " x = data.Releases, y = data.Snatches,\n", " mode = 'text', text = labels,\n", " textposition='center',\n", " hoverinfo = 'none',\n", " textfont = dict( family='monospace', size=14, color='red'),\n", " name = None\n", ")\n", "\n", "# a quick reference line\n", "slope = 4.1\n", "lh = Scatter(\n", " x = [min(data.Releases), max(data.Releases)], \n", " y = [slope*min(data.Releases), slope*max(data.Releases)],\n", " mode = 'lines', line = dict(color = 'gray', width = 1),\n", " name = '2009 Extrapolation'\n", ")\n", "\n", " \n", "layout = Layout(\n", " yaxis = dict(title = 'Total Log Snatches'),\n", " xaxis = dict(title = 'Number of Releases'),\n", " hovermode = 'closest',\n", " showlegend=False,\n", " annotations= [dict(\n", " x = 4000, y = 4000 * slope,\n", " text = 'log(Snatches) = 4.1 * Releases',\n", " font = dict(family = 'serif', size = 14),\n", " showarrow = False, bgcolor = 'white'\n", " )]\n", ")\n", "\n", "fh = Figure(data=[lh, sh], layout=layout)\n", "iplot(fh)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clearly, there is an increase in the number of releases starting at about 2008. The increases look nonlinear: whereas there are only incremental gains from 1991 to 2002, we get huge increases between 2008 and 2015. 
\n", "\n", "As a reference, I have plotted a linear model on the figure: *log(Snatches) = 4.1 x Releases*, where the slope (4.1) is the ratio of total log snatches to the number of releases in 2009. This identifies the source of the decline: the snatches have not kept up with the releases.\n", "\n", "I think there are a couple of things going on here. The boring explanation is that snatches are *cumulative*: people have had more time to download and listen to older records, which inflates their snatch counts.\n", "\n", "A more interesting dynamic (interesting to me, anyway) is that it takes time for any record's impact on the genre to be felt. Not only were fewer hip hop records released in the '90s, but those records were also foundational to the genre, so they are likely to be very popular torrents. If we take these data at face value, it appears to take about 7 years (2009-2016) for hip hop records to enjoy this degree of seniority." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }