{ "cells": [ { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "# Data Loading and Cleaning" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "## Status\n", "\n", "It's now 2020.\n", "Last year we didn't do a countdown.\n", "This year, it's not A to Z, but the top 2020 songs of all time.\n", "It just started this morning, \n", "but I bet we can fit that into the same format." ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "## Overview\n", "\n", "Everyone of these countdowns is just a bit different,\n", "and they are a bit of a surprise as they evolve.\n", "So the data collection and clean up is usually evolving under time pressure.\n", "But as we're up to year three, patterns emerge\n", "and I've managed to clean some of it up.\n", "\n", "What follows is the latest start of the art,\n", "with a bit of clutter from the accumulated countdown data." ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "## Setup\n", "\n", "Under the covers it mostly a combo of [requests](http://docs.python-requests.org/en/master/) and [lxml](http://lxml.de/) for web-scraping and [pandas](https://pandas.pydata.org/) for data munging. Before we get started, set up the imports." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "%matplotlib inline\n", "from IPython.display import display, HTML\n", "import requests \n", "from lxml import html\n", "import pandas as pd\n", "import numpy as np\n", "from datetime import date, datetime, time\n", "from os import path, mkdir\n", "import re\n" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "### Set up cache/data directories\n", "\n", "Whenever possible, we'll cache the data.\n", "Partially for speed when rerunning the notebooks during the countdown,\n", "but also to make the notebooks reproducible later,\n", "if the data ends up moving." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "cache_dir = './cache'\n", "playlist_cache_dir = path.join(cache_dir, 'playlists')\n", "a2z_cache_dir = path.join(cache_dir, 'a2z')\n", "a2z70s_cache_dir = path.join(cache_dir, 'a2z70s')\n", "a2z80s_cache_dir = path.join(cache_dir, 'a2z80s')\n", "xpn8080_cache_idir = path.join(cache_dir, 'xpn2020')\n", "musicbrainz_cache_dir = path.join(cache_dir, 'musicbrainz')\n", "data_dir = './data'\n", "\n", "for d in (cache_dir, playlist_cache_dir, a2z_cache_dir, a2z70s_cache_dir,\n", " a2z80s_cache_dir, data_dir, musicbrainz_cache_dir):\n", " if not path.exists(d): mkdir(d)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "## Generic XPN Playlist scraping\n", "\n", "Originally I tended to rely on the one-off countdown pages for playlists.\n", "But eventually I ended up using the generic playlist at [http://xpn.org/playlists/xpn-playlist](http://xpn.org/playlists/xpn-playlist).\n", "I've done this enough, it's past time to turn it into something reusable." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "def fetch_daily_playlist(day, cache_dir=None, verbose = False):\n", " \"\"\"\n", " Fetches the XPN playlist for a given date\n", " \n", " Args:\n", " day (datetime.date) : The day to fetch the playlist for\n", " cache_dir (string) : Path to the cache directory, or None to avoid caching\n", " \n", " Returns:\n", " DataFrame containing Artist and Title as Strings and Airtime as Timestamp\n", " \"\"\"\n", " songs = pd.DataFrame(None, columns=['Artist', 'Title', 'Air Time'])\n", " if cache_dir is not None:\n", " cache_file = path.join(cache_dir, \"%04d-%02d-%02d.csv\" % \\\n", " (day.year, day.month, day.day))\n", " if cache_file is not None and path.exists(cache_file):\n", " songs = pd.read_csv(cache_file, encoding='utf-8')\n", " songs['Air Time'] = pd.to_datetime(songs['Air Time'], errors='coerce')\n", " if verbose: print \"Got %d rows from %s\" % (len(songs), cache_file)\n", " else:\n", " day_s = '%02d-%02d-%04d' % (day.month, day.day, day.year)\n", " page = requests.post('https://xpn.org/playlists/xpn-playlist',\n", " data = {'playlistdate': day_s})\n", " if verbose: print \"fetching %s returned status %s\" % (day_s, page.status_code)\n", " \n", " # play list pages claim to be utf-8, but the rare non-ascii character\n", " # is always latin-1\n", " #tree = html.fromstring(page.content.decode('latin-1'))\n", " tree = html.fromstring(page.content)\n", " tracks = tree.xpath('//h3/a/text()')\n", " # not all rows are tracks, some are membership callouts\n", " # but real tracks start with times and are formatted\n", " # HH:MM [am|pm] Artist - Title\n", " # Note that I've seen titles with embedded dashes,\n", " # but so far no artist names with them. This may be luck.\n", " # Special programs like World Cafe, Echos, ...\n", " # also start with an air time, but don't have useful track info\n", " # but those list the program inside bars\n", " # eg |World Cafe| - \"Wednesday 11-2-2016 Hour 2, Part 7\"\n", " date_regex = re.compile(\"^\\d{2}:\\d{2}\\s\")\n", " line_count= 0\n", " track_count = 0\n", " for track in tracks:\n", " line_count += 1\n", " if date_regex.match(track) and track[9:10] != '|':\n", " (artist, title) = track[9:].split(' - ', 1)\n", " dt = datetime.strptime(track[:8], '%I:%M %p')\n", " air_time = datetime.combine(day, dt.time())\n", " if verbose: print \"adding %s %s %s\" % (artist, title, air_time)\n", " songs = songs.append({'Artist': artist,\n", " 'Title': title,\n", " 'Air Time': air_time},\n", " ignore_index = True)\n", " if verbose: print \"size = %d\" % len(songs)\n", " track_count += 1\n", " \n", " if verbose: print 'read %d line and added %d tracks' % (line_count, track_count)\n", " # Drop any duplicates, which are not uncommon\n", " songs = songs.drop_duplicates()\n", " if cache_file is not None:\n", " songs.to_csv(cache_file, index=False, encoding='utf-8')\n", " if verbose: print 'write %d rows to %s' % (len(songs), cache_file)\n", " \n", " return songs\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "def fetch_playlist(start, end, cache_dir=None):\n", " \"\"\"\n", " Fetch all the playlist entries for a range of time.\n", " \n", " Args:\n", " start (datetime.datetime) : The inclusive start time to fetch entries for\n", " end (datetime.datetime) : The exclusive end time to fetch entries for\n", " cache_dir (string) : path to the cache directory, or None to avoid caching\n", " \n", " Returns:\n", " Dataframe containing Artist and Title as strings, and Airtime as timestamp\n", " \"\"\"\n", " songs = pd.DataFrame(None, columns=['Artist', 'Title', 'Air Time'])\n", " for day in pd.date_range(start.date(), end.date()):\n", " songs = songs.append(fetch_daily_playlist(day, cache_dir), ignore_index=True)\n", " songs = songs[songs['Air Time'] >= start]\n", " songs = songs[songs['Air Time'] < end]\n", " # sometimes the playlist entries are duplicated\n", " song = songs.drop_duplicates()\n", " songs = songs.sort_values(by = 'Air Time')\n", " return songs" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "### Load the playlists\n", "\n", "Since this is the third time, and I've pulled the data prep into one notebook,\n", "to save redundancy, there are a few play lists to load." ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "#### XPN 2020 \n", "\n", "For 2020, XPN is doing a listener curated \"top 2020 songs\" countdown.\n", "It just started this morning, Thursday December 10 at 8:00 am.\n", "\n", "Two twists this year.\n", "First, like 2018, there is a pause.\n", "The countdown stopped at numbrer 101, just after midnight on December 17,\n", "and picked back up for the last 100 at 8am.\n", "Also, from 6am to 8am, there is a mini-list of One Vote Wonders,\n", "songs that only got one vote in the XPN 2020 polling.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "got 1920 rows\n" ] } ], "source": [ "xpn2020 = fetch_playlist(datetime(2020, 12, 10, 8, 0), datetime(2020, 12, 17, 0, 41),\n", " playlist_cache_dir)\n", "print \"got %d rows\" % len(xpn2020)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "got 24 rows\n" ] } ], "source": [ "xpn2020_onsies = fetch_playlist(datetime(2020, 12, 17, 6, 0), datetime(2020, 12, 17, 8, 0),\n", " playlist_cache_dir)\n", "print \"got %d rows\" % len(xpn2020_onsies)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "got 100 rows\n" ] } ], "source": [ "xpn2020pt2 = fetch_playlist(datetime(2020, 12, 17, 8, 0), datetime(2020, 12, 17, 18, 46),\n", " playlist_cache_dir)\n", "print \"got %d rows\" % len(xpn2020pt2)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "Before going further, let's take a quick look at the data." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir Time
194Booker T. & The MG'sTime Is Tight2020-12-10 08:02:00
193AC/DCT.N.T.2020-12-10 08:05:00
192Peter FramptonShow Me the Way2020-12-10 08:11:00
191The DriftersUnder The Boardwalk2020-12-10 08:16:00
190AdeleRumor Has It2020-12-10 08:19:00
" ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(xpn2020.head(5).to_html())" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir Time
2123Foo FightersEverlong2020-12-17 00:08:00
2122Bob Marley & The WailersThree Little Birds2020-12-17 00:12:00
2121Pearl JamAlive2020-12-17 00:17:00
2120Joni MitchellBoth Sides Now2020-12-17 00:23:00
2119Elton JohnFuneral For A Friend/Love Lies Bleeding2020-12-17 00:30:00
" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(xpn2020.tail(5).to_html())" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir Time
123J. J. CaleAfter Midnight2020-12-17 06:03:00
122Shocking BlueVenus2020-12-17 06:05:00
121Ben Folds FiveBrick2020-12-17 06:10:00
120Sarah McLachlanSweet Surrender2020-12-17 06:14:00
119World PartyWhen The Rainbow Comes2020-12-17 06:22:00
" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(xpn2020_onsies.head(5).to_html())" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir Time
104Missy ElliottWork It2020-12-17 07:32:00
103Wu-Tang ClanC.R.E.A.M. (Cash Rules Everything Around Me)2020-12-17 07:36:00
102Robert HazardEscalator Of Life2020-12-17 07:42:00
101Rickie Lee JonesChuck E's In Love2020-12-17 07:45:00
100Taj MahalAin't Gwine Whistle Dixie2020-12-17 07:52:00
" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(xpn2020_onsies.tail(5).to_html())" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir Time
99The DoorsLight My Fire2020-12-17 08:02:00
98Miles DavisAll Blues2020-12-17 08:10:00
97John PrineHello In There2020-12-17 08:24:00
96Joni MitchellRiver2020-12-17 08:28:00
95Jason Isbell & The 400 UnitIf We Were Vampires2020-12-17 08:32:00
" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(xpn2020pt2.head(5).to_html())" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir Time
4Bruce SpringsteenBorn To Run2020-12-17 18:15:00
3The Rolling StonesGimme Shelter2020-12-17 18:20:00
2Bob DylanLike A Rolling Stone2020-12-17 18:26:00
1John LennonImagine2020-12-17 18:33:00
0Bruce SpringsteenThunder Road2020-12-17 18:39:00
" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(xpn2020pt2.tail(5).to_html())" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "#### XPN 80's A to Z playlist\n", "\n", "The 80's playlist started on Wednesday November 28 2018 at 8:00 am.\n", "As of this writing it just ended yesterday.\n", "However, something unusual happened this time:\n", "we took a break from 1am to 6am on 12-09.\n", "So it's easier to treat it as two playlists,\n", "and merge them after we calculate durations.\n", "The alternative is to allow passing in lists of breaks \n", "to duration calculation, if there were more breaks, we might." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "got 3360 rows\n" ] } ], "source": [ "eighties = fetch_playlist(datetime(2018, 11, 28, 8, 0), datetime(2018,12,9, 1, 0),\n", " playlist_cache_dir)\n", "print \"got %d rows\" % len(eighties)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "got 71 rows\n" ] } ], "source": [ "eighties2 = fetch_playlist(datetime(2018, 12, 9, 6, 0), datetime(2018, 12, 9, 11, 49), playlist_cache_dir)\n", "print \"got %d rows\" % len(eighties2)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "Before going an further, let's take a quick look at what we loaded:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir Time
209Warren ZevonA Certain Girl2018-11-28 08:01:00
208U2A Day Without Me2018-11-28 08:04:00
207The CureA Forest2018-11-28 08:07:00
206The WaterboysA Girl Called Johnny2018-11-28 08:13:00
205Romeo VoidA Girl in Trouble (Is a Temporary Thing)2018-11-28 08:18:00
" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(eighties.head(5).to_html())" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir Time
3613Crowded HouseWorld Where You Live2018-12-09 00:38:00
3612Captain SensibleWot2018-12-09 00:41:00
3611EurythmicsWould I Lie To You?2018-12-09 00:44:00
3610Nik KershawWouldn't It Be Good 12\"2018-12-09 00:49:00
3609Black FlagWound Up2018-12-09 00:56:00
" ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(eighties.tail(5).to_html())" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir Time
160Ringo StarrWrack My Brain2018-12-09 06:01:00
159The Fabulous ThunderbirdsWrap It Up2018-12-09 06:03:00
158The PoliceWrapped Around Your Finger2018-12-09 06:06:00
157Bruce SpringsteenWreck On The Highway2018-12-09 06:12:00
156Neil YoungWrecking Ball2018-12-09 06:16:00
" ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(eighties2.head(5).to_html())" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir Time
count336033603360
unique108832883360
topBruce SpringsteenHeartbeat2018-12-08 04:08:00
freq5031
first2018-11-28 08:01:00
last2018-12-09 00:56:00
" ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(eighties.describe(include='all', percentiles=[]).to_html(na_rep=''))" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "#### 80s Leftovers\n", "\n", "In 2016, there was a follow on \"leftovers\" list for parentheticals,\n", "numbers and other random non-alphabeticals.\n", "In 2018, the playlist transitioned right into the leftovers.\n", "But since it doesn't align with any of the comparisons,\n", "I'm going to treat it separately for now." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "got 42 rows\n" ] } ], "source": [ "eighties_leftovers = fetch_playlist(datetime(2018, 12, 9, 11, 50), datetime(2018, 12, 9, 15, 0),\n", " playlist_cache_dir)\n", "print \"got %d rows\" % len(eighties_leftovers)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir Time
89Minutemen#1 Hit Song2018-12-09 11:50:00
88The Blow Monkeys with Curtis Mayfield(Celebrate) The Day After You2018-12-09 11:52:00
87Ministry(Every Day Is) Halloween2018-12-09 12:00:00
86Cutting Crew(I Just) Died in Your Arms2018-12-09 12:07:00
85Joan Armatrading(I Love It When You) Call Me Names2018-12-09 12:12:00
" ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(eighties_leftovers.head(5).to_html())" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir Time
count424242
unique414242
topU2(It's Not Me) Talking2018-12-09 13:51:00
freq211
first2018-12-09 11:50:00
last2018-12-09 14:57:00
" ], "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(eighties_leftovers.describe(include='all', percentiles=[]).to_html(na_rep=''))" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "#### XPN 70's A to Z playlist\n", "\n", "The 70s AtoZ started at 6:00 am on Nov 29 2107,\n", "and ended at 7:00 pm on Dec 12 2017.\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "got 4157 rows\n" ] } ], "source": [ "seventies = fetch_playlist(datetime(2017, 11, 29, 6, 0), datetime(2017, 12, 12, 19, 0), playlist_cache_dir)\n", "\n", "# Cover what looks like a Free at Noon slid into the play list\n", "seventies = seventies[seventies['Title'] != 'The Runner']\n", "\n", "print \"got %d rows\" % len(seventies)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir Time
219Steeleye SpanA Calling-On Song2017-11-29 06:02:00
218Joni MitchellA Case Of You2017-11-29 06:03:00
217Boz ScaggsA Clue2017-11-29 06:07:00
216Todd RundgrenA Dream Goes On Forever2017-11-29 06:13:00
215Lou ReedA Gift2017-11-29 06:16:00
" ], "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(seventies.head(5).to_html())" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir Time
count415741574157
unique102840004154
topDavid BowieShe's Gone2017-12-10 23:17:00
freq6332
first2017-11-29 06:02:00
last2017-12-12 18:54:00
" ], "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(seventies.describe(include='all', percentiles=[]).to_html(na_rep=''))" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "### The Original A-Z Playlist\n", "\n", "The original A-Z playlist ran in 2016 from November 30 at 6:00 am\n", "until December 17 at 1:30 pm." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "got 5691 rows\n" ] } ], "source": [ "originals = fetch_playlist(datetime(2016, 11, 30, 6, 0), datetime(2016, 12, 17, 13, 30), playlist_cache_dir)\n", "\n", "print \"got %d rows\" % len(originals)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir Time
245Jackson 5ABC2016-11-30 06:01:00
244Elvis PresleyA Big Hunk O' Love2016-11-30 06:04:00
243Johnny CashA Boy Named Sue (live)2016-11-30 06:06:00
242Joni MitchellA Case Of You2016-11-30 06:10:00
241Ernie K-DoeA Certain Girl2016-11-30 06:16:00
" ], "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(originals.head(5).to_html())" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir Time
count569156915691
unique165852945687
topThe BeatlesHold On2016-12-10 17:46:00
freq14152
first2016-11-30 06:01:00
last2016-12-17 13:25:00
" ], "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(originals.describe(include='all', percentiles=[]).to_html(na_rep=''))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "hideCode": false, "hidePrompt": false }, "source": [ "## Augmenting the Data" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "### Scraping the Playlist Specific Pages\n", "\n", "For the original, 70s, and 80s A-Z, but not the A-Z leftovers,\n", "the station put up countdown specific pages with play lists\n", "in a slightly different format.\n", "One advantage of using them is that they only include tracks from the countdown,\n", "avoiding any need for time checking the date range.\n", "Another is that for the 70s A-Z,\n", "they added lists by year.\n", "Given the pain it was to search MusicBrainz for songs and\n", "figure out the year, that's worth having.\n" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "#### 70s A-Z Page\n" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "##### Alphabetical Lists\n", "Now that I've moved to the main playlist,\n", "I don't know that the alphabetical lists buy much.\n", "Getting the first letter ourselves is pretty easy.\n", "But since older versions of the code used it,\n", "we'll at least archive them" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "#alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'\n", "#\n", "#seventies_by_letter = pd.DataFrame(None, columns = ['Title', 'Artist', 'Letter'])\n", "#for letter in alphabet:\n", "# cache_file = path.join(a2z70s_cache_dir, '%s.csv' % letter)\n", "# if path.exists(cache_file):\n", "# df = pd.read_csv(cache_file)\n", "# else:\n", "# rows = []\n", "# page = requests.get('http://xpn.org/static/az2017.php?q=%s' % letter)\n", "# tree = html.fromstring(page.content)\n", "# songs = tree.xpath('//li/text()')\n", "# for song in songs:\n", "# rows.append(song.rsplit(' - ', 1) + [letter])\n", "# df = pd.DataFrame(rows, columns=['Title', 'Artist', 'Letter'])\n", "# df.to_csv(cache_file, index=False)\n", "# seventies_by_letter = seventies_by_letter.append(df, ignore_index=True)\n", "#\n", "#print \"got %d songs by letter\" % len(seventies_by_letter)\n", "# was 4202 before commenting out" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "##### Lists by Year" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "#years = map(str, range(1970,1980))\n", "#seventies_by_year = pd.DataFrame(None, columns = ['Title', 'Artist', 'Year'])\n", "#for year in years:\n", "# cache_file = path.join(a2z70s_cache_dir, '%s.csv' % year)\n", "# if path.exists(cache_file):\n", "# df = pd.read_csv(cache_file)\n", "# else:\n", "# rows = []\n", "# page = requests.get('http://xpn.org/static/az2017v2.php?q=%s' % year)\n", "# tree = html.fromstring(page.content)\n", "# songs = tree.xpath('//li/text()')\n", "# for song in songs:\n", "# rows.append(song.rsplit(' - ', 1) + [year])\n", "# df = pd.DataFrame(rows, columns=['Title', 'Artist', 'Year'])\n", "# df.to_csv(cache_file, index=False)\n", "# seventies_by_year = seventies_by_year.append(df, ignore_index=True)\n", "#\n", "#seventies_by_year.to_csv(path.join(data_dir, 'seventies_by_year.csv'))\n", "#print 'got %d songs by year' % len(seventies_by_year)\n", "# was 3699 before comment out" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "### Best and Worst\n", "\n", "Before the A-Z countdowns, \n", "there used to be the \"885\" countdowns.\n", "Each year had a theme. 2014's theme was \"All Time Greatest and Wort Songs\",\n", "where there was the traditional 885 countdown for best\n", "and a side 88 Worst list.\n", "As people comment on what got included in the A-Z countdowns,\n", "which are curated by the station,\n", "it's fun to compare against the best and worst which were based\n", "on listener voting." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "def fetch_best(playlist_url, pagecount):\n", " \"\"\"\n", " Fetch data from the 885 best or 88 worst playlists.\n", " Both use the same format, just different urls and \n", " more or fewer pages.\n", " \n", " Args:\n", " playlist_url (string) : base url for the playlist\n", " pagecount (int) : number of pages to ge\n", " Returns:\n", " DataFrame containing the track data\n", " \"\"\"\n", " \n", " rows = []\n", " \n", " for page_no in range(1, pagecount + 1):\n", " args = {'page': page_no}\n", " page = requests.get(playlist_url, params = args)\n", " tree = html.fromstring(page.content)\n", " tracks = tree.xpath(\"//*/tr[@class='countdown']\")\n", " for track in tracks:\n", " artist = track.xpath('./td[2]/text()')[0]\n", " title = track.xpath('./td[@class=\"song\"]/text()')[0]\n", " rows.append([title, artist])\n", " df = pd.DataFrame(rows, columns = ['Title', 'Artist'])\n", " return df\n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TitleArtist
0Thunder RoadBruce Springsteen
1Like A Rolling StoneBob Dylan
2ImagineJohn Lennon
3A Day In The LifeThe Beatles
4Born To RunBruce Springsteen
" ], "text/plain": [ "" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "best885_file = path.join(data_dir, '885best.csv')\n", "if not path.exists(best885_file):\n", " best885 = fetch_best('http://www.xpn.org/music-artist/885-countdown/2014/885-countdown-2014',18)\n", " best885.to_csv(best885_file, index=False)\n", "else:\n", " best885 = pd.read_csv(best885_file)\n", " \n", "HTML(best885.head(5).to_html())\n", " " ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TitleArtist
0We Built This CityStarship
1Who Let The Dogs OutBaha Men
2Achy Breaky HeartBilly Ray Cyrus
3(You're) Having My BabyPaul Anka
4MacarenaLos Del Rio
" ], "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "worst88_file = path.join(data_dir, '88worst.csv')\n", "if not path.exists(worst88_file):\n", " worst88 = fetch_best('http://www.xpn.org/music-artist/885-countdown/2014/885-countdown-2014-88-worst',2)\n", " worst88.to_csv(worst88_file, index=False)\n", "else:\n", " worst88 = pd.read_csv(worst88_file)\n", " \n", "HTML(worst88.head(5).to_html())" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "### Putting it together\n", "\n", "One might think that we can just join up the data.\n", "However there is a catch.\n", "There are some cases where one or more of the URLs will return legitimate duplicates.\n", "For example two entries for the same song / artist at the same time in the main playlist page.\n", "However there are also valid entries for the same song / artist,\n", "at different times, released in different years.\n", "The catch is that there is no common key between our three sources to join on.\n", "If we dedupe on title and artist we drop real tracks.\n", "But doing a Cartesian product would generate 4 apparent tracks for two tracks.\n", "So we need to build an artificial key." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "#seventies = seventies.sort_values(by='Air Time')\n", "#seventies['Play'] = pd.Series([0 for x in range(len(seventies.index))], index=seventies.index)\n", "#last = None\n", "#count = 0\n", "#for idx, row in seventies.iterrows():\n", "# if last is None or last != (row['Title'], row['Artist']):\n", "# last = (row['Title'], row['Artist'])\n", "# count = 0\n", "# else:\n", "# count += 1\n", "# seventies.loc[idx, 'Play'] = count\n", "#\n", "#seventies_by_letter = seventies_by_letter.drop_duplicates()\n", "#\n", "#seventies_by_year = seventies_by_year.sort_values(by=['Title', 'Artist'])\n", "#seventies_by_year['Play'] = pd.Series([0 for x in range(len(seventies_by_year.index))], index=seventies_by_year.index)\n", "#last = None\n", "#count = 0\n", "#for idx, row in seventies_by_year.iterrows():\n", "# if last is None or last != (row['Title'], row['Artist']):\n", "# last = (row['Title'], row['Artist'])\n", "# count = 0\n", "# else:\n", "# count += 1\n", "# seventies_by_year.loc[idx, 'Play'] = count\n", "#\n", "#seventies = seventies.merge(seventies_by_year, how='left', on=['Artist', 'Title', 'Play'])\n", "#seventies = seventies.merge(seventies_by_letter, how='left', on=['Artist', 'Title'])\n", "#seventies['Year'] = seventies['Year'].fillna(0.0).astype(int)\n", "#seventies['Air Time'] = pd.to_datetime(seventies['Air Time'], errors='coerce')\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "### Extracting Initial Letters\n", "\n", "For the moment, let's ignore the countdown specific pages.\n", "We likely need to resort to MusicBrainz for year of publication data\n", "for the 80s countdown.\n", "\n", "And first letter is pretty easy.\n", "Well nothing is ever 100% easy.\n", "I've seen leading spaces (could be stripped during initial load)\n", "and words that start with leading apostrophes such as *'Til*.\n", "So we need to scan past any non-alphabetics." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "def first_char(s):\n", " for c in s:\n", " if type(c) is str and c.isalpha():\n", " return c.upper()\n", " return s[0]\n", " \n", "originals = originals.join(originals.apply(lambda x: first_char(x[1]), axis=1).to_frame('Letter'))\n", "seventies = seventies.join(seventies.apply(lambda x: first_char(x[1]), axis=1).to_frame('Letter'))\n", "eighties = eighties.join(eighties.apply(lambda x: first_char(x[1]), axis=1).to_frame('Letter'))\n", "eighties2 = eighties2.join(eighties2.apply(lambda x: first_char(x[1]), axis=1).to_frame('Letter'))\n", "xpn2020 = xpn2020.join(xpn2020.apply(lambda x: first_char(x[1]), axis=1).to_frame('Letter'))\n", "xpn2020_onsies = xpn2020_onsies.join(xpn2020_onsies.apply(lambda x: first_char(x[1]), axis=1).to_frame('Letter'))\n", "xpn2020pt2 = xpn2020pt2.join(xpn2020pt2.apply(lambda x: first_char(x[1]), axis=1).to_frame('Letter'))" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "For the non-alphabetic leftovers, we'll do first character instead,\n", "so no skipping past non-alphabetics." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "eighties_leftovers = eighties_leftovers.join(eighties_leftovers.apply(lambda x: x[1][0].upper(), axis=1).to_frame('First Character'))" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "### Extracting First Words" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "from nltk.tokenize import RegexpTokenizer\n", "custom_tokenize = RegexpTokenizer(\"[\\w'\\-]+|[^\\w'\\s\\-]\").tokenize\n", "originals = originals.join(originals.apply(lambda x: custom_tokenize(x[1])[0], axis=1).to_frame('First Word'))\n", "seventies = seventies.join(seventies.apply(lambda x: custom_tokenize(x[1])[0], axis=1).to_frame('First Word'))\n", "eighties = eighties.join(eighties.apply(lambda x: custom_tokenize(x[1])[0], axis=1).to_frame('First Word'))\n", "eighties2 = eighties2.join(eighties2.apply(lambda x: custom_tokenize(x[1])[0], axis=1).to_frame('First Word'))\n", "xpn2020 = xpn2020.join(xpn2020.apply(lambda x: custom_tokenize(x[1])[0], axis=1).to_frame('First Word'))\n", "xpn2020_onsies = xpn2020_onsies.join(xpn2020_onsies.apply(lambda x: custom_tokenize(x[1])[0], axis=1).to_frame('First Word'))\n", "xpn2020pt2 = xpn2020pt2.join(xpn2020pt2.apply(lambda x: custom_tokenize(x[1])[0], axis=1).to_frame('First Word'))" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "### Estimating Durations\n", "\n", "Since we have air times, we can approximate durations by subtracting the air time from the next track's air times. There are a couple catches with this\n", "- we need to pass in an explicit end time for the last track, but that's minor\n", "- we need to add some logic to 'skip over' the free at noons that happen on Fridays form 12 noon till \"like 12:40 or so\" and don't appear in the playlist at all\n", "- the granularity is a bit course, as it is on a one minute basis. We could be off by almost two minutes per song, but it ought to even out.\n", "- there's no clear way to account for \"non-song time\" like station promos, hosts introducing songs, station ids, and so forth. Fortunately, the percentage of time that is really music is pretty high thanks to XPN being listener supported.\n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "def estimate_durations(playlist, end_time=None):\n", " \"\"\"\n", " Estimate the song durations\n", " Args: \n", " playlist (DataFrame): playlist with minimally an 'Air Time' attribute\n", " end_time (datetime): end time of the play list, or None if still going\n", " Return:\n", " modified DataFrame with 'Duration' attribute added.\n", " \"\"\"\n", " \n", " playlist['Duration'] = pd.Series([0 for x in range(len(playlist.index))], index=playlist.index)\n", " previous = None\n", " last_idx = None\n", " for idx, row in playlist.iterrows():\n", " if not previous is None:\n", " if row['Air Time'].date().weekday() == 4 and previous.hour == 11 and row['Air Time'].hour == 12:\n", " # We just fell into a free at noon\n", " playlist.loc[last_idx, 'Duration'] = 60 - previous.minute\n", " else:\n", " # just subtract this start from the previous\n", " delta = row['Air Time'] - previous\n", " playlist.loc[last_idx, 'Duration'] = delta.seconds / 60\n", " previous = row['Air Time']\n", " last_idx = idx\n", "\n", " # fixup the last row\n", " if end_time is not None: \n", " delta = end_time - playlist.loc[last_idx,'Air Time']\n", " playlist.loc[last_idx, 'Duration'] = delta.seconds / 60\n", " \n", " return playlist\n", " " ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "orginals = estimate_durations(originals, datetime(2016, 12, 17, 13, 30))\n", "seventies = estimate_durations(seventies, datetime(2017, 12, 12, 19, 0))\n", "eighties = estimate_durations(eighties, datetime(2018, 12, 9, 1, 0))\n", "eighties2 = estimate_durations(eighties2, datetime(2018, 12, 9, 11, 49))\n", "eighties_leftovers = estimate_durations(eighties_leftovers, datetime(2018, 12, 9, 15, 0))\n", "xpn2020 = estimate_durations(xpn2020,datetime(2020, 12, 17, 0, 41))\n", "xpn2020_onsies = estimate_durations(xpn2020_onsies, datetime(2020, 12, 17, 8, 0))\n", "xpn2020pt2 = estimate_durations(xpn2020pt2, datetime(2020, 12, 17, 18, 46))" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "And now we can concatenate the 80s back into one data frame.\n", "And the same for 2020." ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "eighties = pd.concat([eighties, eighties2])\n", "xpn2020 = pd.concat([xpn2020, xpn2020pt2])" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "And fix up the one remaining implausible duration.\n", "I'm going to assume that no 24 minute cut of \n", "Third World's You're Playing Us Too Close exists.\n", "The longest I can find it 7 minutes.\n", "Odds are we're still missing a couple tracks from where\n", "the playlist feed died about that time." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "eighties.loc[eighties['Title'] == \"You're Playing Us Too Close\", 'Duration' ] = 7" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "### MusicBrainz Data\n", "\n", "[MusicBrainz](https://musicbrainz.org/) is an free online music database,\n", "with an [external XML Web-service](https://wiki.musicbrainz.org/Development/XML_Web_Service/Version_2)\n", "that is supported in [Python](https://www.python.org/)\n", "via the [musicbrainzngs](https://pypi.org/project/musicbrainzngs/) library.\n", "I'd originally used it to get publication year for the 2016 countdown,\n", "but abandoned it in 2017 since the [2017 playlist page](http://xpn.org/music-artist/885-countdown/2017/xpn-a-z) had lists by year.\n", "Since there's no year data on the [2018 playlist](http://www.xpn.org/music-artist/xpn-a-z),\n", "I'm bringing it back.\n", "\n", "There are a couple of potential issues with querying MusicBrainz\n", " \n", " - MusicBrainz has its own rules about how to enter data,\n", " that don't always match those at WXPN,\n", " so sometimes searches fail for data mismatches.\n", " - As a free volunteer based service, there's no guarantee that\n", " the data is there, though their data-set is very complete.\n", " - Finding the *right* recording is an art at best.\n", " My general approach has been to look for the oldest official \n", " release for any recording matching the title and artist.\n", " That *mostly* works.\n", "\n", "So we'll get what we can programmatically via Musicbrainz.\n", "Then we'll look up the outliers manually,\n", "using some combination of Discos and random stuff we find on Google,\n", "and prefill the cache file manually for those.\n", "For some really deep cuts, I've resorted to reading the date\n", "off of pictures of 45s for sale on EBay.\n", "No one answer works, it's ugly, but sometimes so is the recording industry.\n", "\n", "One consequence is that we'll always lag on publication year data\n", "during the running of the playlists.\n" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "def add_musicbrainz_data(playlist, min_year = 1900, cache_file = None):\n", " \"\"\"\n", " Add data from the musicbrainz database. Currently just first year of publication.\n", " The input data frame should contain at least Title and Artist fields\n", " and the resulting dataframe will have a new Year field.\n", " The cache file if used, should have been generated by a previous run of\n", " this function.\n", " Using a cache is strongly encouraged,\n", " as the MusicBrainz search interface is rate limited to one search per second\n", " so this can be very slow for large playlists.\n", " \n", " Args:\n", " playlist (Dataframe) : playlist to update\n", " min_year (int) : miminum year to consider\n", " cache_file (string) : path to cache file\n", " \n", " Returns:\n", " Dataframe containing the augmented playlist\n", " \"\"\"\n", " import musicbrainzngs as mb\n", " mb.set_useragent('xpn-a2z', '0.1','https://github.com/asudell/a2z')\n", " \n", " # keep a list of artists named differently\n", " # at MusicBrainz than XPN, so we can 'fix' them\n", " artist_names = {\n", " \"R. E. M.\": \"REM\",\n", " \"Run-DMC\": \"Run-D.M.C.\",\n", " \"The Ramones\": \"Ramones\"\n", " }\n", " \n", " # load the cache if we have one\n", " if cache_file is not None and path.exists(cache_file):\n", " years = pd.read_csv(cache_file, encoding='utf-8')\n", " years = years.drop_duplicates()\n", " else:\n", " years = pd.DataFrame(None, columns=('Title','Artist', 'Year'))\n", " \n", " augmented = playlist.merge(years, how = 'left')\n", " \n", " # Lookup any unaugmented rows\n", " new_mb_rows = []\n", " for index, row in augmented[augmented['Year'].isnull()].iterrows():\n", " if row['Artist'] in artist_names:\n", " artist = artist_names[row['Artist']]\n", " else:\n", " artist = row['Artist']\n", " result = mb.search_recordings(row['Title'],\n", " artist = artist,\n", " status = 'official',\n", " strict = True,\n", " limit = 25)\n", " rel_year = None\n", " for recording in result['recording-list']:\n", " if recording['release-list']:\n", " for release in recording['release-list']:\n", " if 'date' in release and len(release['date']) > 0:\n", " y = int(release['date'].split('-')[0])\n", " if rel_year is None or rel_year > y:\n", " if y >= min_year:\n", " # assume years before 1900 are typos\n", " rel_year = y\n", " if rel_year is not None:\n", " new_mb_rows.append([row['Title'], row['Artist'], rel_year])\n", " \n", " new_years = pd.DataFrame(new_mb_rows, columns=('Title','Artist', 'Year'))\n", " # if we found new data, resave the cache and rebuild the augmented data\n", " if len(new_years) > 0:\n", " years = years.append(new_years, ignore_index=True)\n", " years = years.drop_duplicates()\n", " if cache_file is not None:\n", " years.to_csv(cache_file, index=False, encoding='utf-8')\n", " augmented = playlist.merge(years, how = 'left')\n", " \n", " return augmented\n", " \n", " \n", " " ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "xpn2020 = add_musicbrainz_data(xpn2020, 1920, path.join(musicbrainz_cache_dir, 'xpn2020_years.csv'))\n", "# save a copy of anything without a year for manual review\n", "xpn2020_missing = xpn2020[xpn2020['Year'].isnull()][['Title', 'Artist']]\n", "xpn2020_missing.to_csv(path.join(musicbrainz_cache_dir, 'xpn2020_need_years.csv'),\n", " index=False, encoding='utf-8')\n", "# need to do this?\n", "mb_cache = pd.read_csv(path.join(musicbrainz_cache_dir, 'xpn2020_years.csv'))\n", "mb_cache.to_csv(path.join(musicbrainz_cache_dir, 'xpn2020_years.csv'), index=False, encoding='utf-8')" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": true }, "outputs": [], "source": [ "xpn2020_onsies = add_musicbrainz_data(xpn2020_onsies, 1920, path.join(musicbrainz_cache_dir, 'xpn2020_onsies_years.csv'))\n", "# save a copy of anything without a year for manual review\n", "xpn2020_onsies_missing = xpn2020_onsies[xpn2020_onsies['Year'].isnull()][['Title', 'Artist']]\n", "xpn2020_onsies_missing.to_csv(path.join(musicbrainz_cache_dir, 'xpn2020_onsies_need_years.csv'),\n", " index=False, encoding='utf-8')\n", "# need to do this?\n", "mb_cache = pd.read_csv(path.join(musicbrainz_cache_dir, 'xpn2020_onsies_years.csv'))\n", "mb_cache.to_csv(path.join(musicbrainz_cache_dir, 'xpn2020_onsies_years.csv'), index=False, encoding='utf-8')" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "eighties = add_musicbrainz_data(eighties, 1980, path.join(musicbrainz_cache_dir, '80s_years.csv'))\n", "# Some recordings get released a lot, toss anything outside the 80s\n", "# and we pick them up for manual review\n", "for index, row in eighties.iterrows():\n", " if row['Year'] < 1980 or row['Year'] > 1989:\n", " eighties.at[index, 'Year'] = np.nan\n", "# and save a copy of anything without a year for manual review\n", "eighties_missing = eighties[eighties['Year'].isnull()][['Title', 'Artist']]\n", "eighties_missing.to_csv(path.join(musicbrainz_cache_dir, '80s_need_years.csv'),\n", " index=False, encoding='utf-8')\n", "# finally, prune any out of range entries from the cache, as\n", "# we will keep growing them and duplicating records on joins\n", "mb_cache = pd.read_csv(path.join(musicbrainz_cache_dir, '80s_years.csv'))\n", "mb_cache = mb_cache[(mb_cache['Year'] >= 1980) & (mb_cache['Year'] <= 1989)]\n", "mb_cache.to_csv(path.join(musicbrainz_cache_dir, '80s_years.csv'), index=False, encoding='utf-8')\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "and the same for the leftovers ..." ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "eighties_leftovers = add_musicbrainz_data(eighties_leftovers, 1980, path.join(musicbrainz_cache_dir, '80s_leftovers_years.csv'))\n", "# Some recordings get released a lot, toss anything outside the 80s\n", "# and we pick them up for manual review\n", "for index, row in eighties_leftovers.iterrows():\n", " if row['Year'] < 1980 or row['Year'] > 1989:\n", " eighties_leftovers.at[index, 'Year'] = np.nan\n", "# and save a copy of anything without a year for manual review\n", "eighties_leftovers_missing = eighties_leftovers[eighties_leftovers['Year'].isnull()][['Title', 'Artist']]\n", "eighties_leftovers_missing.to_csv(path.join(musicbrainz_cache_dir, '80s_leftovers_need_years.csv'),\n", " index=False, encoding='utf-8')\n", "# finally, prune any out of range entries from the cache, as\n", "# we will keep growing them and duplicating records on joins\n", "mb_cache = pd.read_csv(path.join(musicbrainz_cache_dir, '80s_leftovers_years.csv'))\n", "mb_cache = mb_cache[(mb_cache['Year'] >= 1980) & (mb_cache['Year'] <= 1989)]\n", "mb_cache.to_csv(path.join(musicbrainz_cache_dir, '80s_leftovers_years.csv'), index=False, encoding='utf-8')\n" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "Just for reference the manual additions were\n", " - [80s_manual_years.csv](cache/musicbrainz/80s_manual_years.csv)\n", " - [80s_leftovers_manual_years.csv](cache/musicbrainz/80s_leftovers_manual_years.csv)\n", " \n", "The ones I couldn't ever find good years far are left in\n", " - [80s_need_years.csv](cache/musicbrainz/80s_need_years.csv)\n", " - [80s_leftovers_need_years.csv](cache/musicbrainz/80s_leftovers_need_years.csv)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "## Checking the Results" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir TimeLetterFirst WordDurationYear
0Booker T. & The MG'sTime Is Tight2020-12-10 08:02:00TTime31980
1AC/DCT.N.T.2020-12-10 08:05:00TT61975
2Peter FramptonShow Me the Way2020-12-10 08:11:00SShow51975
3The DriftersUnder The Boardwalk2020-12-10 08:16:00UUnder31989
4AdeleRumor Has It2020-12-10 08:19:00RRumor52011
5SmithBaby It's You2020-12-10 08:24:00BBaby41969
6Aretha FranklinCall Me2020-12-10 08:28:00CCall31970
7Marvin Gaye & Kim WestonIt Takes Two2020-12-10 08:31:00IIt31966
8Curtis MayfieldSuperfly2020-12-10 08:34:00SSuperfly71973
9Shawn ColvinI Don't Know Why2020-12-10 08:41:00II41992
" ], "text/plain": [ "" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(xpn2020.head(10).to_html())" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir TimeLetterFirst WordDurationYear
0J. J. CaleAfter Midnight2020-12-17 06:03:00AAfter2NaN
1Shocking BlueVenus2020-12-17 06:05:00VVenus51969
2Ben Folds FiveBrick2020-12-17 06:10:00BBrick41994
3Sarah McLachlanSweet Surrender2020-12-17 06:14:00SSweet81997
4World PartyWhen The Rainbow Comes2020-12-17 06:22:00WWhen51990
5Suzanne VegaMarlene On The Wall2020-12-17 06:27:00MMarlene41985
6Big Joe TurnerShake Rattle And Roll2020-12-17 06:31:00SShake31992
7Wilbert HarrisonLet's Work Together (Parts 1 & 2)2020-12-17 06:34:00LLet's71994
8The PretendersTalk Of The Town2020-12-17 06:41:00TTalk31980
9The ProclaimersI'm Gonna Be (500 Miles)2020-12-17 06:44:00II'm71987
" ], "text/plain": [ "" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(xpn2020_onsies.head(10).to_html())" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir TimeLetterFirst WordDurationYear
0Warren ZevonA Certain Girl2018-11-28 08:01:00AA31980
1U2A Day Without Me2018-11-28 08:04:00AA31980
2The CureA Forest2018-11-28 08:07:00AA61980
3The WaterboysA Girl Called Johnny2018-11-28 08:13:00AA51983
4Romeo VoidA Girl in Trouble (Is a Temporary Thing)2018-11-28 08:18:00AA71984
5The SmithereensA Girl Like You2018-11-28 08:25:00AA41989
6Albert CollinsA Good Fool Is Hard To Find2018-11-28 08:29:00AA41986
7Phil CollinsA Groovy Kind Of Love2018-11-28 08:33:00AA51988
8The WeirdosA Life Of Crime2018-11-28 08:38:00AA151985
9ErasureA Little Respect2018-11-28 08:53:00AA21988
" ], "text/plain": [ "" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(eighties.head(10).to_html())" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir TimeFirst CharacterDurationYear
count4242424242.00000042.000000
unique41424212
topU2(It's Not Me) Talking2018-12-09 13:51:00(
freq21115
first2018-12-09 11:50:00
last2018-12-09 14:57:00
mean4.5238101984.142857
std1.6999012.824729
min2.0000001980.000000
50%4.0000001984.000000
max9.0000001989.000000
" ], "text/plain": [ "" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(eighties_leftovers.describe(include='all', percentiles=[]).to_html(na_rep=''))" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir TimeLetterFirst WordDuration
219Steeleye SpanA Calling-On Song2017-11-29 06:02:00AA1
218Joni MitchellA Case Of You2017-11-29 06:03:00AA4
217Boz ScaggsA Clue2017-11-29 06:07:00AA6
216Todd RundgrenA Dream Goes On Forever2017-11-29 06:13:00AA3
215Lou ReedA Gift2017-11-29 06:16:00AA7
214PocoA Good Feelin' To Know2017-11-29 06:23:00AA3
213Mac DavisA Little Less Conversation2017-11-29 06:26:00AA3
212Neil YoungA Man Needs A Maid2017-11-29 06:29:00AA4
211Lou RawlsA Natural Man2017-11-29 06:33:00AA3
210David BowieA New Career In A New Town2017-11-29 06:36:00AA5
" ], "text/plain": [ "" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(seventies.head(10).to_html())" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistTitleAir TimeLetterFirst WordDuration
245Jackson 5ABC2016-11-30 06:01:00AABC3
244Elvis PresleyA Big Hunk O' Love2016-11-30 06:04:00AA2
243Johnny CashA Boy Named Sue (live)2016-11-30 06:06:00AA4
242Joni MitchellA Case Of You2016-11-30 06:10:00AA6
241Ernie K-DoeA Certain Girl2016-11-30 06:16:00AA3
240Warren ZevonA Certain Girl2016-11-30 06:19:00AA5
239Sheryl CrowA Change2016-11-30 06:24:00AA4
238Sam CookeA Change Is Gonna Come2016-11-30 06:28:00AA3
237The BeatlesA Day In The Life2016-11-30 06:31:00AA5
236Ray BarrettoA Deeper Shade Of Soul2016-11-30 06:36:00AA4
" ], "text/plain": [ "" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(originals.head(10).to_html())" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "### Saving the data" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false, "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "#originals_data_file = path.join(data_dir, 'A2Z.csv')\n", "#originals.to_csv(originals_data_file, index=False)\n", "#seventies_data_file = path.join(data_dir, '70sA2Z.csv')\n", "#seventies.to_csv(seventies_data_file, index=False)\n", "#eighties['Year'] = eighties['Year'].fillna(value=0).astype(int)\n", "#eighties_data_file = path.join(data_dir, '80sA2Z.csv')\n", "#eighties.to_csv(eighties_data_file, index=False, encoding='utf8')\n", "#eighties_leftovers['Year'] = eighties_leftovers['Year'].fillna(value=0).astype(int)\n", "#eighties_leftovers_data_file = path.join(data_dir, '80sLeftovers.csv')\n", "#eighties_leftovers.to_csv(eighties_leftovers_data_file, index=False, encoding='utf8')xpn\n", "xpn2020['Year'] = xpn2020['Year'].fillna(value=0).astype(int)\n", "xpn2020_data_file = path.join(data_dir, 'xpn2020.csv')\n", "xpn2020.to_csv(xpn2020_data_file, index=False, encoding='utf8')\n", "xpn2020_onsies['Year'] = xpn2020_onsies['Year'].fillna(value=0).astype(int)\n", "xpn2020_onsies_data_file = path.join(data_dir, 'xpn2020_onsies.csv')\n", "xpn2020_onsies.to_csv(xpn2020_onsies_data_file, index=False, encoding='utf8')" ] } ], "metadata": { "hide_code_all_hidden": false, "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 2 }