{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [FMA: A Dataset For Music Analysis](https://github.com/mdeff/fma)\n",
    "\n",
    "Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, Xavier Bresson, EPFL LTS2.\n",
    "\n",
    "## Creation\n",
    "\n",
    "From `raw_*.csv`, this notebook generates:\n",
    "* `tracks.csv`: per-track / album / artist metadata.\n",
    "* `genres.csv`: genre hierarchy.\n",
    "* `echonest.csv`: cleaned Echonest features.\n",
    "\n",
    "A companion script, [creation.py](creation.py):\n",
    "1. Query the [API](https://freemusicarchive.org/api) and store metadata in `raw_tracks.csv`, `raw_albums.csv`, `raw_artists.csv` and `raw_genres.csv`.\n",
    "2. Download the audio for each track.\n",
    "3. Trim the audio to 30s clips.\n",
    "4. Normalize the permissions and modification / access times.\n",
    "5. Create the `.zip` archives."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import ast\n",
    "import pickle\n",
    "\n",
    "import IPython.display as ipd\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "import utils\n",
    "import creation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "AUDIO_DIR = os.environ.get('AUDIO_DIR')\n",
    "BASE_DIR = os.path.abspath(os.path.dirname(AUDIO_DIR))\n",
    "FMA_FULL = os.path.join(BASE_DIR, 'fma_full')\n",
    "FMA_LARGE = os.path.join(BASE_DIR, 'fma_large')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1 Retrieve metadata and audio from FMA\n",
    "\n",
    "1. Crawl the tracks, albums and artists metadata through their [API](https://freemusicarchive.org/api).\n",
    "2. Download original `.mp3` by HTTPS for each track id (only if we don't have it already).\n",
    "\n",
    "Todo:\n",
    "* Scrap curators.\n",
    "* Download images (`track_image_file`, `album_image_file`, `artist_image_file`). Beware the quality.\n",
    "* Verify checksum for some random tracks.\n",
    "\n",
    "Dataset update:\n",
    "* To add new tracks: iterate from largest known track id to the most recent only.\n",
    "* To update user data: we need to get all tracks again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ./creation.py metadata\n",
    "# ./creation.py data /path/to/fma/fma_full\n",
    "# ./creation.py clips /path/to/fma\n",
    "\n",
    "#!cat creation.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# converters={'genres': ast.literal_eval}\n",
    "tracks = pd.read_csv('raw_tracks.csv', index_col=0)\n",
    "albums = pd.read_csv('raw_albums.csv', index_col=0)\n",
    "artists = pd.read_csv('raw_artists.csv', index_col=0)\n",
    "genres = pd.read_csv('raw_genres.csv', index_col=0)\n",
    "\n",
    "not_found = pickle.load(open('not_found.pickle', 'rb'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_fs_tids(audio_dir):\n",
    "    tids = []\n",
    "    for _, dirnames, files in os.walk(audio_dir):\n",
    "        if dirnames == []:\n",
    "            tids.extend(int(file[:-4]) for file in files)\n",
    "    return tids\n",
    "\n",
    "audio_tids = get_fs_tids(FMA_FULL)\n",
    "clips_tids = get_fs_tids(FMA_LARGE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tracks: 109727 collected (45594 not found, 155320 max id)\n",
      "albums: 15234 collected (480 not found, 15714 in tracks)\n",
      "artists: 16916 collected (250 not found, 17166 in tracks)\n",
      "genres: 164 collected\n",
      "audio: 110668 collected (180 not found, 1121 not in tracks)\n",
      "clips: 109261 collected (286 not found, 0 not in tracks)\n"
     ]
    }
   ],
   "source": [
    "print('tracks: {} collected ({} not found, {} max id)'.format(\n",
    "    len(tracks), len(not_found['tracks']), tracks.index.max()))\n",
    "print('albums: {} collected ({} not found, {} in tracks)'.format(\n",
    "    len(albums), len(not_found['albums']), len(tracks['album_id'].unique())))\n",
    "print('artists: {} collected ({} not found, {} in tracks)'.format(\n",
    "    len(artists), len(not_found['artists']), len(tracks['artist_id'].unique())))\n",
    "print('genres: {} collected'.format(len(genres)))\n",
    "print('audio: {} collected ({} not found, {} not in tracks)'.format(\n",
    "    len(audio_tids), len(not_found['audio']), len(set(audio_tids).difference(tracks.index))))\n",
    "print('clips: {} collected ({} not found, {} not in tracks)'.format(\n",
    "    len(clips_tids), len(not_found['clips']), len(set(clips_tids).difference(tracks.index))))\n",
    "assert sum(tracks.index.isin(audio_tids)) + len(not_found['audio']) == len(tracks)\n",
    "assert sum(tracks.index.isin(clips_tids)) + len(not_found['clips']) == sum(tracks.index.isin(audio_tids))\n",
    "assert len(clips_tids) + len(not_found['clips']) + len(not_found['audio']) == len(tracks)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>album_id</th>\n",
       "      <th>album_title</th>\n",
       "      <th>album_url</th>\n",
       "      <th>artist_id</th>\n",
       "      <th>artist_name</th>\n",
       "      <th>artist_url</th>\n",
       "      <th>artist_website</th>\n",
       "      <th>license_image_file</th>\n",
       "      <th>license_image_file_large</th>\n",
       "      <th>license_parent_id</th>\n",
       "      <th>...</th>\n",
       "      <th>track_information</th>\n",
       "      <th>track_instrumental</th>\n",
       "      <th>track_interest</th>\n",
       "      <th>track_language_code</th>\n",
       "      <th>track_listens</th>\n",
       "      <th>track_lyricist</th>\n",
       "      <th>track_number</th>\n",
       "      <th>track_publisher</th>\n",
       "      <th>track_title</th>\n",
       "      <th>track_url</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>track_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.0</td>\n",
       "      <td>AWOL - A Way Of Life</td>\n",
       "      <td>http://freemusicarchive.org/music/AWOL/AWOL_-_...</td>\n",
       "      <td>1</td>\n",
       "      <td>AWOL</td>\n",
       "      <td>http://freemusicarchive.org/music/AWOL/</td>\n",
       "      <td>http://www.AzillionRecords.blogspot.com</td>\n",
       "      <td>http://i.creativecommons.org/l/by-nc-sa/3.0/us...</td>\n",
       "      <td>http://fma-files.s3.amazonaws.com/resources/im...</td>\n",
       "      <td>5.0</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>4656</td>\n",
       "      <td>en</td>\n",
       "      <td>1293</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Food</td>\n",
       "      <td>http://freemusicarchive.org/music/AWOL/AWOL_-_...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "      <td>AWOL - A Way Of Life</td>\n",
       "      <td>http://freemusicarchive.org/music/AWOL/AWOL_-_...</td>\n",
       "      <td>1</td>\n",
       "      <td>AWOL</td>\n",
       "      <td>http://freemusicarchive.org/music/AWOL/</td>\n",
       "      <td>http://www.AzillionRecords.blogspot.com</td>\n",
       "      <td>http://i.creativecommons.org/l/by-nc-sa/3.0/us...</td>\n",
       "      <td>http://fma-files.s3.amazonaws.com/resources/im...</td>\n",
       "      <td>5.0</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>1470</td>\n",
       "      <td>en</td>\n",
       "      <td>514</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Electric Ave</td>\n",
       "      <td>http://freemusicarchive.org/music/AWOL/AWOL_-_...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1.0</td>\n",
       "      <td>AWOL - A Way Of Life</td>\n",
       "      <td>http://freemusicarchive.org/music/AWOL/AWOL_-_...</td>\n",
       "      <td>1</td>\n",
       "      <td>AWOL</td>\n",
       "      <td>http://freemusicarchive.org/music/AWOL/</td>\n",
       "      <td>http://www.AzillionRecords.blogspot.com</td>\n",
       "      <td>http://i.creativecommons.org/l/by-nc-sa/3.0/us...</td>\n",
       "      <td>http://fma-files.s3.amazonaws.com/resources/im...</td>\n",
       "      <td>5.0</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>1933</td>\n",
       "      <td>en</td>\n",
       "      <td>1151</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "      <td>This World</td>\n",
       "      <td>http://freemusicarchive.org/music/AWOL/AWOL_-_...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>6.0</td>\n",
       "      <td>Constant Hitmaker</td>\n",
       "      <td>http://freemusicarchive.org/music/Kurt_Vile/Co...</td>\n",
       "      <td>6</td>\n",
       "      <td>Kurt Vile</td>\n",
       "      <td>http://freemusicarchive.org/music/Kurt_Vile/</td>\n",
       "      <td>http://kurtvile.com</td>\n",
       "      <td>http://i.creativecommons.org/l/by-nc-nd/3.0/88...</td>\n",
       "      <td>http://fma-files.s3.amazonaws.com/resources/im...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>54881</td>\n",
       "      <td>en</td>\n",
       "      <td>50135</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Freeway</td>\n",
       "      <td>http://freemusicarchive.org/music/Kurt_Vile/Co...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>4.0</td>\n",
       "      <td>Niris</td>\n",
       "      <td>http://freemusicarchive.org/music/Chris_and_Ni...</td>\n",
       "      <td>4</td>\n",
       "      <td>Nicky Cook</td>\n",
       "      <td>http://freemusicarchive.org/music/Chris_and_Ni...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>http://i.creativecommons.org/l/by-nc-nd/3.0/88...</td>\n",
       "      <td>http://fma-files.s3.amazonaws.com/resources/im...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>978</td>\n",
       "      <td>en</td>\n",
       "      <td>361</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Spiritual Level</td>\n",
       "      <td>http://freemusicarchive.org/music/Chris_and_Ni...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 38 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "          album_id           album_title  \\\n",
       "track_id                                   \n",
       "2              1.0  AWOL - A Way Of Life   \n",
       "3              1.0  AWOL - A Way Of Life   \n",
       "5              1.0  AWOL - A Way Of Life   \n",
       "10             6.0     Constant Hitmaker   \n",
       "20             4.0                 Niris   \n",
       "\n",
       "                                                  album_url  artist_id  \\\n",
       "track_id                                                                 \n",
       "2         http://freemusicarchive.org/music/AWOL/AWOL_-_...          1   \n",
       "3         http://freemusicarchive.org/music/AWOL/AWOL_-_...          1   \n",
       "5         http://freemusicarchive.org/music/AWOL/AWOL_-_...          1   \n",
       "10        http://freemusicarchive.org/music/Kurt_Vile/Co...          6   \n",
       "20        http://freemusicarchive.org/music/Chris_and_Ni...          4   \n",
       "\n",
       "         artist_name                                         artist_url  \\\n",
       "track_id                                                                  \n",
       "2               AWOL            http://freemusicarchive.org/music/AWOL/   \n",
       "3               AWOL            http://freemusicarchive.org/music/AWOL/   \n",
       "5               AWOL            http://freemusicarchive.org/music/AWOL/   \n",
       "10         Kurt Vile       http://freemusicarchive.org/music/Kurt_Vile/   \n",
       "20        Nicky Cook  http://freemusicarchive.org/music/Chris_and_Ni...   \n",
       "\n",
       "                                   artist_website  \\\n",
       "track_id                                            \n",
       "2         http://www.AzillionRecords.blogspot.com   \n",
       "3         http://www.AzillionRecords.blogspot.com   \n",
       "5         http://www.AzillionRecords.blogspot.com   \n",
       "10                            http://kurtvile.com   \n",
       "20                                            NaN   \n",
       "\n",
       "                                         license_image_file  \\\n",
       "track_id                                                      \n",
       "2         http://i.creativecommons.org/l/by-nc-sa/3.0/us...   \n",
       "3         http://i.creativecommons.org/l/by-nc-sa/3.0/us...   \n",
       "5         http://i.creativecommons.org/l/by-nc-sa/3.0/us...   \n",
       "10        http://i.creativecommons.org/l/by-nc-nd/3.0/88...   \n",
       "20        http://i.creativecommons.org/l/by-nc-nd/3.0/88...   \n",
       "\n",
       "                                   license_image_file_large  \\\n",
       "track_id                                                      \n",
       "2         http://fma-files.s3.amazonaws.com/resources/im...   \n",
       "3         http://fma-files.s3.amazonaws.com/resources/im...   \n",
       "5         http://fma-files.s3.amazonaws.com/resources/im...   \n",
       "10        http://fma-files.s3.amazonaws.com/resources/im...   \n",
       "20        http://fma-files.s3.amazonaws.com/resources/im...   \n",
       "\n",
       "          license_parent_id  \\\n",
       "track_id                      \n",
       "2                       5.0   \n",
       "3                       5.0   \n",
       "5                       5.0   \n",
       "10                      NaN   \n",
       "20                      NaN   \n",
       "\n",
       "                                ...                         track_information  \\\n",
       "track_id                        ...                                             \n",
       "2                               ...                                       NaN   \n",
       "3                               ...                                       NaN   \n",
       "5                               ...                                       NaN   \n",
       "10                              ...                                       NaN   \n",
       "20                              ...                                       NaN   \n",
       "\n",
       "         track_instrumental track_interest  track_language_code  \\\n",
       "track_id                                                          \n",
       "2                         0           4656                   en   \n",
       "3                         0           1470                   en   \n",
       "5                         0           1933                   en   \n",
       "10                        0          54881                   en   \n",
       "20                        0            978                   en   \n",
       "\n",
       "          track_listens track_lyricist track_number track_publisher  \\\n",
       "track_id                                                              \n",
       "2                  1293            NaN            3             NaN   \n",
       "3                   514            NaN            4             NaN   \n",
       "5                  1151            NaN            6             NaN   \n",
       "10                50135            NaN            1             NaN   \n",
       "20                  361            NaN            3             NaN   \n",
       "\n",
       "              track_title                                          track_url  \n",
       "track_id                                                                      \n",
       "2                    Food  http://freemusicarchive.org/music/AWOL/AWOL_-_...  \n",
       "3            Electric Ave  http://freemusicarchive.org/music/AWOL/AWOL_-_...  \n",
       "5              This World  http://freemusicarchive.org/music/AWOL/AWOL_-_...  \n",
       "10                Freeway  http://freemusicarchive.org/music/Kurt_Vile/Co...  \n",
       "20        Spiritual Level  http://freemusicarchive.org/music/Chris_and_Ni...  \n",
       "\n",
       "[5 rows x 38 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>album_comments</th>\n",
       "      <th>album_date_created</th>\n",
       "      <th>album_date_released</th>\n",
       "      <th>album_engineer</th>\n",
       "      <th>album_favorites</th>\n",
       "      <th>album_handle</th>\n",
       "      <th>album_image_file</th>\n",
       "      <th>album_images</th>\n",
       "      <th>album_information</th>\n",
       "      <th>album_listens</th>\n",
       "      <th>album_producer</th>\n",
       "      <th>album_title</th>\n",
       "      <th>album_tracks</th>\n",
       "      <th>album_type</th>\n",
       "      <th>album_url</th>\n",
       "      <th>artist_name</th>\n",
       "      <th>artist_url</th>\n",
       "      <th>tags</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>album_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>11/26/2008 01:44:45 AM</td>\n",
       "      <td>1/05/2009</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4</td>\n",
       "      <td>AWOL_-_A_Way_Of_Life</td>\n",
       "      <td>https://freemusicarchive.org/file/images/album...</td>\n",
       "      <td>[{'image_id': '1955', 'image_file': 'https://f...</td>\n",
       "      <td>&lt;p&gt;&lt;/p&gt;</td>\n",
       "      <td>6073</td>\n",
       "      <td>NaN</td>\n",
       "      <td>AWOL - A Way Of Life</td>\n",
       "      <td>7</td>\n",
       "      <td>Album</td>\n",
       "      <td>http://freemusicarchive.org/music/AWOL/AWOL_-_...</td>\n",
       "      <td>AWOL</td>\n",
       "      <td>http://freemusicarchive.org/music/AWOL/</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>0</td>\n",
       "      <td>11/26/2008 01:55:44 AM</td>\n",
       "      <td>1/09/2009</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>On_Opaque_Things</td>\n",
       "      <td>https://freemusicarchive.org/file/images/album...</td>\n",
       "      <td>[{'image_id': '4403', 'image_file': 'https://f...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5613</td>\n",
       "      <td>NaN</td>\n",
       "      <td>On Opaque Things</td>\n",
       "      <td>4</td>\n",
       "      <td>Album</td>\n",
       "      <td>http://freemusicarchive.org/music/Bird_Names/O...</td>\n",
       "      <td>Bird Names</td>\n",
       "      <td>http://freemusicarchive.org/music/Bird_Names/</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1000</th>\n",
       "      <td>0</td>\n",
       "      <td>12/04/2008 09:28:49 AM</td>\n",
       "      <td>10/26/2008</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>DMBQ_Live_at_2008_Record_Fair_on_WFMU_Record_F...</td>\n",
       "      <td>https://freemusicarchive.org/file/images/album...</td>\n",
       "      <td>[{'image_id': '31997', 'image_file': 'https://...</td>\n",
       "      <td>&lt;p&gt;http://blog.wfmu.org/freeform/2008/10/what-...</td>\n",
       "      <td>1092</td>\n",
       "      <td>NaN</td>\n",
       "      <td>DMBQ Live at 2008 Record Fair on WFMU Record F...</td>\n",
       "      <td>4</td>\n",
       "      <td>Live Performance</td>\n",
       "      <td>http://freemusicarchive.org/music/DMBQ/DMBQ_Li...</td>\n",
       "      <td>DMBQ</td>\n",
       "      <td>http://freemusicarchive.org/music/DMBQ/</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10000</th>\n",
       "      <td>0</td>\n",
       "      <td>9/05/2011 04:42:57 PM</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>Live_at_CKUT_on_Montreal_Sessions_1434</td>\n",
       "      <td>https://freemusicarchive.org/file/images/album...</td>\n",
       "      <td>[{'image_id': '12266', 'image_file': 'https://...</td>\n",
       "      <td>&lt;p&gt;Live Set on the Montreal Session February 2...</td>\n",
       "      <td>1001</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Live at CKUT on Montreal Sessions</td>\n",
       "      <td>1</td>\n",
       "      <td>Radio Program</td>\n",
       "      <td>http://freemusicarchive.org/music/Sundrips/Liv...</td>\n",
       "      <td>Sundrips</td>\n",
       "      <td>http://freemusicarchive.org/music/Sundrips/</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10001</th>\n",
       "      <td>0</td>\n",
       "      <td>9/06/2011 12:02:58 AM</td>\n",
       "      <td>1/01/2006</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>Grounds_Dream_Cosmic_Love</td>\n",
       "      <td>https://freemusicarchive.org/file/images/album...</td>\n",
       "      <td>[{'image_id': '24091', 'image_file': 'https://...</td>\n",
       "      <td>&lt;p&gt;Recorded in Linnavuori, Finland, 2005 (with...</td>\n",
       "      <td>504</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Ground's Dream Cosmic Love</td>\n",
       "      <td>1</td>\n",
       "      <td>Album</td>\n",
       "      <td>http://freemusicarchive.org/music/Uton/Grounds...</td>\n",
       "      <td>Uton</td>\n",
       "      <td>http://freemusicarchive.org/music/Uton/</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          album_comments      album_date_created album_date_released  \\\n",
       "album_id                                                               \n",
       "1                      0  11/26/2008 01:44:45 AM           1/05/2009   \n",
       "100                    0  11/26/2008 01:55:44 AM           1/09/2009   \n",
       "1000                   0  12/04/2008 09:28:49 AM          10/26/2008   \n",
       "10000                  0   9/05/2011 04:42:57 PM                 NaN   \n",
       "10001                  0   9/06/2011 12:02:58 AM           1/01/2006   \n",
       "\n",
       "         album_engineer  album_favorites  \\\n",
       "album_id                                   \n",
       "1                   NaN                4   \n",
       "100                 NaN                0   \n",
       "1000                NaN                0   \n",
       "10000               NaN                0   \n",
       "10001               NaN                0   \n",
       "\n",
       "                                               album_handle  \\\n",
       "album_id                                                      \n",
       "1                                      AWOL_-_A_Way_Of_Life   \n",
       "100                                        On_Opaque_Things   \n",
       "1000      DMBQ_Live_at_2008_Record_Fair_on_WFMU_Record_F...   \n",
       "10000                Live_at_CKUT_on_Montreal_Sessions_1434   \n",
       "10001                             Grounds_Dream_Cosmic_Love   \n",
       "\n",
       "                                           album_image_file  \\\n",
       "album_id                                                      \n",
       "1         https://freemusicarchive.org/file/images/album...   \n",
       "100       https://freemusicarchive.org/file/images/album...   \n",
       "1000      https://freemusicarchive.org/file/images/album...   \n",
       "10000     https://freemusicarchive.org/file/images/album...   \n",
       "10001     https://freemusicarchive.org/file/images/album...   \n",
       "\n",
       "                                               album_images  \\\n",
       "album_id                                                      \n",
       "1         [{'image_id': '1955', 'image_file': 'https://f...   \n",
       "100       [{'image_id': '4403', 'image_file': 'https://f...   \n",
       "1000      [{'image_id': '31997', 'image_file': 'https://...   \n",
       "10000     [{'image_id': '12266', 'image_file': 'https://...   \n",
       "10001     [{'image_id': '24091', 'image_file': 'https://...   \n",
       "\n",
       "                                          album_information  album_listens  \\\n",
       "album_id                                                                     \n",
       "1                                                   <p></p>           6073   \n",
       "100                                                     NaN           5613   \n",
       "1000      <p>http://blog.wfmu.org/freeform/2008/10/what-...           1092   \n",
       "10000     <p>Live Set on the Montreal Session February 2...           1001   \n",
       "10001     <p>Recorded in Linnavuori, Finland, 2005 (with...            504   \n",
       "\n",
       "         album_producer                                        album_title  \\\n",
       "album_id                                                                     \n",
       "1                   NaN                               AWOL - A Way Of Life   \n",
       "100                 NaN                                   On Opaque Things   \n",
       "1000                NaN  DMBQ Live at 2008 Record Fair on WFMU Record F...   \n",
       "10000               NaN                  Live at CKUT on Montreal Sessions   \n",
       "10001               NaN                         Ground's Dream Cosmic Love   \n",
       "\n",
       "          album_tracks        album_type  \\\n",
       "album_id                                   \n",
       "1                    7             Album   \n",
       "100                  4             Album   \n",
       "1000                 4  Live Performance   \n",
       "10000                1     Radio Program   \n",
       "10001                1             Album   \n",
       "\n",
       "                                                  album_url artist_name  \\\n",
       "album_id                                                                  \n",
       "1         http://freemusicarchive.org/music/AWOL/AWOL_-_...        AWOL   \n",
       "100       http://freemusicarchive.org/music/Bird_Names/O...  Bird Names   \n",
       "1000      http://freemusicarchive.org/music/DMBQ/DMBQ_Li...        DMBQ   \n",
       "10000     http://freemusicarchive.org/music/Sundrips/Liv...    Sundrips   \n",
       "10001     http://freemusicarchive.org/music/Uton/Grounds...        Uton   \n",
       "\n",
       "                                             artist_url tags  \n",
       "album_id                                                      \n",
       "1               http://freemusicarchive.org/music/AWOL/   []  \n",
       "100       http://freemusicarchive.org/music/Bird_Names/   []  \n",
       "1000            http://freemusicarchive.org/music/DMBQ/   []  \n",
       "10000       http://freemusicarchive.org/music/Sundrips/   []  \n",
       "10001           http://freemusicarchive.org/music/Uton/   []  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>artist_active_year_begin</th>\n",
       "      <th>artist_active_year_end</th>\n",
       "      <th>artist_associated_labels</th>\n",
       "      <th>artist_bio</th>\n",
       "      <th>artist_comments</th>\n",
       "      <th>artist_contact</th>\n",
       "      <th>artist_date_created</th>\n",
       "      <th>artist_donation_url</th>\n",
       "      <th>artist_favorites</th>\n",
       "      <th>artist_flattr_name</th>\n",
       "      <th>...</th>\n",
       "      <th>artist_location</th>\n",
       "      <th>artist_longitude</th>\n",
       "      <th>artist_members</th>\n",
       "      <th>artist_name</th>\n",
       "      <th>artist_paypal_name</th>\n",
       "      <th>artist_related_projects</th>\n",
       "      <th>artist_url</th>\n",
       "      <th>artist_website</th>\n",
       "      <th>artist_wikipedia_page</th>\n",
       "      <th>tags</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>artist_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2006.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>&lt;p&gt;A Way Of Life, A Collective of Hip-Hop from...</td>\n",
       "      <td>0</td>\n",
       "      <td>Brown Bum aka Choke</td>\n",
       "      <td>11/26/2008 01:42:32 AM</td>\n",
       "      <td>NaN</td>\n",
       "      <td>9</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>New Jersey</td>\n",
       "      <td>-74.405661</td>\n",
       "      <td>Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...</td>\n",
       "      <td>AWOL</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The list of past projects is 2 long but every1...</td>\n",
       "      <td>http://freemusicarchive.org/music/AWOL/</td>\n",
       "      <td>http://www.AzillionRecords.blogspot.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>['awol']</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Mistletone, Marriage Records</td>\n",
       "      <td>&lt;p&gt;\"Lucky Dragons\" means any recorded or perfo...</td>\n",
       "      <td>3</td>\n",
       "      <td>Lukey Dargons</td>\n",
       "      <td>11/26/2008 01:43:35 AM</td>\n",
       "      <td>http://glaciersofnice.com/shop/</td>\n",
       "      <td>111</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>Los Angeles, CA</td>\n",
       "      <td>-118.243685</td>\n",
       "      <td>Luke Fischbeck\\nSarah Rara</td>\n",
       "      <td>Lucky Dragons</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>http://freemusicarchive.org/music/Lucky_Dragons/</td>\n",
       "      <td>http://hawksandsparrows.org/</td>\n",
       "      <td>NaN</td>\n",
       "      <td>['lucky dragons']</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>2004.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Captcha Records (HBSP-2X), Pickled Egg (Europe)</td>\n",
       "      <td>&lt;p&gt;&lt;span style=\"font-family:Verdana, Geneva, A...</td>\n",
       "      <td>1</td>\n",
       "      <td>Chris Kalis</td>\n",
       "      <td>11/26/2008 02:05:22 AM</td>\n",
       "      <td>NaN</td>\n",
       "      <td>8</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>Chicago, IL</td>\n",
       "      <td>-87.629798</td>\n",
       "      <td>Chris Kalis, Harry Brenner, Scott McGaughey, B...</td>\n",
       "      <td>Chandeliers</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Killer Whales, \\nMichael Columbia\\nMandate\\nMr...</td>\n",
       "      <td>http://freemusicarchive.org/music/Chandeliers/</td>\n",
       "      <td>thechandeliers.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>['chandeliers']</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1000</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>&lt;p&gt;&lt;a href=\"http://marzipanmarzipan.com\"&gt;Marzi...</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12/04/2008 09:24:35 AM</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12.567380</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Marzipan Marzipan</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>http://freemusicarchive.org/music/Marzipan_Mar...</td>\n",
       "      <td>https://soundcloud.com/marzipanmarzipan</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10000</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>&lt;p&gt;&lt;span style=\"font-family:'Times New Roman',...</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1/21/2011 02:11:31 PM</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Jack Hertz\\nPHOBoS\\nBlue Hell</td>\n",
       "      <td>Jack Hertz, PHOBoS, Blue Hell</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>http://freemusicarchive.org/music/Jack_Hertz_P...</td>\n",
       "      <td>http://surrism.phonoethics.com/surrism-phonoet...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>['jack hertz phobos blue hell']</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 24 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "           artist_active_year_begin  artist_active_year_end  \\\n",
       "artist_id                                                     \n",
       "1                            2006.0                     NaN   \n",
       "10                              NaN                     NaN   \n",
       "100                          2004.0                     NaN   \n",
       "1000                            NaN                     NaN   \n",
       "10000                           NaN                     NaN   \n",
       "\n",
       "                                  artist_associated_labels  \\\n",
       "artist_id                                                    \n",
       "1                                                      NaN   \n",
       "10                            Mistletone, Marriage Records   \n",
       "100        Captcha Records (HBSP-2X), Pickled Egg (Europe)   \n",
       "1000                                                   NaN   \n",
       "10000                                                  NaN   \n",
       "\n",
       "                                                  artist_bio  artist_comments  \\\n",
       "artist_id                                                                       \n",
       "1          <p>A Way Of Life, A Collective of Hip-Hop from...                0   \n",
       "10         <p>\"Lucky Dragons\" means any recorded or perfo...                3   \n",
       "100        <p><span style=\"font-family:Verdana, Geneva, A...                1   \n",
       "1000       <p><a href=\"http://marzipanmarzipan.com\">Marzi...                0   \n",
       "10000      <p><span style=\"font-family:'Times New Roman',...                0   \n",
       "\n",
       "                artist_contact     artist_date_created  \\\n",
       "artist_id                                                \n",
       "1          Brown Bum aka Choke  11/26/2008 01:42:32 AM   \n",
       "10               Lukey Dargons  11/26/2008 01:43:35 AM   \n",
       "100                Chris Kalis  11/26/2008 02:05:22 AM   \n",
       "1000                       NaN  12/04/2008 09:24:35 AM   \n",
       "10000                      NaN   1/21/2011 02:11:31 PM   \n",
       "\n",
       "                       artist_donation_url  artist_favorites  \\\n",
       "artist_id                                                      \n",
       "1                                      NaN                 9   \n",
       "10         http://glaciersofnice.com/shop/               111   \n",
       "100                                    NaN                 8   \n",
       "1000                                   NaN                 0   \n",
       "10000                                  NaN                 1   \n",
       "\n",
       "          artist_flattr_name               ...                 \\\n",
       "artist_id                                  ...                  \n",
       "1                        NaN               ...                  \n",
       "10                       NaN               ...                  \n",
       "100                      NaN               ...                  \n",
       "1000                     NaN               ...                  \n",
       "10000                    NaN               ...                  \n",
       "\n",
       "           artist_location artist_longitude  \\\n",
       "artist_id                                     \n",
       "1               New Jersey       -74.405661   \n",
       "10         Los Angeles, CA      -118.243685   \n",
       "100            Chicago, IL       -87.629798   \n",
       "1000                   NaN        12.567380   \n",
       "10000                  NaN              NaN   \n",
       "\n",
       "                                              artist_members  \\\n",
       "artist_id                                                      \n",
       "1          Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...   \n",
       "10                                Luke Fischbeck\\nSarah Rara   \n",
       "100        Chris Kalis, Harry Brenner, Scott McGaughey, B...   \n",
       "1000                                                     NaN   \n",
       "10000                          Jack Hertz\\nPHOBoS\\nBlue Hell   \n",
       "\n",
       "                             artist_name artist_paypal_name  \\\n",
       "artist_id                                                     \n",
       "1                                   AWOL                NaN   \n",
       "10                         Lucky Dragons                NaN   \n",
       "100                          Chandeliers                NaN   \n",
       "1000                   Marzipan Marzipan                NaN   \n",
       "10000      Jack Hertz, PHOBoS, Blue Hell                NaN   \n",
       "\n",
       "                                     artist_related_projects  \\\n",
       "artist_id                                                      \n",
       "1          The list of past projects is 2 long but every1...   \n",
       "10                                                       NaN   \n",
       "100        Killer Whales, \\nMichael Columbia\\nMandate\\nMr...   \n",
       "1000                                                     NaN   \n",
       "10000                                                    NaN   \n",
       "\n",
       "                                                  artist_url  \\\n",
       "artist_id                                                      \n",
       "1                    http://freemusicarchive.org/music/AWOL/   \n",
       "10          http://freemusicarchive.org/music/Lucky_Dragons/   \n",
       "100           http://freemusicarchive.org/music/Chandeliers/   \n",
       "1000       http://freemusicarchive.org/music/Marzipan_Mar...   \n",
       "10000      http://freemusicarchive.org/music/Jack_Hertz_P...   \n",
       "\n",
       "                                              artist_website  \\\n",
       "artist_id                                                      \n",
       "1                    http://www.AzillionRecords.blogspot.com   \n",
       "10                              http://hawksandsparrows.org/   \n",
       "100                                       thechandeliers.com   \n",
       "1000                 https://soundcloud.com/marzipanmarzipan   \n",
       "10000      http://surrism.phonoethics.com/surrism-phonoet...   \n",
       "\n",
       "          artist_wikipedia_page                             tags  \n",
       "artist_id                                                         \n",
       "1                           NaN                         ['awol']  \n",
       "10                          NaN                ['lucky dragons']  \n",
       "100                         NaN                  ['chandeliers']  \n",
       "1000                        NaN                               []  \n",
       "10000                       NaN  ['jack hertz phobos blue hell']  \n",
       "\n",
       "[5 rows x 24 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>genre_color</th>\n",
       "      <th>genre_handle</th>\n",
       "      <th>genre_parent_id</th>\n",
       "      <th>genre_title</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>genre_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>#006666</td>\n",
       "      <td>Avant-Garde</td>\n",
       "      <td>38.0</td>\n",
       "      <td>Avant-Garde</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>#CC3300</td>\n",
       "      <td>International</td>\n",
       "      <td>NaN</td>\n",
       "      <td>International</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>#000099</td>\n",
       "      <td>Blues</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Blues</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>#990099</td>\n",
       "      <td>Jazz</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Jazz</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>#8A8A65</td>\n",
       "      <td>Classical</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Classical</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         genre_color   genre_handle  genre_parent_id    genre_title\n",
       "genre_id                                                           \n",
       "1            #006666    Avant-Garde             38.0    Avant-Garde\n",
       "2            #CC3300  International              NaN  International\n",
       "3            #000099          Blues              NaN          Blues\n",
       "4            #990099           Jazz              NaN           Jazz\n",
       "5            #8A8A65      Classical              NaN      Classical"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "N = 5\n",
    "ipd.display(tracks.head(N))\n",
    "ipd.display(albums.head(N))\n",
    "ipd.display(artists.head(N))\n",
    "ipd.display(genres.head(N))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2 Format metadata\n",
    "\n",
    "Todo:\n",
    "* Sanitize values, e.g. list of words for tags, valid links in `artist_wikipedia_page`, remove html markup in free-form text.\n",
    "    * Clean tags. E.g. some tags are just artist names.\n",
    "* Fill metadata about encoding: length, number of samples, sample rate, bit rate, channels (mono/stereo), 16bits?.\n",
    "* Update duration from audio\n",
    "    * 2624 is marked as 05:05:50 (18350s) although it is reported as 00:21:15.15 by ffmpeg.\n",
    "    * 112067: 3714s --> 01:59:55.06, 112808: 3718s --> 01:59:59.56\n",
    "    * ffmpeg: Estimating duration from bitrate, this may be inaccurate\n",
    "    * Solution, decode the complete mp3: `ffmpeg -i input.mp3 -f null -`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 null, 109727 non-null\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[]                                                                                                                                                                             85881\n",
       "['interiors c1964', 'existential', 'hardcore-punk', 'pop-punk', 'punk-rock', 'internet boyfriend', 'rew starr', 'public domain', 'creative commons', 'microsong challenge']      314\n",
       "['classwar karaoke']                                                                                                                                                             239\n",
       "['all styles experimental']                                                                                                                                                      215\n",
       "['improvisation', 'not normal music', 'all styles experimental']                                                                                                                 195\n",
       "['era 1']                                                                                                                                                                        176\n",
       "['all styles experimental', 'harsh noise', 'not normal music']                                                                                                                   150\n",
       "['music is a belief', 'chary', 'nishad', 'uju', 'ibiene', 'nazeem', 'deepu', 'maneet', 'azedine', 'mohammad']                                                                    140\n",
       "['new zealand']                                                                                                                                                                  140\n",
       "['improvisation', 'all styles experimental', 'not normal music']                                                                                                                 128\n",
       "Name: tags, dtype: int64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df, column = tracks, 'tags'\n",
    "null = sum(df[column].isnull())\n",
    "print('{} null, {} non-null'.format(null, df.shape[0] - null))\n",
    "df[column].value_counts().head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 Tracks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "drop = [\n",
    "    'license_image_file', 'license_image_file_large', 'license_parent_id', 'license_url',  # keep title only\n",
    "    'track_file', 'track_image_file',  # used to download only\n",
    "    'track_url', 'album_url', 'artist_url',  # only relevant on website\n",
    "    'track_copyright_c', 'track_copyright_p',  # present for ~1000 tracks only\n",
    "    # 'track_composer', 'track_lyricist', 'track_publisher',  # present for ~4000, <1000 and <2000 tracks\n",
    "    'track_disc_number',  # different from 1 for <1000 tracks\n",
    "    'track_explicit', 'track_explicit_notes',  # present for <4000 tracks\n",
    "    'track_instrumental'  # ~6000 tracks have a 1, there is an instrumental genre\n",
    "]\n",
    "tracks.drop(drop, axis=1, inplace=True)\n",
    "tracks.rename(columns={'license_title': 'track_license', 'tags': 'track_tags'}, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "tracks['track_duration'] = tracks['track_duration'].map(creation.convert_duration)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "def convert_datetime(df, column, format=None):\n",
    "    df[column] = pd.to_datetime(df[column], infer_datetime_format=True, format=format)\n",
    "convert_datetime(tracks, 'track_date_created')\n",
    "convert_datetime(tracks, 'track_date_recorded')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "tracks['album_id'].fillna(-1, inplace=True)\n",
    "tracks['track_bit_rate'].fillna(-1, inplace=True)\n",
    "tracks = tracks.astype({'album_id': int, 'track_bit_rate': int})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def convert_genres(genres):\n",
    "    genres = ast.literal_eval(genres)\n",
    "    return [int(genre['genre_id']) for genre in genres]\n",
    "\n",
    "tracks['track_genres'].fillna('[]', inplace=True)\n",
    "tracks['track_genres'] = tracks['track_genres'].map(convert_genres)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['album_id', 'album_title', 'artist_id', 'artist_name', 'artist_website',\n",
       "       'track_license', 'track_tags', 'track_bit_rate', 'track_comments',\n",
       "       'track_composer', 'track_date_created', 'track_date_recorded',\n",
       "       'track_duration', 'track_favorites', 'track_genres',\n",
       "       'track_information', 'track_interest', 'track_language_code',\n",
       "       'track_listens', 'track_lyricist', 'track_number', 'track_publisher',\n",
       "       'track_title'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tracks.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Albums"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "drop = [\n",
    "    'artist_name', 'album_url', 'artist_url',  # in tracks already (though it can be different)\n",
    "    'album_handle',\n",
    "    'album_image_file', 'album_images',  # todo: shall be downloaded\n",
    "    #'album_producer', 'album_engineer',  # present for ~2400 albums only\n",
    "]\n",
    "albums.drop(drop, axis=1, inplace=True)\n",
    "albums.rename(columns={'tags': 'album_tags'}, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "convert_datetime(albums, 'album_date_created')\n",
    "convert_datetime(albums, 'album_date_released')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['album_comments', 'album_date_created', 'album_date_released',\n",
       "       'album_engineer', 'album_favorites', 'album_information',\n",
       "       'album_listens', 'album_producer', 'album_title', 'album_tracks',\n",
       "       'album_type', 'album_tags'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "albums.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 Artists"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "drop = [\n",
    "    'artist_website', 'artist_url',  # in tracks already (though it can be different)\n",
    "    'artist_handle',\n",
    "    'artist_image_file', 'artist_images',  # todo: shall be downloaded\n",
    "    'artist_donation_url', 'artist_paypal_name', 'artist_flattr_name',  # ~1600 & ~400 & ~70, not relevant\n",
    "    'artist_contact',  # ~1500, not very useful data\n",
    "    # 'artist_active_year_begin', 'artist_active_year_end',  # ~1400, ~500 only\n",
    "    # 'artist_associated_labels',  # ~1000\n",
    "    # 'artist_related_projects',  # only ~800, but can be combined with bio\n",
    "]\n",
    "artists.drop(drop, axis=1, inplace=True)\n",
    "artists.rename(columns={'tags': 'artist_tags'}, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "convert_datetime(artists, 'artist_date_created')\n",
    "for column in ['artist_active_year_begin', 'artist_active_year_end']:\n",
    "    artists[column].replace(0.0, np.nan, inplace=True)\n",
    "    convert_datetime(artists, column, format='%Y.0')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['artist_active_year_begin', 'artist_active_year_end',\n",
       "       'artist_associated_labels', 'artist_bio', 'artist_comments',\n",
       "       'artist_date_created', 'artist_favorites', 'artist_latitude',\n",
       "       'artist_location', 'artist_longitude', 'artist_members', 'artist_name',\n",
       "       'artist_related_projects', 'artist_wikipedia_page', 'artist_tags'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "artists.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 Merge DataFrames"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "not_found['albums'].remove(None)\n",
    "not_found['albums'].append(-1)\n",
    "not_found['albums'] = [int(i) for i in not_found['albums']]\n",
    "not_found['artists'] = [int(i) for i in not_found['artists']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3674 tracks without extended album information (1041 tracks without album_id)\n"
     ]
    }
   ],
   "source": [
    "tracks = tracks.merge(albums, left_on='album_id', right_index=True, sort=False, how='left', suffixes=('', '_dup'))\n",
    "\n",
    "n = sum(tracks['album_title_dup'].isnull())\n",
    "print('{} tracks without extended album information ({} tracks without album_id)'.format(\n",
    "    n, sum(tracks['album_id'] == -1)))\n",
    "assert sum(tracks['album_id'].isin(not_found['albums'])) == n\n",
    "assert sum(tracks['album_title'] != tracks['album_title_dup']) == n\n",
    "\n",
    "tracks.drop('album_title_dup', axis=1, inplace=True)\n",
    "assert not any('dup' in col for col in tracks.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Album artist can be different than track artist. Keep track artist.\n",
    "#tracks[tracks['artist_name'] != tracks['artist_name_dup']].select(lambda x: 'artist_name' in x, axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "974 tracks without extended artist information\n"
     ]
    }
   ],
   "source": [
    "tracks = tracks.merge(artists, left_on='artist_id', right_index=True, sort=False, how='left', suffixes=('', '_dup'))\n",
    "\n",
    "n = sum(tracks['artist_name_dup'].isnull())\n",
    "print('{} tracks without extended artist information'.format(n))\n",
    "assert sum(tracks['artist_id'].isin(not_found['artists'])) == n\n",
    "assert sum(tracks['artist_name'] != tracks[('artist_name_dup')]) == n\n",
    "\n",
    "tracks.drop('artist_name_dup', axis=1, inplace=True)\n",
    "assert not any('dup' in col for col in tracks.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "columns = []\n",
    "for name in tracks.columns:\n",
    "    names = name.split('_')\n",
    "    columns.append((names[0], '_'.join(names[1:])))\n",
    "tracks.columns = pd.MultiIndex.from_tuples(columns)\n",
    "assert all(label in ['track', 'album', 'artist'] for label in tracks.columns.get_level_values(0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Todo: fill other columns ?\n",
    "tracks['album', 'tags'].fillna('[]', inplace=True)\n",
    "tracks['artist', 'tags'].fillna('[]', inplace=True)\n",
    "\n",
    "columns = [('album', 'favorites'), ('album', 'comments'), ('album', 'listens'), ('album', 'tracks'),\n",
    "           ('artist', 'favorites'), ('artist', 'comments')]\n",
    "for column in columns:\n",
    "    tracks[column].fillna(-1, inplace=True)\n",
    "columns = {column: int for column in columns}\n",
    "tracks = tracks.astype(columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3 Data cleaning\n",
    "\n",
    "Todo: duplicates (metadata and audio)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 lost, 109727 left\n"
     ]
    }
   ],
   "source": [
    "def keep(index, df):\n",
    "    old = len(df)\n",
    "    df = df.loc[index]\n",
    "    new = len(df)\n",
    "    print('{} lost, {} left'.format(old - new, new))\n",
    "    return df\n",
    "\n",
    "tracks = keep(tracks.index, tracks)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "180 lost, 109547 left\n",
      "286 lost, 109261 left\n"
     ]
    }
   ],
   "source": [
    "# Audio not found or could not be trimmed.\n",
    "tracks = keep(tracks.index.difference(not_found['audio']), tracks)\n",
    "tracks = keep(tracks.index.difference(not_found['clips']), tracks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Errors from the `features.py` script.\n",
    "* IndexError('index 0 is out of bounds for axis 0 with size 0',)\n",
    "    * ffmpeg: Header missing\n",
    "    * ffmpeg: Could not find codec parameters for stream 0 (Audio: mp3, 0 channels, s16p): unspecified frame size. Consider increasing the value for the 'analyzeduration' and 'probesize' options\n",
    "    * tids: 117759\n",
    "* NoBackendError()\n",
    "    * ffmpeg: Format mp3 detected only with low score of 1, misdetection possible!\n",
    "    * tids: 80015, 115235\n",
    "* UserWarning('Trying to estimate tuning from empty frequency set.',)\n",
    "    * librosa error\n",
    "    * tids: 1440, 26436, 38903, 57603, 62095, 62954, 62956, 62957, 62959, 62971, 86079, 96426, 104623, 106719, 109714, 114501, 114528, 118003, 118004, 127827, 130298, 130296, 131076, 135804, 154923\n",
    "* ParameterError('Filter pass-band lies beyond Nyquist',)\n",
    "    * librosa error\n",
    "    * tids: 152204, 28106, 29166, 29167, 29169, 29168, 29170, 29171, 29172, 29173, 29179, 43903, 56757, 59361, 75461, 92346, 92345, 92347, 92349, 92350, 92351, 92353, 92348, 92352, 92354, 92355, 92356, 92358, 92359, 92361, 92360, 114448, 136486, 144769, 144770, 144771, 144773, 144774, 144775, 144778, 144776, 144777"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "71 lost, 109190 left\n"
     ]
    }
   ],
   "source": [
    "# Feature extraction failed.\n",
    "FAILED = [1440, 26436, 28106, 29166, 29167, 29168, 29169, 29170, 29171, 29172,\n",
    "          29173, 29179, 38903, 43903, 56757, 57603, 59361, 62095, 62954, 62956,\n",
    "          62957, 62959, 62971, 75461, 80015, 86079, 92345, 92346, 92347, 92348,\n",
    "          92349, 92350, 92351, 92352, 92353, 92354, 92355, 92356, 92357, 92358,\n",
    "          92359, 92360, 92361, 96426, 104623, 106719, 109714, 114448, 114501,114528,\n",
    "          115235, 117759, 118003, 118004, 127827, 130296, 130298, 131076, 135804, 136486,\n",
    "          144769, 144770, 144771, 144773, 144774, 144775, 144776, 144777, 144778, 152204,\n",
    "          154923]\n",
    "tracks = keep(tracks.index.difference(FAILED), tracks)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2616 lost, 106574 left\n",
      "114 licenses\n"
     ]
    }
   ],
   "source": [
    "# License forbids redistribution.\n",
    "tracks = keep(tracks['track', 'license'] != 'FMA-Limited: Download Only', tracks)\n",
    "print('{} licenses'.format(len(tracks[('track', 'license')].unique())))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "#sum(tracks['track', 'title'].duplicated())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4 Genres"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "genres.drop(['genre_handle', 'genre_color'], axis=1, inplace=True)\n",
    "genres.rename(columns={'genre_parent_id': 'parent', 'genre_title': 'title'}, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "genres['parent'].fillna(0, inplace=True)\n",
    "genres = genres.astype({'parent': int})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 13 (Easy Listening) has parent 126 which is missing\n",
    "# --> a root genre on the website, although not in the genre menu\n",
    "genres.at[13, 'parent'] = 0\n",
    "\n",
    "# 580 (Abstract Hip-Hop) has parent 1172 which is missing\n",
    "# --> listed as child of Hip-Hop on the website\n",
    "genres.at[580, 'parent'] = 21\n",
    "\n",
    "# 810 (Nu-Jazz) has parent 51 which is missing\n",
    "# --> listed as child of Easy Listening on website\n",
    "genres.at[810, 'parent'] = 13\n",
    "\n",
    "# 763 (Holiday) has parent 763 which is itself\n",
    "# --> listed as child of Sound Effects on website\n",
    "genres.at[763, 'parent'] = 16\n",
    "\n",
    "# Todo: should novelty be under Experimental? It is alone on website."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "34 tracks have genre 806\n"
     ]
    }
   ],
   "source": [
    "# Genre 806 (hiphop) should not exist. Replace it by 21 (Hip-Hop).\n",
    "print('{} tracks have genre 806'.format(\n",
    "    sum(tracks['track', 'genres'].map(lambda genres: 806 in genres))))\n",
    "def change_genre(genres):\n",
    "    return [genre if genre != 806 else 21 for genre in genres]\n",
    "tracks['track', 'genres'] = tracks['track', 'genres'].map(change_genre)\n",
    "genres.drop(806, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_parent(genre, track_all_genres=None):\n",
    "    parent = genres.at[genre, 'parent']\n",
    "    if track_all_genres is not None:\n",
    "        track_all_genres.append(genre)\n",
    "    return genre if parent == 0 else get_parent(parent, track_all_genres)\n",
    "\n",
    "# Get all genres, i.e. all genres encountered when walking from leafs to roots.\n",
    "def get_all_genres(track_genres):\n",
    "    track_all_genres = list()\n",
    "    for genre in track_genres:\n",
    "        get_parent(genre, track_all_genres)\n",
    "    return list(set(track_all_genres))\n",
    "\n",
    "tracks['track', 'genres_all'] = tracks['track', 'genres'].map(get_all_genres)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>parent</th>\n",
       "      <th>title</th>\n",
       "      <th>#tracks</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>genre_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>175</th>\n",
       "      <td>86</td>\n",
       "      <td>Bollywood</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>178</th>\n",
       "      <td>4</td>\n",
       "      <td>Be-Bop</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          parent      title  #tracks\n",
       "genre_id                            \n",
       "175           86  Bollywood        0\n",
       "178            4     Be-Bop        0"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Number of tracks per genre.\n",
    "def count_genres(subset=tracks.index):\n",
    "    count = pd.Series(0, index=genres.index)\n",
    "    for _, track_all_genres in tracks.loc[subset, ('track', 'genres_all')].items():\n",
    "        for genre in track_all_genres:\n",
    "            count[genre] += 1\n",
    "    return count\n",
    "\n",
    "genres['#tracks'] = count_genres()\n",
    "genres[genres['#tracks'] == 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_top_genre(track_genres):\n",
    "    top_genres = set(genres.at[genres.at[genre, 'top_level'], 'title'] for genre in track_genres)\n",
    "    return top_genres.pop() if len(top_genres) == 1 else np.nan\n",
    "\n",
    "# Top-level genre.\n",
    "genres['top_level'] = genres.index.map(get_parent)\n",
    "tracks['track', 'genre_top'] = tracks['track', 'genres'].map(get_top_genre)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>parent</th>\n",
       "      <th>title</th>\n",
       "      <th>#tracks</th>\n",
       "      <th>top_level</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>genre_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>38</td>\n",
       "      <td>Avant-Garde</td>\n",
       "      <td>8693</td>\n",
       "      <td>38</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>International</td>\n",
       "      <td>5271</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>Blues</td>\n",
       "      <td>1752</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>Jazz</td>\n",
       "      <td>4126</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0</td>\n",
       "      <td>Classical</td>\n",
       "      <td>4106</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>38</td>\n",
       "      <td>Novelty</td>\n",
       "      <td>914</td>\n",
       "      <td>38</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>20</td>\n",
       "      <td>Comedy</td>\n",
       "      <td>217</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0</td>\n",
       "      <td>Old-Time / Historic</td>\n",
       "      <td>868</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>0</td>\n",
       "      <td>Country</td>\n",
       "      <td>1987</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>0</td>\n",
       "      <td>Pop</td>\n",
       "      <td>13845</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          parent                title  #tracks  top_level\n",
       "genre_id                                                 \n",
       "1             38          Avant-Garde     8693         38\n",
       "2              0        International     5271          2\n",
       "3              0                Blues     1752          3\n",
       "4              0                 Jazz     4126          4\n",
       "5              0            Classical     4106          5\n",
       "6             38              Novelty      914         38\n",
       "7             20               Comedy      217         20\n",
       "8              0  Old-Time / Historic      868          8\n",
       "9              0              Country     1987          9\n",
       "10             0                  Pop    13845         10"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "genres.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5 Subsets: large, medium, small"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.1 Large\n",
    "\n",
    "Main characteristic: the full set with clips trimmed to a manageable size."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2 Medium\n",
    "\n",
    "Main characteristic: clean metadata (includes 1 top-level genre) and quality audio."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "fma_medium = pd.DataFrame(tracks)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3529 lost, 103045 left\n",
      "598 lost, 102447 left\n",
      "1 lost, 102446 left\n",
      "674 lost, 101772 left\n",
      "65 lost, 101707 left\n"
     ]
    }
   ],
   "source": [
    "# Missing meta-information.\n",
    "\n",
    "# Missing extended album and artist information.\n",
    "fma_medium = keep(~fma_medium['album', 'id'].isin(not_found['albums']), fma_medium)\n",
    "fma_medium = keep(~fma_medium['artist', 'id'].isin(not_found['artists']), fma_medium)\n",
    "\n",
    "# Untitled track or album.\n",
    "fma_medium = keep(~fma_medium['track', 'title'].isnull(), fma_medium)\n",
    "fma_medium = keep(fma_medium['track', 'title'].map(lambda x: 'untitled' in x.lower()) == False, fma_medium)\n",
    "fma_medium = keep(fma_medium['album', 'title'].map(lambda x: 'untitled' in x.lower()) == False, fma_medium)\n",
    "\n",
    "# One tag is often just the artist name. Tags too scarce for tracks and albums.\n",
    "#keep(fma_medium['artist', 'tags'].map(len) >= 2, fma_medium)\n",
    "\n",
    "# Too scarce.\n",
    "#fma_medium = keep(~fma_medium['album', 'information'].isnull(), fma_medium)\n",
    "#fma_medium = keep(~fma_medium['artist', 'bio'].isnull(), fma_medium)\n",
    "#fma_medium = keep(~fma_medium['artist', 'website'].isnull(), fma_medium)\n",
    "#fma_medium = keep(~fma_medium['artist', 'wikipedia_page'].isnull(), fma_medium)\n",
    "\n",
    "# Too scarce.\n",
    "#fma_medium = keep(~fma_medium['artist', 'location'].isnull(), fma_medium)\n",
    "#fma_medium = keep(~fma_medium['artist', 'latitude'].isnull(), fma_medium)\n",
    "#fma_medium = keep(~fma_medium['artist', 'longitude'].isnull(), fma_medium)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1326 lost, 100381 left\n"
     ]
    }
   ],
   "source": [
    "# Technical quality.\n",
    "# Todo: sample rate\n",
    "fma_medium = keep(fma_medium['track', 'bit_rate'] > 100000, fma_medium)\n",
    "\n",
    "# Choosing standard bit rates discards all VBR.\n",
    "#fma_medium = keep(fma_medium['track', 'bit_rate'].isin([320000, 256000, 192000, 160000, 128000]), fma_medium)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4736 lost, 95645 left\n",
      "5399 lost, 90246 left\n",
      "466 lost, 89780 left\n",
      "5353 lost, 84427 left\n"
     ]
    }
   ],
   "source": [
    "fma_medium = keep(fma_medium['track', 'duration'] >= 60, fma_medium)\n",
    "fma_medium = keep(fma_medium['track', 'duration'] <= 600, fma_medium)\n",
    "\n",
    "fma_medium = keep(fma_medium['album', 'tracks'] >= 1, fma_medium)\n",
    "fma_medium = keep(fma_medium['album', 'tracks'] <= 50, fma_medium)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4941 lost, 79486 left\n",
      "1064 lost, 78422 left\n",
      "1769 lost, 76653 left\n"
     ]
    }
   ],
   "source": [
    "# Lower popularity bound.\n",
    "fma_medium = keep(fma_medium['track', 'listens'] >= 100, fma_medium)\n",
    "fma_medium = keep(fma_medium['track', 'interest'] >= 200, fma_medium)\n",
    "fma_medium = keep(fma_medium['album', 'listens'] >= 1000, fma_medium);\n",
    "\n",
    "# Favorites and comments are very scarce.\n",
    "#fma_medium = keep(fma_medium['artist', 'favorites'] >= 1, fma_medium)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "42495 lost, 34158 left\n"
     ]
    }
   ],
   "source": [
    "# Targeted genre classification.\n",
    "fma_medium = keep(~fma_medium['track', 'genre_top'].isnull(), fma_medium);\n",
    "#keep(fma_medium['track', 'genres'].map(len) == 1, fma_medium);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9158 lost, 25000 left\n"
     ]
    }
   ],
   "source": [
    "# Adjust size with popularity measure. Should be of better quality.\n",
    "N_TRACKS = 25000\n",
    "\n",
    "# Observations\n",
    "# * More albums killed than artists --> be sure not to kill diversity\n",
    "# * Favorites and preterites genres differently --> do it per genre?\n",
    "# Normalization\n",
    "# * mean, median, std, max\n",
    "# * tracks per album or artist\n",
    "# Test\n",
    "# * 4/5 of same tracks were selected with various set of measures\n",
    "# * <5% diff with max and mean\n",
    "\n",
    "popularity_measures = [('track', 'listens'), ('track', 'interest')]  # ('album', 'listens')\n",
    "# ('track', 'favorites'), ('track', 'comments'),\n",
    "# ('album', 'favorites'), ('album', 'comments'),\n",
    "# ('artist', 'favorites'), ('artist', 'comments'),\n",
    "\n",
    "normalization = {measure: fma_medium[measure].max() for measure in popularity_measures}\n",
    "def popularity_measure(track):\n",
    "    return sum(track[measure] / normalization[measure] for measure in popularity_measures)\n",
    "fma_medium['popularity_measure'] = fma_medium.apply(popularity_measure, axis=1)\n",
    "fma_medium = keep(fma_medium.sort_values('popularity_measure', ascending=False).index[:N_TRACKS], fma_medium)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>genre_id</th>\n",
       "      <th>parent</th>\n",
       "      <th>#tracks</th>\n",
       "      <th>top_level</th>\n",
       "      <th>#tracks_medium</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>title</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Rock</th>\n",
       "      <td>12</td>\n",
       "      <td>0</td>\n",
       "      <td>32923</td>\n",
       "      <td>12</td>\n",
       "      <td>7103</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Electronic</th>\n",
       "      <td>15</td>\n",
       "      <td>0</td>\n",
       "      <td>34413</td>\n",
       "      <td>15</td>\n",
       "      <td>6314</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Experimental</th>\n",
       "      <td>38</td>\n",
       "      <td>0</td>\n",
       "      <td>38154</td>\n",
       "      <td>38</td>\n",
       "      <td>2251</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Hip-Hop</th>\n",
       "      <td>21</td>\n",
       "      <td>0</td>\n",
       "      <td>8389</td>\n",
       "      <td>21</td>\n",
       "      <td>2201</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Folk</th>\n",
       "      <td>17</td>\n",
       "      <td>0</td>\n",
       "      <td>12706</td>\n",
       "      <td>17</td>\n",
       "      <td>1519</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Instrumental</th>\n",
       "      <td>1235</td>\n",
       "      <td>0</td>\n",
       "      <td>14938</td>\n",
       "      <td>1235</td>\n",
       "      <td>1350</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pop</th>\n",
       "      <td>10</td>\n",
       "      <td>0</td>\n",
       "      <td>13845</td>\n",
       "      <td>10</td>\n",
       "      <td>1186</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>International</th>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>5271</td>\n",
       "      <td>2</td>\n",
       "      <td>1018</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Classical</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>4106</td>\n",
       "      <td>5</td>\n",
       "      <td>619</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Old-Time / Historic</th>\n",
       "      <td>8</td>\n",
       "      <td>0</td>\n",
       "      <td>868</td>\n",
       "      <td>8</td>\n",
       "      <td>510</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Jazz</th>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>4126</td>\n",
       "      <td>4</td>\n",
       "      <td>384</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Country</th>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>1987</td>\n",
       "      <td>9</td>\n",
       "      <td>178</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Soul-RnB</th>\n",
       "      <td>14</td>\n",
       "      <td>0</td>\n",
       "      <td>1499</td>\n",
       "      <td>14</td>\n",
       "      <td>154</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Spoken</th>\n",
       "      <td>20</td>\n",
       "      <td>0</td>\n",
       "      <td>1876</td>\n",
       "      <td>20</td>\n",
       "      <td>118</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Blues</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1752</td>\n",
       "      <td>3</td>\n",
       "      <td>74</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Easy Listening</th>\n",
       "      <td>13</td>\n",
       "      <td>0</td>\n",
       "      <td>730</td>\n",
       "      <td>13</td>\n",
       "      <td>21</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     genre_id  parent  #tracks  top_level  #tracks_medium\n",
       "title                                                                    \n",
       "Rock                       12       0    32923         12            7103\n",
       "Electronic                 15       0    34413         15            6314\n",
       "Experimental               38       0    38154         38            2251\n",
       "Hip-Hop                    21       0     8389         21            2201\n",
       "Folk                       17       0    12706         17            1519\n",
       "Instrumental             1235       0    14938       1235            1350\n",
       "Pop                        10       0    13845         10            1186\n",
       "International               2       0     5271          2            1018\n",
       "Classical                   5       0     4106          5             619\n",
       "Old-Time / Historic         8       0      868          8             510\n",
       "Jazz                        4       0     4126          4             384\n",
       "Country                     9       0     1987          9             178\n",
       "Soul-RnB                   14       0     1499         14             154\n",
       "Spoken                     20       0     1876         20             118\n",
       "Blues                       3       0     1752          3              74\n",
       "Easy Listening             13       0      730         13              21"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tmp = genres[genres['parent'] == 0].reset_index().set_index('title')\n",
    "tmp['#tracks_medium'] = fma_medium['track', 'genre_top'].value_counts()\n",
    "tmp.sort_values('#tracks_medium', ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.3 Small\n",
    "\n",
    "Main characteristic: genre balanced (and echonest features).\n",
    "\n",
    "Choices:\n",
    "* 8 genres with 1000 tracks --> 8,000 tracks\n",
    "* 10 genres with 500 tracks --> 5,000 tracks\n",
    "\n",
    "Todo:\n",
    "* Download more echonest features so that all tracks can have them. Otherwise intersection of tracks with echonest features and one top-level genre is too small."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2058 lost, 22942 left\n"
     ]
    }
   ],
   "source": [
    "N_GENRES = 8\n",
    "N_TRACKS = 1000\n",
    "\n",
    "top_genres = tmp.sort_values('#tracks_medium', ascending=False)[:N_GENRES].index\n",
    "fma_small = pd.DataFrame(fma_medium)\n",
    "fma_small = keep(fma_small['track', 'genre_top'].isin(top_genres), fma_small)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "to_keep = []\n",
    "for genre in top_genres:\n",
    "    subset = fma_small[fma_small['track', 'genre_top'] == genre]\n",
    "    drop = subset.sort_values('popularity_measure').index[:-N_TRACKS]\n",
    "    fma_small.drop(drop, inplace=True)\n",
    "assert len(fma_small) == N_GENRES * N_TRACKS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.4 Subset indication"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "SUBSETS = ('small', 'medium', 'large')\n",
    "tracks['set', 'subset'] = pd.Series().astype('category', categories=SUBSETS, ordered=True)\n",
    "tracks.loc[tracks.index, ('set', 'subset')] = 'large'\n",
    "tracks.loc[fma_medium.index, ('set', 'subset')] = 'medium'\n",
    "tracks.loc[fma_small.index, ('set', 'subset')] = 'small'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.5 Echonest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 lost, 14511 left\n",
      "205 lost, 14306 left\n",
      "239 lost, 14067 left\n",
      "938 lost, 13129 left\n",
      "7848 lost, 5281 left\n",
      "11835 lost, 1294 left\n"
     ]
    }
   ],
   "source": [
    "echonest = pd.read_csv('raw_echonest.csv', index_col=0, header=[0, 1, 2])\n",
    "echonest = keep(~echonest['echonest', 'temporal_features'].isnull().any(axis=1), echonest)\n",
    "echonest = keep(~echonest['echonest', 'audio_features'].isnull().any(axis=1), echonest)\n",
    "echonest = keep(~echonest['echonest', 'social_features'].isnull().any(axis=1), echonest)\n",
    "\n",
    "echonest = keep(echonest.index.isin(tracks.index), echonest);\n",
    "keep(echonest.index.isin(fma_medium.index), echonest);\n",
    "keep(echonest.index.isin(fma_small.index), echonest);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6 Splits: training, validation, test\n",
    "\n",
    "Take into account:\n",
    "* Artists may only appear on one side.\n",
    "* Stratification: ideally, all characteristics (#tracks per artist, duration, sampling rate, information, bio) and targets (genres, tags) should be equally distributed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [],
   "source": [
    "for genre in genres.index:\n",
    "    tracks['genre', genres.at[genre, 'title']] = tracks['track', 'genres_all'].map(lambda genres: genre in genres)\n",
    "\n",
    "SPLITS = ('training', 'test', 'validation')\n",
    "PERCENTAGES = (0.8, 0.1, 0.1)\n",
    "tracks['set', 'split'] = pd.Series().astype('category', categories=SPLITS)\n",
    "\n",
    "for subset in SUBSETS:\n",
    "\n",
    "    tracks_subset = tracks['set', 'subset'] <= subset\n",
    "\n",
    "    # Consider only top-level genres for small and medium.\n",
    "    genre_list = list(tracks.loc[tracks_subset, ('track', 'genre_top')].unique())\n",
    "    if subset == 'large':\n",
    "        genre_list = list(genres['title']) \n",
    "\n",
    "    while True:\n",
    "        if len(genre_list) == 0:\n",
    "            break\n",
    "\n",
    "        # Choose most constrained genre, i.e. genre with the least unassigned artists.\n",
    "        tracks_unsplit = tracks['set', 'split'].isnull()\n",
    "        count = tracks[tracks_subset & tracks_unsplit].set_index(('artist', 'id'), append=True)['genre']\n",
    "        count = count.groupby(level=1).sum().astype(np.bool).sum()\n",
    "        genre = np.argmin(count[genre_list])\n",
    "        genre_list.remove(genre)\n",
    "        \n",
    "        # Given genre, select artists.\n",
    "        tracks_genre = tracks['genre', genre] == 1\n",
    "        artists = tracks.loc[tracks_genre & tracks_subset & tracks_unsplit, ('artist', 'id')].value_counts()\n",
    "        #print('-->', genre, len(artists))\n",
    "\n",
    "        current = {split: np.sum(tracks_genre & tracks_subset & (tracks['set', 'split'] == split)) for split in SPLITS}\n",
    "\n",
    "        # Assign artists with most tracks first.\n",
    "        for artist, count in artists.items():\n",
    "            choice = np.argmin([current[split] / percentage for split, percentage in zip(SPLITS, PERCENTAGES)])\n",
    "            current[SPLITS[choice]] += count\n",
    "            #assert tracks.loc[tracks['artist', 'id'] == artist, ('set', 'split')].isnull().all()\n",
    "            tracks.loc[tracks['artist', 'id'] == artist, ('set', 'split')] = SPLITS[choice]\n",
    "\n",
    "# Tracks without genre can only serve as unlabeled data for training, e.g. for semi-supervised algorithms.\n",
    "no_genres = tracks['track', 'genres_all'].map(lambda genres: len(genres) == 0)\n",
    "no_split = tracks['set', 'split'].isnull()\n",
    "assert not (no_split & ~no_genres).any()\n",
    "tracks.loc[no_split, ('set', 'split')] = 'training'\n",
    "\n",
    "# Not needed any more.\n",
    "tracks.drop('genre', axis=1, level=0, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7 Store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "for dataset in 'tracks', 'genres', 'echonest':\n",
    "    eval(dataset).sort_index(axis=0, inplace=True)\n",
    "    eval(dataset).sort_index(axis=1, inplace=True)\n",
    "    params = dict(float_format='%.10f') if dataset == 'echonest' else dict()\n",
    "    eval(dataset).to_csv(dataset + '.csv', **params)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ./creation.py normalize /path/to/fma\n",
    "# ./creation.py zips /path/to/fma"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8 Description"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "album   comments                      int64\n",
       "        date_created         datetime64[ns]\n",
       "        date_released        datetime64[ns]\n",
       "        engineer                     object\n",
       "        favorites                     int64\n",
       "        id                            int64\n",
       "        information                category\n",
       "        listens                       int64\n",
       "        producer                     object\n",
       "        tags                         object\n",
       "        title                        object\n",
       "        tracks                        int64\n",
       "        type                       category\n",
       "artist  active_year_begin    datetime64[ns]\n",
       "        active_year_end      datetime64[ns]\n",
       "        associated_labels            object\n",
       "        bio                        category\n",
       "        comments                      int64\n",
       "        date_created         datetime64[ns]\n",
       "        favorites                     int64\n",
       "        id                            int64\n",
       "        latitude                    float64\n",
       "        location                     object\n",
       "        longitude                   float64\n",
       "        members                      object\n",
       "        name                         object\n",
       "        related_projects             object\n",
       "        tags                         object\n",
       "        website                      object\n",
       "        wikipedia_page               object\n",
       "set     split                        object\n",
       "        subset                     category\n",
       "track   bit_rate                      int64\n",
       "        comments                      int64\n",
       "        composer                     object\n",
       "        date_created         datetime64[ns]\n",
       "        date_recorded        datetime64[ns]\n",
       "        duration                      int64\n",
       "        favorites                     int64\n",
       "        genre_top                  category\n",
       "        genres                       object\n",
       "        genres_all                   object\n",
       "        information                  object\n",
       "        interest                      int64\n",
       "        language_code                object\n",
       "        license                    category\n",
       "        listens                       int64\n",
       "        lyricist                     object\n",
       "        number                        int64\n",
       "        publisher                    object\n",
       "        tags                         object\n",
       "        title                        object\n",
       "dtype: object"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tracks = utils.load('tracks.csv')\n",
    "tracks.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>bit_rate</th>\n",
       "      <th>comments</th>\n",
       "      <th>composer</th>\n",
       "      <th>date_created</th>\n",
       "      <th>date_recorded</th>\n",
       "      <th>duration</th>\n",
       "      <th>favorites</th>\n",
       "      <th>genre_top</th>\n",
       "      <th>genres</th>\n",
       "      <th>genres_all</th>\n",
       "      <th>information</th>\n",
       "      <th>interest</th>\n",
       "      <th>language_code</th>\n",
       "      <th>license</th>\n",
       "      <th>listens</th>\n",
       "      <th>lyricist</th>\n",
       "      <th>number</th>\n",
       "      <th>publisher</th>\n",
       "      <th>tags</th>\n",
       "      <th>title</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>track_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>256000</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2008-11-26 01:48:12</td>\n",
       "      <td>2008-11-26</td>\n",
       "      <td>168</td>\n",
       "      <td>2</td>\n",
       "      <td>Hip-Hop</td>\n",
       "      <td>[21]</td>\n",
       "      <td>[21]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4656</td>\n",
       "      <td>en</td>\n",
       "      <td>Attribution-NonCommercial-ShareAlike 3.0 Inter...</td>\n",
       "      <td>1293</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>Food</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>256000</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2008-11-26 01:48:14</td>\n",
       "      <td>2008-11-26</td>\n",
       "      <td>237</td>\n",
       "      <td>1</td>\n",
       "      <td>Hip-Hop</td>\n",
       "      <td>[21]</td>\n",
       "      <td>[21]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1470</td>\n",
       "      <td>en</td>\n",
       "      <td>Attribution-NonCommercial-ShareAlike 3.0 Inter...</td>\n",
       "      <td>514</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>Electric Ave</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>256000</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2008-11-26 01:48:20</td>\n",
       "      <td>2008-11-26</td>\n",
       "      <td>206</td>\n",
       "      <td>6</td>\n",
       "      <td>Hip-Hop</td>\n",
       "      <td>[21]</td>\n",
       "      <td>[21]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1933</td>\n",
       "      <td>en</td>\n",
       "      <td>Attribution-NonCommercial-ShareAlike 3.0 Inter...</td>\n",
       "      <td>1151</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>This World</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>192000</td>\n",
       "      <td>0</td>\n",
       "      <td>Kurt Vile</td>\n",
       "      <td>2008-11-25 17:49:06</td>\n",
       "      <td>2008-11-26</td>\n",
       "      <td>161</td>\n",
       "      <td>178</td>\n",
       "      <td>Pop</td>\n",
       "      <td>[10]</td>\n",
       "      <td>[10]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>54881</td>\n",
       "      <td>en</td>\n",
       "      <td>Attribution-NonCommercial-NoDerivatives (aka M...</td>\n",
       "      <td>50135</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>Freeway</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>256000</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2008-11-26 01:48:56</td>\n",
       "      <td>2008-01-01</td>\n",
       "      <td>311</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[76, 103]</td>\n",
       "      <td>[17, 10, 76, 103]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>978</td>\n",
       "      <td>en</td>\n",
       "      <td>Attribution-NonCommercial-NoDerivatives (aka M...</td>\n",
       "      <td>361</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>Spiritual Level</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          bit_rate  comments   composer        date_created date_recorded  \\\n",
       "track_id                                                                    \n",
       "2           256000         0        NaN 2008-11-26 01:48:12    2008-11-26   \n",
       "3           256000         0        NaN 2008-11-26 01:48:14    2008-11-26   \n",
       "5           256000         0        NaN 2008-11-26 01:48:20    2008-11-26   \n",
       "10          192000         0  Kurt Vile 2008-11-25 17:49:06    2008-11-26   \n",
       "20          256000         0        NaN 2008-11-26 01:48:56    2008-01-01   \n",
       "\n",
       "          duration  favorites genre_top     genres         genres_all  \\\n",
       "track_id                                                                \n",
       "2              168          2   Hip-Hop       [21]               [21]   \n",
       "3              237          1   Hip-Hop       [21]               [21]   \n",
       "5              206          6   Hip-Hop       [21]               [21]   \n",
       "10             161        178       Pop       [10]               [10]   \n",
       "20             311          0       NaN  [76, 103]  [17, 10, 76, 103]   \n",
       "\n",
       "         information  interest language_code  \\\n",
       "track_id                                       \n",
       "2                NaN      4656            en   \n",
       "3                NaN      1470            en   \n",
       "5                NaN      1933            en   \n",
       "10               NaN     54881            en   \n",
       "20               NaN       978            en   \n",
       "\n",
       "                                                    license  listens lyricist  \\\n",
       "track_id                                                                        \n",
       "2         Attribution-NonCommercial-ShareAlike 3.0 Inter...     1293      NaN   \n",
       "3         Attribution-NonCommercial-ShareAlike 3.0 Inter...      514      NaN   \n",
       "5         Attribution-NonCommercial-ShareAlike 3.0 Inter...     1151      NaN   \n",
       "10        Attribution-NonCommercial-NoDerivatives (aka M...    50135      NaN   \n",
       "20        Attribution-NonCommercial-NoDerivatives (aka M...      361      NaN   \n",
       "\n",
       "          number publisher tags            title  \n",
       "track_id                                          \n",
       "2              3       NaN   []             Food  \n",
       "3              4       NaN   []     Electric Ave  \n",
       "5              6       NaN   []       This World  \n",
       "10             1       NaN   []          Freeway  \n",
       "20             3       NaN   []  Spiritual Level  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>comments</th>\n",
       "      <th>date_created</th>\n",
       "      <th>date_released</th>\n",
       "      <th>engineer</th>\n",
       "      <th>favorites</th>\n",
       "      <th>id</th>\n",
       "      <th>information</th>\n",
       "      <th>listens</th>\n",
       "      <th>producer</th>\n",
       "      <th>tags</th>\n",
       "      <th>title</th>\n",
       "      <th>tracks</th>\n",
       "      <th>type</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>track_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>2008-11-26 01:44:45</td>\n",
       "      <td>2009-01-05</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>&lt;p&gt;&lt;/p&gt;</td>\n",
       "      <td>6073</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>AWOL - A Way Of Life</td>\n",
       "      <td>7</td>\n",
       "      <td>Album</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>2008-11-26 01:44:45</td>\n",
       "      <td>2009-01-05</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>&lt;p&gt;&lt;/p&gt;</td>\n",
       "      <td>6073</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>AWOL - A Way Of Life</td>\n",
       "      <td>7</td>\n",
       "      <td>Album</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0</td>\n",
       "      <td>2008-11-26 01:44:45</td>\n",
       "      <td>2009-01-05</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>&lt;p&gt;&lt;/p&gt;</td>\n",
       "      <td>6073</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>AWOL - A Way Of Life</td>\n",
       "      <td>7</td>\n",
       "      <td>Album</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>0</td>\n",
       "      <td>2008-11-26 01:45:08</td>\n",
       "      <td>2008-02-06</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4</td>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "      <td>47632</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>Constant Hitmaker</td>\n",
       "      <td>2</td>\n",
       "      <td>Album</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>0</td>\n",
       "      <td>2008-11-26 01:45:05</td>\n",
       "      <td>2009-01-06</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>&lt;p&gt; \"spiritual songs\" from Nicky Cook&lt;/p&gt;</td>\n",
       "      <td>2710</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>Niris</td>\n",
       "      <td>13</td>\n",
       "      <td>Album</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          comments        date_created date_released engineer  favorites  id  \\\n",
       "track_id                                                                       \n",
       "2                0 2008-11-26 01:44:45    2009-01-05      NaN          4   1   \n",
       "3                0 2008-11-26 01:44:45    2009-01-05      NaN          4   1   \n",
       "5                0 2008-11-26 01:44:45    2009-01-05      NaN          4   1   \n",
       "10               0 2008-11-26 01:45:08    2008-02-06      NaN          4   6   \n",
       "20               0 2008-11-26 01:45:05    2009-01-06      NaN          2   4   \n",
       "\n",
       "                                        information  listens producer tags  \\\n",
       "track_id                                                                     \n",
       "2                                           <p></p>     6073      NaN   []   \n",
       "3                                           <p></p>     6073      NaN   []   \n",
       "5                                           <p></p>     6073      NaN   []   \n",
       "10                                              NaN    47632      NaN   []   \n",
       "20        <p> \"spiritual songs\" from Nicky Cook</p>     2710      NaN   []   \n",
       "\n",
       "                         title  tracks   type  \n",
       "track_id                                       \n",
       "2         AWOL - A Way Of Life       7  Album  \n",
       "3         AWOL - A Way Of Life       7  Album  \n",
       "5         AWOL - A Way Of Life       7  Album  \n",
       "10           Constant Hitmaker       2  Album  \n",
       "20                       Niris      13  Album  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>active_year_begin</th>\n",
       "      <th>active_year_end</th>\n",
       "      <th>associated_labels</th>\n",
       "      <th>bio</th>\n",
       "      <th>comments</th>\n",
       "      <th>date_created</th>\n",
       "      <th>favorites</th>\n",
       "      <th>id</th>\n",
       "      <th>latitude</th>\n",
       "      <th>location</th>\n",
       "      <th>longitude</th>\n",
       "      <th>members</th>\n",
       "      <th>name</th>\n",
       "      <th>related_projects</th>\n",
       "      <th>tags</th>\n",
       "      <th>website</th>\n",
       "      <th>wikipedia_page</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>track_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2006-01-01</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaN</td>\n",
       "      <td>&lt;p&gt;A Way Of Life, A Collective of Hip-Hop from...</td>\n",
       "      <td>0</td>\n",
       "      <td>2008-11-26 01:42:32</td>\n",
       "      <td>9</td>\n",
       "      <td>1</td>\n",
       "      <td>40.058324</td>\n",
       "      <td>New Jersey</td>\n",
       "      <td>-74.405661</td>\n",
       "      <td>Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...</td>\n",
       "      <td>AWOL</td>\n",
       "      <td>The list of past projects is 2 long but every1...</td>\n",
       "      <td>[awol]</td>\n",
       "      <td>http://www.AzillionRecords.blogspot.com</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2006-01-01</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaN</td>\n",
       "      <td>&lt;p&gt;A Way Of Life, A Collective of Hip-Hop from...</td>\n",
       "      <td>0</td>\n",
       "      <td>2008-11-26 01:42:32</td>\n",
       "      <td>9</td>\n",
       "      <td>1</td>\n",
       "      <td>40.058324</td>\n",
       "      <td>New Jersey</td>\n",
       "      <td>-74.405661</td>\n",
       "      <td>Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...</td>\n",
       "      <td>AWOL</td>\n",
       "      <td>The list of past projects is 2 long but every1...</td>\n",
       "      <td>[awol]</td>\n",
       "      <td>http://www.AzillionRecords.blogspot.com</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2006-01-01</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaN</td>\n",
       "      <td>&lt;p&gt;A Way Of Life, A Collective of Hip-Hop from...</td>\n",
       "      <td>0</td>\n",
       "      <td>2008-11-26 01:42:32</td>\n",
       "      <td>9</td>\n",
       "      <td>1</td>\n",
       "      <td>40.058324</td>\n",
       "      <td>New Jersey</td>\n",
       "      <td>-74.405661</td>\n",
       "      <td>Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...</td>\n",
       "      <td>AWOL</td>\n",
       "      <td>The list of past projects is 2 long but every1...</td>\n",
       "      <td>[awol]</td>\n",
       "      <td>http://www.AzillionRecords.blogspot.com</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>NaT</td>\n",
       "      <td>NaT</td>\n",
       "      <td>Mexican Summer, Richie Records, Woodsist, Skul...</td>\n",
       "      <td>&lt;p&gt;&lt;span style=\"font-family:Verdana, Geneva, A...</td>\n",
       "      <td>3</td>\n",
       "      <td>2008-11-26 01:42:55</td>\n",
       "      <td>74</td>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Kurt Vile, the Violators</td>\n",
       "      <td>Kurt Vile</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[philly, kurt vile]</td>\n",
       "      <td>http://kurtvile.com</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>1990-01-01</td>\n",
       "      <td>2011-01-01</td>\n",
       "      <td>NaN</td>\n",
       "      <td>&lt;p&gt;Songs written by: Nicky Cook&lt;/p&gt;\\n&lt;p&gt;VOCALS...</td>\n",
       "      <td>2</td>\n",
       "      <td>2008-11-26 01:42:52</td>\n",
       "      <td>10</td>\n",
       "      <td>4</td>\n",
       "      <td>51.895927</td>\n",
       "      <td>Colchester England</td>\n",
       "      <td>0.891874</td>\n",
       "      <td>Nicky Cook\\n</td>\n",
       "      <td>Nicky Cook</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[instrumentals, experimental pop, post punk, e...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         active_year_begin active_year_end  \\\n",
       "track_id                                     \n",
       "2               2006-01-01             NaT   \n",
       "3               2006-01-01             NaT   \n",
       "5               2006-01-01             NaT   \n",
       "10                     NaT             NaT   \n",
       "20              1990-01-01      2011-01-01   \n",
       "\n",
       "                                          associated_labels  \\\n",
       "track_id                                                      \n",
       "2                                                       NaN   \n",
       "3                                                       NaN   \n",
       "5                                                       NaN   \n",
       "10        Mexican Summer, Richie Records, Woodsist, Skul...   \n",
       "20                                                      NaN   \n",
       "\n",
       "                                                        bio  comments  \\\n",
       "track_id                                                                \n",
       "2         <p>A Way Of Life, A Collective of Hip-Hop from...         0   \n",
       "3         <p>A Way Of Life, A Collective of Hip-Hop from...         0   \n",
       "5         <p>A Way Of Life, A Collective of Hip-Hop from...         0   \n",
       "10        <p><span style=\"font-family:Verdana, Geneva, A...         3   \n",
       "20        <p>Songs written by: Nicky Cook</p>\\n<p>VOCALS...         2   \n",
       "\n",
       "                date_created  favorites  id   latitude            location  \\\n",
       "track_id                                                                     \n",
       "2        2008-11-26 01:42:32          9   1  40.058324          New Jersey   \n",
       "3        2008-11-26 01:42:32          9   1  40.058324          New Jersey   \n",
       "5        2008-11-26 01:42:32          9   1  40.058324          New Jersey   \n",
       "10       2008-11-26 01:42:55         74   6        NaN                 NaN   \n",
       "20       2008-11-26 01:42:52         10   4  51.895927  Colchester England   \n",
       "\n",
       "          longitude                                            members  \\\n",
       "track_id                                                                 \n",
       "2        -74.405661  Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...   \n",
       "3        -74.405661  Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...   \n",
       "5        -74.405661  Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...   \n",
       "10              NaN                           Kurt Vile, the Violators   \n",
       "20         0.891874                                       Nicky Cook\\n   \n",
       "\n",
       "                name                                   related_projects  \\\n",
       "track_id                                                                  \n",
       "2               AWOL  The list of past projects is 2 long but every1...   \n",
       "3               AWOL  The list of past projects is 2 long but every1...   \n",
       "5               AWOL  The list of past projects is 2 long but every1...   \n",
       "10         Kurt Vile                                                NaN   \n",
       "20        Nicky Cook                                                NaN   \n",
       "\n",
       "                                                       tags  \\\n",
       "track_id                                                      \n",
       "2                                                    [awol]   \n",
       "3                                                    [awol]   \n",
       "5                                                    [awol]   \n",
       "10                                      [philly, kurt vile]   \n",
       "20        [instrumentals, experimental pop, post punk, e...   \n",
       "\n",
       "                                          website wikipedia_page  \n",
       "track_id                                                          \n",
       "2         http://www.AzillionRecords.blogspot.com            NaN  \n",
       "3         http://www.AzillionRecords.blogspot.com            NaN  \n",
       "5         http://www.AzillionRecords.blogspot.com            NaN  \n",
       "10                            http://kurtvile.com            NaN  \n",
       "20                                            NaN            NaN  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "N = 5\n",
    "ipd.display(tracks['track'].head(N))\n",
    "ipd.display(tracks['album'].head(N))\n",
    "ipd.display(tracks['artist'].head(N))"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 2
}