{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Collecting Data from the Spotify Web API using Spotipy\n", "\n", "## About the Spotipy Library:\n", "\n", "From the [official Spotipy docs](https://spotipy.readthedocs.io/en/latest/): \n", ">\"Spotipy is a lightweight Python library for the Spotify Web API. With Spotipy you get full access to all of the music data provided by the Spotify platform.\"\n", "\n", "\n", "## About using the Spotify Web API:\n", "\n", "Spotify offers a number of [API endpoints](https://beta.developer.spotify.com/documentation/web-api/reference/) to access the Spotify data. In this notebook, I used the following:\n", "\n", "- [search endpoint](https://beta.developer.spotify.com/documentation/web-api/reference/search/search/) to get the track IDs \n", "- [audio features endpoint](https://beta.developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/) to get the corresponding audio features.\n", "\n", "The data was collected on several days during the months of April, May and August 2018.\n", "\n", "\n", "## Goal of this notebook:\n", "\n", "The goal is to show how to collect audio features data for tracks from the [official Spotify Web API](https://beta.developer.spotify.com/documentation/web-api/) in order to use it for further analysis/ machine learning which will be part of another notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Setting Up\n", "\n", "The below code is sufficient to set up Spotipy for querying the API endpoint. A more detailed explanation of the whole procedure is available in the [official docs](https://spotipy.readthedocs.io/en/latest/#installation)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import spotipy\n", "from spotipy.oauth2 import SpotifyClientCredentials\n", "\n", "cid =\"xx\" \n", "secret = \"xx\"\n", "\n", "client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)\n", "sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Get the Track ID Data\n", "\n", "The data collection is divided into 2 parts: the track IDs and the audio features. In this step, I'm going to collect 10.000 track IDs from the Spotify API.\n", "\n", "The [search endpoint](https://beta.developer.spotify.com/documentation/web-api/reference/search/search/) used in this step had a few limitations:\n", "\n", "- limit: a maximum of 50 results can be returned per query\n", "- offset: this is the index of the first result to return, so if you want to get the results with the index 50-100 you will need to set the offset to 50 etc.\n", "\n", "Spotify cut down the maximum offset to 10.000 (as of May 2018?), I was lucky enough to do my first collection attempt while it was still 100.000\n", "\n", "My solution: using a nested for loop, I increased the offset by 50 in the outer loop until the maxium limit/ offset was reached. The inner for loop did the actual querying and appending the returned results to appropriate lists which I used afterwards to create my dataframe." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time to run this code (in seconds): 242.2539935503155\n" ] } ], "source": [ "# timeit library to measure the time needed to run this code\n", "import timeit\n", "start = timeit.default_timer()\n", "\n", "# create empty lists where the results are going to be stored\n", "artist_name = []\n", "track_name = []\n", "popularity = []\n", "track_id = []\n", "\n", "for i in range(0,10000,50):\n", " track_results = sp.search(q='year:2018', type='track', limit=50,offset=i)\n", " for i, t in enumerate(track_results['tracks']['items']):\n", " artist_name.append(t['artists'][0]['name'])\n", " track_name.append(t['name'])\n", " track_id.append(t['id'])\n", " popularity.append(t['popularity'])\n", " \n", "\n", "stop = timeit.default_timer()\n", "print ('Time to run this code (in seconds):', stop - start)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. EDA + Data Preparation\n", "\n", "In the next few cells, I'm going to do some exploratory data analysis as well as data preparation of the newly gained data.\n", "\n", "A quick check for the track_id list:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number of elements in the track_id list: 10000\n" ] } ], "source": [ "print('number of elements in the track_id list:', len(track_id))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks good. Now loading the lists in a dataframe." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(10000, 4)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
artist_namepopularitytrack_idtrack_name
0Drake1002G7V7zsVDxg1yRsu7Ew9RJIn My Feelings
1XXXTENTACION973ee8Jmje8o58CHK66QrVC2SAD!
2Tyga965IaHrVsrferBYDm0bDyAByTaste (feat. Offset)
3Cardi B9758q2HKrzhC3ozto2nDdN4zI Like It
4XXXTENTACION950JP9xo3adEtGSdUEISiszLMoonlight
\n", "
" ], "text/plain": [ " artist_name popularity track_id track_name\n", "0 Drake 100 2G7V7zsVDxg1yRsu7Ew9RJ In My Feelings\n", "1 XXXTENTACION 97 3ee8Jmje8o58CHK66QrVC2 SAD!\n", "2 Tyga 96 5IaHrVsrferBYDm0bDyABy Taste (feat. Offset)\n", "3 Cardi B 97 58q2HKrzhC3ozto2nDdN4z I Like It\n", "4 XXXTENTACION 95 0JP9xo3adEtGSdUEISiszL Moonlight" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df_tracks = pd.DataFrame({'artist_name':artist_name,'track_name':track_name,'track_id':track_id,'popularity':popularity})\n", "print(df_tracks.shape)\n", "df_tracks.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000 entries, 0 to 9999\n", "Data columns (total 4 columns):\n", "artist_name 10000 non-null object\n", "popularity 10000 non-null int64\n", "track_id 10000 non-null object\n", "track_name 10000 non-null object\n", "dtypes: int64(1), object(3)\n", "memory usage: 312.6+ KB\n" ] } ], "source": [ "df_tracks.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes, the same track is returned under different track IDs (single, as part of an album etc.).\n", "\n", "This needs to be checked for and corrected if needed." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "524" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# group the entries by artist_name and track_name and check for duplicates\n", "\n", "grouped = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()\n", "grouped[grouped > 1].count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 524 duplicate entries which will be dropped in the next cell:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "df_tracks.drop_duplicates(subset=['artist_name','track_name'], inplace=True)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# doing the same grouping as before to verify the solution\n", "grouped_after_dropping = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()\n", "grouped_after_dropping[grouped_after_dropping > 1].count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time the results are empty. 
Another way of checking this:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "artist_name    0\n", "popularity     0\n", "track_id       0\n", "track_name     0\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_tracks[df_tracks.duplicated(subset=['artist_name','track_name'],keep=False)].count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Checking how many tracks are left now:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(9460, 4)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_tracks.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. Get the Audio Features Data\n", "\n", "With the [audio features endpoint](https://beta.developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/), I will now get the audio features data for my 9460 track IDs.\n", "\n", "The limitation of this endpoint is that a maximum of 100 track IDs can be submitted per query.\n", "\n", "Again, I used a nested for loop. This time the outer loop pulled the track IDs in batches of 100 and the inner loop did the querying and appended the results to the rows list.\n", "\n", "Additionally, I had to implement a check for track IDs that didn't return any audio features (i.e. None was returned), as this was causing issues." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of tracks where no audio features were available: 86\n", "Time to run this code (in seconds): 11.267732854001224\n" ] } ], "source": [ "# again measuring the time\n", "start = timeit.default_timer()\n", "\n", "# empty list, batch size and the counter for None results\n", "rows = []\n", "batchsize = 100\n", "None_counter = 0\n", "\n", "# request the audio features in batches of 100 track IDs\n", "for i in range(0, len(df_tracks['track_id']), batchsize):\n", "    batch = df_tracks['track_id'][i:i+batchsize]\n", "    feature_results = sp.audio_features(batch)\n", "    for t in feature_results:\n", "        if t is None:\n", "            None_counter += 1\n", "        else:\n", "            rows.append(t)\n", "\n", "print('Number of tracks where no audio features were available:', None_counter)\n", "\n", "stop = timeit.default_timer()\n", "print('Time to run this code (in seconds):', stop - start)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. EDA + Data Preparation\n", "\n", "Same as with the first dataset, a quick check of the rows list:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number of elements in the rows list: 9374\n" ] } ], "source": [ "print('number of elements in the rows list:', len(rows))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, I will load the audio features into a dataframe." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape of the dataset: (9374, 18)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
acousticnessanalysis_urldanceabilityduration_msenergyidinstrumentalnesskeylivenessloudnessmodespeechinesstempotime_signaturetrack_hreftypeurivalence
00.00669https://api.spotify.com/v1/audio-analysis/2G7V...0.7382179330.4662G7V7zsVDxg1yRsu7Ew9RJ0.0102080.449-9.43310.1370181.9924https://api.spotify.com/v1/tracks/2G7V7zsVDxg1...audio_featuresspotify:track:2G7V7zsVDxg1yRsu7Ew9RJ0.401
10.25800https://api.spotify.com/v1/audio-analysis/3ee8...0.7401666060.6133ee8Jmje8o58CHK66QrVC20.0037280.123-4.88010.145075.0234https://api.spotify.com/v1/tracks/3ee8Jmje8o58...audio_featuresspotify:track:3ee8Jmje8o58CHK66QrVC20.473
20.02360https://api.spotify.com/v1/audio-analysis/5IaH...0.8842329590.5595IaHrVsrferBYDm0bDyABy0.0000000.101-7.44210.120097.9944https://api.spotify.com/v1/tracks/5IaHrVsrferB...audio_featuresspotify:track:5IaHrVsrferBYDm0bDyABy0.342
30.09900https://api.spotify.com/v1/audio-analysis/58q2...0.8162533900.72658q2HKrzhC3ozto2nDdN4z0.0000050.372-3.99800.1290136.0484https://api.spotify.com/v1/tracks/58q2HKrzhC3o...audio_featuresspotify:track:58q2HKrzhC3ozto2nDdN4z0.650
40.55600https://api.spotify.com/v1/audio-analysis/0JP9...0.9211350900.5370JP9xo3adEtGSdUEISiszL0.0040490.102-5.72300.0804128.0094https://api.spotify.com/v1/tracks/0JP9xo3adEtG...audio_featuresspotify:track:0JP9xo3adEtGSdUEISiszL0.711
\n", "
" ], "text/plain": [ " acousticness analysis_url \\\n", "0 0.00669 https://api.spotify.com/v1/audio-analysis/2G7V... \n", "1 0.25800 https://api.spotify.com/v1/audio-analysis/3ee8... \n", "2 0.02360 https://api.spotify.com/v1/audio-analysis/5IaH... \n", "3 0.09900 https://api.spotify.com/v1/audio-analysis/58q2... \n", "4 0.55600 https://api.spotify.com/v1/audio-analysis/0JP9... \n", "\n", " danceability duration_ms energy id \\\n", "0 0.738 217933 0.466 2G7V7zsVDxg1yRsu7Ew9RJ \n", "1 0.740 166606 0.613 3ee8Jmje8o58CHK66QrVC2 \n", "2 0.884 232959 0.559 5IaHrVsrferBYDm0bDyABy \n", "3 0.816 253390 0.726 58q2HKrzhC3ozto2nDdN4z \n", "4 0.921 135090 0.537 0JP9xo3adEtGSdUEISiszL \n", "\n", " instrumentalness key liveness loudness mode speechiness tempo \\\n", "0 0.01020 8 0.449 -9.433 1 0.1370 181.992 \n", "1 0.00372 8 0.123 -4.880 1 0.1450 75.023 \n", "2 0.00000 0 0.101 -7.442 1 0.1200 97.994 \n", "3 0.00000 5 0.372 -3.998 0 0.1290 136.048 \n", "4 0.00404 9 0.102 -5.723 0 0.0804 128.009 \n", "\n", " time_signature track_href \\\n", "0 4 https://api.spotify.com/v1/tracks/2G7V7zsVDxg1... \n", "1 4 https://api.spotify.com/v1/tracks/3ee8Jmje8o58... \n", "2 4 https://api.spotify.com/v1/tracks/5IaHrVsrferB... \n", "3 4 https://api.spotify.com/v1/tracks/58q2HKrzhC3o... \n", "4 4 https://api.spotify.com/v1/tracks/0JP9xo3adEtG... \n", "\n", " type uri valence \n", "0 audio_features spotify:track:2G7V7zsVDxg1yRsu7Ew9RJ 0.401 \n", "1 audio_features spotify:track:3ee8Jmje8o58CHK66QrVC2 0.473 \n", "2 audio_features spotify:track:5IaHrVsrferBYDm0bDyABy 0.342 \n", "3 audio_features spotify:track:58q2HKrzhC3ozto2nDdN4z 0.650 \n", "4 audio_features spotify:track:0JP9xo3adEtGSdUEISiszL 0.711 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_audio_features = pd.DataFrame.from_dict(rows,orient='columns')\n", "print(\"Shape of the dataset:\", df_audio_features.shape)\n", "df_audio_features.head()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 9374 entries, 0 to 9373\n", "Data columns (total 18 columns):\n", "acousticness 9374 non-null float64\n", "analysis_url 9374 non-null object\n", "danceability 9374 non-null float64\n", "duration_ms 9374 non-null int64\n", "energy 9374 non-null float64\n", "id 9374 non-null object\n", "instrumentalness 9374 non-null float64\n", "key 9374 non-null int64\n", "liveness 9374 non-null float64\n", "loudness 9374 non-null float64\n", "mode 9374 non-null int64\n", "speechiness 9374 non-null float64\n", "tempo 9374 non-null float64\n", "time_signature 9374 non-null int64\n", "track_href 9374 non-null object\n", "type 9374 non-null object\n", "uri 9374 non-null object\n", "valence 9374 non-null float64\n", "dtypes: float64(9), int64(4), object(5)\n", "memory usage: 1.3+ MB\n" ] } ], "source": [ "df_audio_features.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some columns are not needed for the analysis so I will drop them.\n", "\n", "Also the ID column will be renamed to track_id so that it matches the column name from the first dataframe." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(9374, 14)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns_to_drop = ['analysis_url','track_href','type','uri']\n", "df_audio_features.drop(columns_to_drop, axis=1,inplace=True)\n", "\n", "df_audio_features.rename(columns={'id': 'track_id'}, inplace=True)\n", "\n", "df_audio_features.shape" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape of the dataset: (9374, 14)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
artist_namepopularitytrack_idtrack_nameacousticnessdanceabilityduration_msenergyinstrumentalnesskeylivenessloudnessmodespeechinesstempotime_signaturevalence
0Drake1002G7V7zsVDxg1yRsu7Ew9RJIn My Feelings0.006690.7382179330.4660.0102080.449-9.43310.1370181.99240.401
1XXXTENTACION973ee8Jmje8o58CHK66QrVC2SAD!0.258000.7401666060.6130.0037280.123-4.88010.145075.02340.473
2Tyga965IaHrVsrferBYDm0bDyAByTaste (feat. Offset)0.023600.8842329590.5590.0000000.101-7.44210.120097.99440.342
3Cardi B9758q2HKrzhC3ozto2nDdN4zI Like It0.099000.8162533900.7260.0000050.372-3.99800.1290136.04840.650
4XXXTENTACION950JP9xo3adEtGSdUEISiszLMoonlight0.556000.9211350900.5370.0040490.102-5.72300.0804128.00940.711
\n", "
" ], "text/plain": [ " artist_name popularity track_id track_name \\\n", "0 Drake 100 2G7V7zsVDxg1yRsu7Ew9RJ In My Feelings \n", "1 XXXTENTACION 97 3ee8Jmje8o58CHK66QrVC2 SAD! \n", "2 Tyga 96 5IaHrVsrferBYDm0bDyABy Taste (feat. Offset) \n", "3 Cardi B 97 58q2HKrzhC3ozto2nDdN4z I Like It \n", "4 XXXTENTACION 95 0JP9xo3adEtGSdUEISiszL Moonlight \n", "\n", " acousticness danceability duration_ms energy instrumentalness key \\\n", "0 0.00669 0.738 217933 0.466 0.01020 8 \n", "1 0.25800 0.740 166606 0.613 0.00372 8 \n", "2 0.02360 0.884 232959 0.559 0.00000 0 \n", "3 0.09900 0.816 253390 0.726 0.00000 5 \n", "4 0.55600 0.921 135090 0.537 0.00404 9 \n", "\n", " liveness loudness mode speechiness tempo time_signature valence \n", "0 0.449 -9.433 1 0.1370 181.992 4 0.401 \n", "1 0.123 -4.880 1 0.1450 75.023 4 0.473 \n", "2 0.101 -7.442 1 0.1200 97.994 4 0.342 \n", "3 0.372 -3.998 0 0.1290 136.048 4 0.650 \n", "4 0.102 -5.723 0 0.0804 128.009 4 0.711 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# merge both dataframes\n", "# the 'inner' method will make sure that we only keep track IDs present in both datasets\n", "df = pd.merge(df_tracks,df_audio_features,on='track_id',how='inner')\n", "print(\"Shape of the dataset:\", df_audio_features.shape)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 9374 entries, 0 to 9373\n", "Data columns (total 17 columns):\n", "artist_name 9374 non-null object\n", "popularity 9374 non-null int64\n", "track_id 9374 non-null object\n", "track_name 9374 non-null object\n", "acousticness 9374 non-null float64\n", "danceability 9374 non-null float64\n", "duration_ms 9374 non-null int64\n", "energy 9374 non-null float64\n", "instrumentalness 9374 non-null float64\n", "key 9374 non-null int64\n", "liveness 9374 non-null float64\n", "loudness 9374 non-null float64\n", "mode 9374 non-null int64\n", "speechiness 9374 non-null float64\n", "tempo 9374 non-null float64\n", "time_signature 9374 non-null int64\n", "valence 9374 non-null float64\n", "dtypes: float64(9), int64(5), object(3)\n", "memory usage: 1.3+ MB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just in case, checking for any duplicate tracks:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
artist_namepopularitytrack_idtrack_nameacousticnessdanceabilityduration_msenergyinstrumentalnesskeylivenessloudnessmodespeechinesstempotime_signaturevalence
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [artist_name, popularity, track_id, track_name, acousticness, danceability, duration_ms, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, time_signature, valence]\n", "Index: []" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.duplicated(subset=['artist_name','track_name'],keep=False)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Everything seems to be fine so I will save the dataframe as a .csv file." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "df.to_csv('SpotifyAudioFeatures08082018.csv')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 2 }