{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Clustering Musical Artists\n", "\n", "*Note: if you are viewing this notebook directly from GitHub, some mathematical symbols might display incorrectly or not display at all. This same notebook can be rendered from nbviewer by following [this link.](http://nbviewer.jupyter.org/github/david-cortes/datascienceprojects/blob/master/machine_learning/clustering_fm_artists.ipynb)*\n", "\n", "This project consists of clustering musical artists using a dataset with the top 50 played artists per user for a random sample of ~360,000 users from Last.fm, which can be found [here](http://ocelma.net/MusicRecommendationDataset/lastfm-360K.html), based on the idea that artists who are preferred by the same users tend to be similar." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[1. Loading and cleaning the data](#1)\n", "* [1.1 Downloading the dataset and generating a sample](#1.1)\n", "* [1.2 Loading and formatting the data](#1.2)\n", "\n", "[2. Establishing Artists' pairwise similarities](#2)\n", "* [2.1 Generating candidate pairs of artists to compare](#2.1)\n", "* [2.2 Converting to cosine distances](#2.2)\n", "\n", "[3. Clustering Artists](#3)\n", "* [Additional clustering without Spark](#3.1)\n", "\n", "[4. Checking cluster sizes and calculating cluster quality metrics](#4)\n", "* [4.1 Checking the sizes of the largest clusters for the different algorithms](#4.1)\n", "* [4.2 Calculating cluster quality metrics](#4.2)\n", "\n", "[5. Checking a sample of the results](#5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1 - Loading and cleaning the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 - Downloading the dataset and generating a sample" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os, sys\n", "\n", "def download_data():\n", "    import tarfile\n", "    from urllib.request import urlretrieve #urllib.URLopener is Python 2 only\n", "    urlretrieve(\"http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz\", \"lastfm-dataset-360K.tar.gz\")\n", "    data_file = tarfile.open(\"lastfm-dataset-360K.tar.gz\", 'r:gz')\n", "    data_file.extractall()\n", " \n", "def generate_sample(file_path=os.getcwd()+'/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv',n=10000):\n", "    with open('small_sample.tsv','w') as out:\n", "        with open(file_path) as f:\n", "            for i in range(n):\n", "                out.write(f.readline())\n", " \n", "#download_data()\n", "# generate_sample()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 - Loading and formatting the data" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 359339 different users\n", "There are 160162 valid artists\n" ] }, { "data": { "text/plain": [ "PythonRDD[202] at RDD at PythonRDD.scala:43" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import os, json\n", "\n", "#Here I'm assuming there's a Spark Context already set up under the name 'sc'\n", "\n", "# dataset=sc.textFile(os.getcwd()+'/small_sample.tsv')\n", "dataset=sc.textFile(os.getcwd()+'/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv')\n", "\n",
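"#Each line of the raw file has four tab-separated fields: user_sha1, artist_mbid, artist_name, plays\n", "#(column layout inferred from the file name; only the first three fields are used below)\n",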
"dataset=dataset.map(lambda x: x.split('\\t')).filter(lambda x: len(x[1])>30) #removing invalid artists (e.g. 'bbc radio')\n", "dataset=dataset.filter(lambda x: x[1]!='89ad4ac3-39f7-470e-963a-56509c546377') #removing 'various artists'\n", "\n", "#converting hash values to integers to speed up the analysis\n", "u_list=dataset.map(lambda x: x[0]).distinct().collect()\n", "users=dict()\n", "for v,k in enumerate(u_list):\n", " users[k]=v\n", "a_list=dataset.map(lambda x: x[1]).distinct().collect()\n", "artists=dict()\n", "for v,k in enumerate(a_list):\n", " artists[k]=v\n", "n_users=len(u_list)\n", "n_artists=len(a_list)\n", "\n", "dataset=dataset.map(lambda x: (users[x[0]],artists[x[1]],x[2])) #the number of plays is not relevant here\n", "dataset.cache()\n", "del u_list, a_list, users, artists\n", "\n", "\n", "print(\"There are \", n_users, \" different users\")\n", "print(\"There are \", n_artists, \" valid artists\")\n", "\n", "\n", "#Generating some useful files for later\n", "\n", "#Artists in this dataset can appear under more than one name\n", "def build_arts_dict(dataset):\n", " return dict(dataset.map(lambda x: (x[1],x[2])).groupByKey().mapValues(list).mapValues(lambda x: x[0]).map(lambda x: [x[0],x[1]]).collect())\n", "\n", "arts_dic=build_arts_dict(dataset)\n", "with open('artists.json', 'w') as outfile:\n", " json.dump(arts_dic, outfile)\n", "del arts_dic\n", "dataset.cache()\n", "dataset.map(lambda x: (x[0],x[1])).saveAsTextFile('processed_data')\n", "\n", "users_per_artist=dict(dataset.map(lambda x: (x[1],x[0])).groupByKey().mapValues(len).map(list).collect())\n", "with open('users_per_artist.json', 'w') as outfile:\n", " json.dump(users_per_artist, outfile)\n", "del users_per_artist\n", "dataset.unpersist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2 - Establishing Artists' pairwise similarities\n", "\n", "Calculating cosine similarities among artists (this penalizes according to the frequencies of artists across users, giving less of a penalty for cases of assimetric frequencies than dice similarity, for example), considering only wheter they are in a user's top played list regardless of the number of plays and only taking into account those with a high-enoguh similarity.\n", "\n", "If we assume that each user has roughly 50-top played artists, and each artist has an equal chance of appearing within any given user's playlist, then the expected cosine similarity between two artists, if everything happened by chance, could be approximated like this:\n", "\n", "$$expected.sim=\\frac{(\\frac{50}{n. artists})^2 \\times n.users}{\\sqrt{50}^2} \\approx 0.0007 $$\n", "\n", "However, since pairs of artists with such similarity would likely not be in the same cluster, it could be a better idea to set an arbitrary threshold instead, so as to decrease the number of pairs. 
A cosine similarity of 0.1 would be equivalent to two artists each appearing in the playlists of 100 users and having 10 users in common; or, in a different case, to one artist appearing in 100 users' playlists and another in 50, having 7 users in common.\n", "\n", "In order to decrease the number of artist pairs to evaluate in the clustering part and make it manageable on a single machine, it would be better to set a minimum requirement for common users among artists - here I set it to 7, so a pair of artists needs at least 7 users in common in order to be assigned a non-zero distance (otherwise there would be artists assigned to the same cluster only because one user listened to both) - as well as a threshold for the cosine similarity, which I set at 4 times the expected value if everything happened at random, in order for a pair to be considered as having a certain similarity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Generating candidate pairs of artists to compare\n", "\n", "This dataset encompasses 360,000 users, each having a list of 50 top-played artists, summing up to 160,000 different artists. Trying to establish similarities between all the artists would imply looping through $160,000 \\times (160,000-1)/2 \\approx 13,000,000,000$ pairs, so it's better to first see which artists have users in common, since of the 13 billion possible pairs there are very few with at least one user in common. Doing it this way implies looping over only $360,000 \\times 50 \\times (50-1)/2 \\approx 441,000,000$ pairs, and in the process it's possible to count how many users each pair of artists has in common, which eases the later distance calculations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "PythonRDD[10] at RDD at PythonRDD.scala:43" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import itertools\n", "from operator import add\n", "\n", "def ordered(pair):\n", "    n1,n2=pair\n", "    if n2>n1:\n", "        return n1,n2\n", "    else:\n", "        return n2,n1\n", "\n", "dataset=sc.textFile('processed_data').map(eval).groupByKey().mapValues(list).map(lambda x: x[1])\n", "dataset=dataset.flatMap(lambda x: [(ordered(i),1) for i in itertools.combinations(x,2)]).reduceByKey(add).map(lambda x: (x[0][0],x[0][1],x[1])).filter(lambda x: x[2]>6)\n", "dataset.cache()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Converting to cosine distances" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 7607557 non-zero pairwise distances\n" ] } ], "source": [ "from math import sqrt\n", "import json\n", "\n", "#Taken from the previous results\n", "n_artists=160162\n", "n_users=359339\n", "threshold=4* ( ((50.0/n_artists)**2)*n_users/(sqrt(50.0)**2) )\n", "\n", "with open('users_per_artist.json') as file:\n", "    users_per_artist=json.load(file)\n", "users_per_artist={int(k):v for k,v in users_per_artist.items()}\n", "bc_dic=sc.broadcast(users_per_artist)\n", "del users_per_artist\n", "\n", "dataset=dataset.map(lambda x: (x[0],x[1],x[2]*1.0/(sqrt(bc_dic.value[x[0]])*sqrt(bc_dic.value[x[1]])))).filter(lambda x: x[2]>threshold)\n", "dataset.cache()\n", "dataset.saveAsTextFile('sims')\n", "print('There are ',dataset.count(),' non-zero 
pairwise distances')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3 - Clustering Artists\n", "\n", "Here I'll produce different clusterings with 100, 200, 500, 700 and 1000 clusters, using power iteration clustering, which provides similar (though usually slightly inferior) results to spectral clustering but runs faster and scales better." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "55637 artists were clustered\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " artist_id cluster100 cluster200 cluster500 cluster700 cluster1000 \\\n", "0 43120 54 56 328 29 976 \n", "1 90797 74 154 103 466 829 \n", "2 10290 56 147 273 30 928 \n", "3 17934 85 78 275 683 853 \n", "4 30429 49 38 299 672 352 \n", "\n", " artist_name \n", "0 sy smith \n", "1 phoebe killdeer and the short straws \n", "2 judge jules \n", "3 strip steve \n", "4 marja mattlar " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pyspark.mllib.clustering import PowerIterationClustering as pic\n", "import pandas as pd\n", "import json\n", "\n", "# dataset=sc.textFile('sims').map(eval)\n", "# dataset.cache()\n", "\n", "n_clusters=100\n", "clusters=pic.train(dataset,n_clusters)\n", "clusters.assignments().map(lambda x: str(x[0])+','+str(x[1])).repartition(1).saveAsTextFile('clusts100')\n", "del clusters\n", "\n", "n_clusters=200\n", "clusters=pic.train(dataset,n_clusters)\n", "clusters.assignments().map(lambda x: str(x[0])+','+str(x[1])).repartition(1).saveAsTextFile('clusts200')\n", "del clusters\n", "\n", "n_clusters=500\n", "clusters=pic.train(dataset,n_clusters)\n", "clusters.assignments().map(lambda x: str(x[0])+','+str(x[1])).repartition(1).saveAsTextFile('clusts500')\n", "del clusters\n", "\n", "n_clusters=700\n", "clusters=pic.train(dataset,n_clusters)\n", "clusters.assignments().map(lambda x: str(x[0])+','+str(x[1])).repartition(1).saveAsTextFile('clusts700')\n", "del clusters\n", "\n", "n_clusters=1000\n", "clusters=pic.train(dataset,n_clusters)\n", "clusters.assignments().map(lambda x: str(x[0])+','+str(x[1])).repartition(1).saveAsTextFile('clusts1000')\n", "del clusters\n", "\n", "\n", "dataset=pd.read_csv('clusts100\\part-00000',header=None)\n", "dataset.columns=['artist_id','cluster100']\n", "dataset200=pd.read_csv('clusts200\\part-00000',header=None)\n", "dataset200.columns=['artist_id','cluster200']\n", "dataset500=pd.read_csv('clusts500\\part-00000',header=None)\n", "dataset500.columns=['artist_id','cluster500']\n", "dataset700=pd.read_csv('clusts700\\part-00000',header=None)\n", "dataset700.columns=['artist_id','cluster700']\n", "dataset1000=pd.read_csv('clusts1000\\part-00000',header=None)\n", "dataset1000.columns=['artist_id','cluster1000']\n", "\n", "\n", "dataset=dataset.merge(dataset200,how='outer',on='artist_id').merge(dataset500,how='outer',on='artist_id').merge(dataset700,how='outer',on='artist_id').merge(dataset1000,how='outer',on='artist_id')\n", "\n", "with open('artists.json') as art:\n", " artists_dict=json.load(art)\n", "artists_dict={int(k):v for k,v in artists_dict.items()}\n", "dataset['artist_name']=[artists_dict[art] for art in dataset['artist_id']]\n", "dataset.to_csv('results_all.csv',index=False)\n", "\n", "del dataset200,dataset500,dataset700,dataset1000\n", "print(dataset.shape[0],' artists were clustered')\n", "dataset.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since most artists are in only one or two user's playlists, it's unreliable (and computationally complex) to cluster them with so few data. That's why only a fraction (around one third) of the artists were considered for the clustering process." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_**Additional:** clustering with scikit-learn (dbscan) and igraph (louvain modularity) (both are non-parallel). 
I chose these parameters and algorithms after some manual experimentation, seeing which ones gave a reasonable spread of artists across clusters. These algorithms have the nice property of automatically determining the number of clusters._" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Rearranging the data format\n", "import re\n", "\n", "dataset=sc.textFile('sims')\n", "dataset=dataset.map(lambda x: re.sub('[\\(\\)\\s]','',x))\n", "dataset.repartition(1).saveAsTextFile('sims_csv')\n", "dataset.unpersist()\n", "del dataset" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " artist_name pic100 pic200 pic500 pic700 \\\n", "0 sy smith 54 56 328 29 \n", "1 phoebe killdeer and the short straws 74 154 103 466 \n", "2 judge jules 56 147 273 30 \n", "3 strip steve 85 78 275 683 \n", "4 marja mattlar 49 38 299 672 \n", "\n", " pic1000 dbsc louvain \n", "0 976 -1 55 \n", "1 829 -1 62 \n", "2 928 -1 55 \n", "3 853 -1 62 \n", "4 352 -1 38 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sklearn, igraph, scipy, re\n", "import pandas as pd\n", "import sklearn.cluster\n", "\n", "dataset=pd.read_csv('sims_csv/part-00000',header=None)\n", "dataset.columns=['art1','art2','sim']\n", "dataset['dist']=[1-i for i in dataset['sim']]\n", "present_artists=set(dataset['art1'].append(dataset['art2']).values.tolist())\n", "new_numer_art_to_int=dict()\n", "new_numer_int_to_art=dict()\n", "count=0\n", "for art in present_artists:\n", " new_numer_art_to_int[art]=count\n", " new_numer_int_to_art[count]=art\n", " count+=1\n", "del present_artists, count\n", "dataset['art1']=[new_numer_art_to_int[i] for i in dataset['art1']]\n", "dataset['art2']=[new_numer_art_to_int[i] for i in dataset['art2']]\n", "\n", "I=dataset['art1'].append(dataset['art2'])\n", "J=dataset['art2'].append(dataset['art1'])\n", "V=dataset['dist'].append(dataset['dist'])\n", "\n", "dataset_matrix=scipy.sparse.csr_matrix((V,(I,J)))\n", "del I,J,V\n", "dataset_matrix\n", "\n", "dbsc=sklearn.cluster.DBSCAN(eps=0.775,metric='precomputed').fit_predict(dataset_matrix)\n", "new_res=pd.Series(range(dataset_matrix.shape[0])).to_frame()\n", "new_res.columns=['artist_id']\n", "new_res['dbsc']=dbsc\n", "del dbsc, dataset_matrix\n", "\n", "g=igraph.Graph(edges=dataset[['art1','art2']].values.tolist(),directed=False)\n", "g.es['weight']=dataset['sim'].values.tolist()\n", "del dataset\n", "louvain_weighted=g.community_multilevel(weights=g.es['weight'])\n", "new_res['louvain']=louvain_weighted.membership\n", "new_res['artist_id']=[new_numer_int_to_art[i] for i in new_res['artist_id']]\n", "\n", "results=pd.read_csv('results_all.csv',engine='python')\n", "results=results.merge(new_res,how='left',on='artist_id')\n", "new_res=new_res.merge(results[['artist_id','cluster100','cluster200','cluster500','cluster700','cluster1000']],how='left',on='artist_id')\n", "cols=results.columns.tolist()\n", "cols=cols[6:7]+cols[1:6]+cols[7:9]\n", "results=results[cols]\n", "results.columns=[re.sub('cluster','pic',i) for i in results.columns]\n", "new_res.columns=[re.sub('cluster','pic',i) for i in new_res.columns]\n", "\n", "results.to_csv('results_all.csv',index=False)\n", "results.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Note: a cluster assignment of -1 means that the row was not asigned to any cluster. In DBSCAN most of the artists are not assigned to any cluster, thus those clusters should be of better quality._" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 4 - Checking cluster sizes and calculating cluster quality metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1 Checking the sizes of the largest clusters for the different algorithms" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pic100 pic200 pic500 pic700 pic1000 dbsc louvain\n", "0 1817 1013 414 287 287 2065 13605\n", "1 1779 1005 361 287 196 724 6314\n", "2 1752 908 360 271 188 289 6001\n", "3 1664 875 350 251 187 261 4418\n", "4 1547 762 349 246 183 217 2518\n", "5 1507 757 343 244 179 187 2514\n", "6 1442 750 337 244 176 177 2470\n", "7 1329 749 332 241 174 176 2224\n", "8 1324 746 332 237 170 176 1893\n", "9 1319 730 332 235 168 169 1633\n", "10 1292 724 329 234 168 162 1552\n", "11 1204 722 327 229 165 159 1097\n", "12 1188 716 326 228 165 158 960\n", "13 1185 696 324 223 163 147 830\n", "14 1184 688 324 221 160 144 828\n", "15 1170 683 321 220 158 144 744\n", "16 1146 678 319 220 158 137 684\n", "17 1144 676 316 218 157 133 641\n", "18 1120 663 314 218 157 120 641\n", "19 1112 656 312 215 157 111 508\n", "20 1084 653 311 215 156 101 496\n", "21 972 652 310 214 156 98 367\n", "22 950 645 305 214 155 85 350\n", "23 907 622 304 211 154 84 328\n", "24 890 622 301 209 153 82 302\n", "25 888 615 300 209 149 80 237\n", "26 880 606 300 208 149 80 215\n", "27 788 594 299 208 148 79 208\n", "28 753 592 299 207 148 75 157\n", "29 724 591 292 207 147 72 102\n", ".. ... ... ... ... ... ... ...\n", "970 2 \n", "971 2 \n", "972 2 \n", "973 2 \n", "974 2 \n", "975 2 \n", "976 2 \n", "977 2 \n", "978 2 \n", "979 2 \n", "980 2 \n", "981 2 \n", "982 2 \n", "983 1 \n", "984 1 \n", "985 1 \n", "986 1 \n", "987 1 \n", "988 1 \n", "989 1 \n", "990 1 \n", "991 1 \n", "992 1 \n", "993 1 \n", "994 1 \n", "995 1 \n", "996 1 \n", "997 1 \n", "998 1 \n", "999 1 \n", "\n", "[1000 rows x 7 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sizes=[pd.Series(results[i].value_counts()) for i in results.columns[1:]]\n", "sizes[5]=sizes[5][1:]\n", "for i in range(len(sizes)):\n", " sizes[i].index=range(len(sizes[i]))\n", "sizes=pd.DataFrame(sizes).transpose()\n", "sizes.columns=results.columns[1:]\n", "sizes.fillna('')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From these results, it can be seen that 1000 clusters was definitely too much, since many artists ended up in their own cluster." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 Calculating cluster quality metrics\n", "\n", "Given the size of this dataset, it's not feasible to calculate typical clustering quality metrics such as the silhouette coefficient or the Dunn index, but some metrics for graph cuts can be used. In this case, I'll use modularity, which can be calculated very efficiently for this dataset. This metric is, however, very sensitive to singleton clusters (clusters of size 1) and favors larger clusters, so in this case it might not be the best decision criteria to see which algorithm did better, but it's a good indicator to have some idea of it. Possible values for modularity range from -0.5 to 1, with more being better. For the case of DBSCAN, however, this metric wouldn't be comparable to other algorithms, since most artists are not assigned to any cluster." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Modularity for Power Iteration Clustering with 100 clusters : 0.037718345357822605\n", "Modularity for Power Iteration Clustering with 200 clusters : 0.019087964635427532\n", "Modularity for Power Iteration Clustering with 500 clusters : 0.0073897684741175\n", "Modularity for Power Iteration Clustering with 700 clusters : 0.005237515826273008\n", "Modularity for Power Iteration Clustering with 1000 clusters : 0.003573255160752189\n", "Modularity for Louvain Modularity ( 76 clusters) : 0.6073314825719774\n", "\n", "Results for DBSCAN:\n", "Number of clusters: 343\n", "Number of artists belonging to a cluster: 11867\n" ] } ], "source": [ "import numpy as np\n", "\n", "print('Modularity for Power Iteration Clustering with 100 clusters :',g.modularity(membership=new_res['pic100'],weights=g.es['weight']))\n", "print('Modularity for Power Iteration Clustering with 200 clusters :',g.modularity(membership=new_res['pic200'],weights=g.es['weight']))\n", "print('Modularity for Power Iteration Clustering with 500 clusters :',g.modularity(membership=new_res['pic500'],weights=g.es['weight']))\n", "print('Modularity for Power Iteration Clustering with 700 clusters :',g.modularity(membership=new_res['pic700'],weights=g.es['weight']))\n", "print('Modularity for Power Iteration Clustering with 1000 clusters :',g.modularity(membership=new_res['pic1000'],weights=g.es['weight']))\n", "print('Modularity for Louvain Modularity (',len(set(louvain_weighted.membership)),'clusters) :',louvain_weighted.modularity)\n", "print()\n", "print(\"Results for DBSCAN:\")\n", "print(\"Number of clusters: \",len(set(results['dbsc']))-1)\n", "print(\"Number of artists belonging to a cluster: \",len(results['dbsc'].loc[results['dbsc']!=-1]))\n", "del g, louvain_weighted, new_res, new_numer_art_to_int, new_numer_int_to_art" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Graph modularity seems to suggest that, in this case, power iteration clustering did far worse than Louvain modularity, but as mentioned before, this metric might not be the most adequate one here and it doesn't necessarily mean that the results are invalid. Good knowledge of the music industry and manual examination of the clusters would be needed to tell. Moreover, the clusters obtained from power iteration clustering also come with different levels of granularity (according to the number of clusters)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Part 5 - Checking a sample of the results\n", "\n", "Here I'll check some of the medium-sized clusters obtained from DBSCAN, which should be of better quality than the ones obtained from the other algorithms, since DBSCAN assigned only a fraction of the artists to a cluster." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " cluster1 cluster2 \\\n", "0 edip akbayram shreya ghosal \n", "1 silahs?z kuvvet juggy d \n", "2 muharrem ertas himesh reshammiya \n", "3 ali akbar moradi r.d.burman \n", "4 rojin b21 \n", "5 cem karaca suzanne \n", "6 emre ayd?n call \n", "7 mozole mirach jagjit singh \n", "8 fuat saka noor jehan \n", "9 sabahat akkiraz dr. zeus \n", "10 emrah vishal bhardwaj \n", "11 nazan Öncel amrinder gill \n", "12 agire jiyan farida khanum \n", "13 ferhat göçer kamal heer \n", "14 athena spb \n", "15 grup yorum sujatha \n", "16 mt bikram singh \n", "17 haluk levent ghulam ali \n", "18 mazhar alanson unni menon \n", "19 sümer ezgü fuzon \n", "20 ajda pekkan shantanu moitra \n", "21 fuat kuldip manak \n", "22 tolga çandar chitra \n", "23 aynur do?an udit narayan & alka yagnik \n", "24 esengül srinivas \n", "25 nilgül salim-sulaiman \n", "26 bulutsuzluk Özlemi babbu maan \n", "27 direc-t mohit chauhan \n", "28 muazzez ersoy bombay jayashree \n", "29 erol parlak ismail darbar \n", ".. ... ... \n", "117 aydilge a.r. rahman \n", "118 bülent ortaçgil mithoon \n", "119 Üç nokta bir gurdas maan \n", "120 a?k?n nur yengi shamur \n", "121 hümeyra shehzad roy \n", "122 gece yolcular? rajesh roshan \n", "123 y?ld?z tilbe ali haider \n", "124 mor ve Ötesi jasbir jassi \n", "125 yeni türkü wadali brothers \n", "126 serdar ortaç sunidhi chauhan \n", "127 ?smail hakk? demircio?lu bally jagpal \n", "128 nefret karthik \n", "129 demir demirkan javed ali \n", "130 leman sam shaan \n", "131 hayko cepkin abida parveen \n", "132 yal?n asha bhosle & kishore kumar \n", "133 aylin asl?m stereo nation \n", "134 fuchs rishi rich \n", "135 berdan mardini sukhwinder singh \n", "136 yüksek sadakat vishal - shekhar \n", "137 pinhani rdb \n", "138 Çelik shail hada \n", "139 hande yener amitabh bachchan \n", "140 soner arica talat mahmood \n", "141 emre altu? aziz mian \n", "142 abluka alarm neeraj shridhar & tulsi kumar \n", "143 cengiz Özkan ali zafar \n", "144 bülent ersoy chitra singh \n", "145 ali ekber cicek bally sagoo \n", "146 betül demir achanak \n", "\n", " cluster3 \\\n", "0 hans albers \n", "1 jimmy makulis \n", "2 heino \n", "3 michael heck \n", "4 petra frey \n", "5 maria & margot hellwig \n", "6 hansi hinterseer \n", "7 lale andersen \n", "8 vikinger \n", "9 nino de angelo \n", "10 das stoakogler trio \n", "11 manuela \n", "12 lys assia \n", "13 g. g. anderson \n", "14 uwe busse \n", "15 mara kayser \n", "16 bernd stelter \n", "17 wencke myhre \n", "18 markus becker \n", "19 die stoakogler \n", "20 klostertaler \n", "21 ireen sheer \n", "22 rex gildo \n", "23 rudi schuricke \n", "24 uta bresan \n", "25 oliver thomas \n", "26 roberto blanco \n", "27 freddy quinn \n", "28 alpenrebellen \n", "29 katja ebstein \n", ".. ... \n", "117 jonny hill \n", "118 nik p. \n", "119 andrea jürgens \n", "120 andreas martin \n", "121 maxi arland \n", "122 axel becker \n", "123 kastelruther spatzen \n", "124 die flippers \n", "125 bata illic \n", "126 matthias reim \n", "127 peter maffay \n", "128 brunner & brunner \n", "129 slavko avsenik und seine original oberkrainer \n", "130 andrea berg \n", "131 die zillertaler \n", "132 marianne & michael \n", "133 jürgen \n", "134 jürgen drews \n", "135 margot eskens \n", "136 ernst mosch und seine original egerländer musi... 
\n", "137 die jungen zillertaler \n", "138 angela wiedl \n", "139 original naabtal duo \n", "140 vico torriani \n", "141 hanne haller \n", "142 gus backus \n", "143 mickie krause \n", "144 juliane werding \n", "145 wenche myhre \n", "146 truck stop \n", "\n", " cluster4 cluster5 \n", "0 ten typ mes david houston \n", "1 bora vince gill \n", "2 pijani powietrzem kitty wells \n", "3 koniec ?wiata buck owens \n", "4 pokahontaz trace adkins \n", "5 lumpex75 eric church \n", "6 coalition janie fricke \n", "7 fu brooks & dunn \n", "8 s?owa we krwi montgomery gentry \n", "9 pogotowie seksualne toby keith \n", "10 2cztery7 lee ann womack \n", "11 strachy na lachy johnny horton \n", "12 hey connie smith \n", "13 kaliber 44 mark chesnutt \n", "14 kazik staszewski donna fargo \n", "15 pezet pam tillis \n", "16 dezerter collin raye \n", "17 pezet-noon eddy arnold \n", "18 all wheel drive clint black \n", "19 zeus jessica andrews \n", "20 kombajn do zbierania kur po wioskach keith whitley \n", "21 muchy patty loveless \n", "22 marek grechuta terri clark \n", "23 cool kids of death bill anderson \n", "24 infekcja mindy mccready \n", "25 konkwista 88 charley pride \n", "26 homomilitia dwight yoakam \n", "27 akurat rascal flatts \n", "28 lady pank vern gosdin \n", "29 east west rockers george jones \n", ".. ... ... \n", "117 farben lehre porter wagoner \n", "118 awantura kellie pickler \n", "119 indios bravos charlie rich \n", "120 pid?ama porno webb pierce \n", "121 daab hank thompson \n", "122 izrael jason aldean \n", "123 tworzywo sztuczne doug stone \n", "124 waglewski fisz emade faron young \n", "125 grzegorz turnau sugarland \n", "126 peja jim ed brown \n", "127 happysad tracy byrd \n", "128 p?omie? 81 jason michael carroll \n", "129 rahim sonny james \n", "130 cymeon x chris cagle \n", "131 april craig morgan \n", "132 t.love bobby bare \n", "133 wwo billy currington \n", "134 btm rodney atkins \n", "135 bia?a gor?czka claude king \n", "136 sound pollution jimmy wayne \n", "137 guernica y luno gene watson \n", "138 ca?a góra barwinków keith anderson \n", "139 lilu reba mcentire \n", "140 maleo reggae rockers johnny lee \n", "141 maria peszek jim reeves \n", "142 defekt muzgó jimmy dean \n", "143 fisz randy travis \n", "144 pih billie jo spears \n", "145 dixon37 dan seals \n", "146 molesta lonestar \n", "\n", "[147 rows x 5 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters=results['dbsc'].value_counts()[10:15].index\n", "clusters=[pd.DataFrame(results[['artist_name']].loc[results['dbsc']==i]) for i in clusters]\n", "for i in range(len(clusters)):\n", " clusters[i].index=range(len(clusters[i]))\n", "clusters=clusters[0].merge(clusters[1],left_index=True, right_index=True).merge(clusters[2],left_index=True, right_index=True).merge(clusters[3],left_index=True, right_index=True).merge(clusters[4],left_index=True, right_index=True)\n", "clusters.columns=['cluster'+str(i) for i in range(1,len(clusters.columns)+1)]\n", "clusters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Trying to interpret the clusters:\n", "* The first cluster seems to contain mostly Turkish musicians of popular and folk music.\n", "* The second cluster seems to contain mostly Indian-Punjabi musicians, also of popular and folk music.\n", "* The third cluster seems to contain mostly German-speaking singers of pop and movie-derived songs.\n", "* The fourth cluster seems to contain mostly east European small artists of alternative rock and indie art.\n", "* The 
fifth cluster seems to contain mostly country artists from the US." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }