{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Clustering Musical Artists\n",
"\n",
"*Note: if you are visualizing this notebook directly from GitHub, some mathematical symbols might display incorrectly or not display at all. This same notebook can be rendered from nbviewer by following [this link.](http://nbviewer.jupyter.org/github/david-cortes/datascienceprojects/blob/master/machine_learning/clustering_fm_artists.ipynb)*\n",
"\n",
"This project consists on clustering musical artists using a dataset with the top 50 played artists per user of a random sample of ~360,000 users from Last.fm, which can be found [here](http://ocelma.net/MusicRecommendationDataset/lastfm-360K.html), based on the idea that artists who are preferred by the same users tend to be similar."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[1. Loading and cleaning the data](#1)\n",
"* [1.1 Downloading the dataset and generating a sample](#1.1)\n",
"* [1.2 Downloading and formatting the data](#1.2)\n",
"\n",
"[2. Establishing Artists' pairwise similarities](#2)\n",
"* [2.1 Generating candidate pairs of aritsts to compare](#2.1)\n",
"* [2.2 Converting to cosine distances](#2.2)\n",
"\n",
"[3. Clustering Artists](#3)\n",
"* [Additional clustering without Spark](#3.1)\n",
"\n",
"[4. Checking cluster sizes and calculating cluster quality metrics](#4)\n",
"* [4.1 Checking the sizes of the largest clusters for the different algorithms](#4.1)\n",
"* [4.2 Calculating cluster quality metrics](#4.2)\n",
"\n",
"[5. Checking a sample of the results](#5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1 - Loading and cleaning the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 - Downloading the dataset and generating a sample"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os, sys\n",
"\n",
"def download_data():\n",
" import urllib, tarfile\n",
" data_file = urllib.URLopener()\n",
" data_file.retrieve(\"http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz\", \"lastfm-dataset-360K.tar.gz\")\n",
" data_file = tarfile.open(\"lastfm-dataset-360K.tar.gz\", 'r:gz')\n",
" data_file.extractall()\n",
" \n",
"def generate_sample(file_path=os.getcwd()+'/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv',n=10000):\n",
" with open('small_sample.tsv','w') as out:\n",
" with open(file_path) as f:\n",
" for i in range(n):\n",
" out.write(f.readline())\n",
" \n",
"#download_data()\n",
"# generate_sample()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 - Loading and formatting the data"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 359339 different users\n",
"There are 160162 valid artists\n"
]
},
{
"data": {
"text/plain": [
"PythonRDD[202] at RDD at PythonRDD.scala:43"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import os, json\n",
"\n",
"#Here I'm assuming there's an Spark Context already set up under the name 'sc'\n",
"\n",
"# dataset=sc.textFile(os.getcwd()+'/small_sample.tsv')\n",
"dataset=sc.textFile(os.getcwd()+'/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv')\n",
"\n",
"dataset=dataset.map(lambda x: x.split('\\t')).filter(lambda x: len(x[1])>30) #removing invalid artists (e.g. 'bbc radio')\n",
"dataset=dataset.filter(lambda x: x[1]!='89ad4ac3-39f7-470e-963a-56509c546377') #removing 'various artists'\n",
"\n",
"#converting hash values to integers to speed up the analysis\n",
"u_list=dataset.map(lambda x: x[0]).distinct().collect()\n",
"users=dict()\n",
"for v,k in enumerate(u_list):\n",
" users[k]=v\n",
"a_list=dataset.map(lambda x: x[1]).distinct().collect()\n",
"artists=dict()\n",
"for v,k in enumerate(a_list):\n",
" artists[k]=v\n",
"n_users=len(u_list)\n",
"n_artists=len(a_list)\n",
"\n",
"dataset=dataset.map(lambda x: (users[x[0]],artists[x[1]],x[2])) #the number of plays is not relevant here\n",
"dataset.cache()\n",
"del u_list, a_list, users, artists\n",
"\n",
"\n",
"print(\"There are \", n_users, \" different users\")\n",
"print(\"There are \", n_artists, \" valid artists\")\n",
"\n",
"\n",
"#Generating some useful files for later\n",
"\n",
"#Artists in this dataset can appear under more than one name\n",
"def build_arts_dict(dataset):\n",
" return dict(dataset.map(lambda x: (x[1],x[2])).groupByKey().mapValues(list).mapValues(lambda x: x[0]).map(lambda x: [x[0],x[1]]).collect())\n",
"\n",
"arts_dic=build_arts_dict(dataset)\n",
"with open('artists.json', 'w') as outfile:\n",
" json.dump(arts_dic, outfile)\n",
"del arts_dic\n",
"dataset.cache()\n",
"dataset.map(lambda x: (x[0],x[1])).saveAsTextFile('processed_data')\n",
"\n",
"users_per_artist=dict(dataset.map(lambda x: (x[1],x[0])).groupByKey().mapValues(len).map(list).collect())\n",
"with open('users_per_artist.json', 'w') as outfile:\n",
" json.dump(users_per_artist, outfile)\n",
"del users_per_artist\n",
"dataset.unpersist()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 2 - Establishing Artists' pairwise similarities\n",
"\n",
"Calculating cosine similarities among artists (this penalizes according to the frequencies of artists across users, giving less of a penalty for cases of assimetric frequencies than dice similarity, for example), considering only wheter they are in a user's top played list regardless of the number of plays and only taking into account those with a high-enoguh similarity.\n",
"\n",
"If we assume that each user has roughly 50-top played artists, and each artist has an equal chance of appearing within any given user's playlist, then the expected cosine similarity between two artists, if everything happened by chance, could be approximated like this:\n",
"\n",
"$$expected.sim=\\frac{(\\frac{50}{n. artists})^2 \\times n.users}{\\sqrt{50}^2} \\approx 0.0007 $$\n",
"\n",
"However, since pairs of artists with such similarity would likely not be in the same cluster, it could be a better idea to set an arbitrary threshold instead, so as to decrease the number of pairs. A cosine similarity of 0.1 would be equivalent to two artists appearing each in the playlists of 100 users and having 10 users in common; or in a different case, to an artist appearing in 100 users' playlist and another in 50, having 7 users in common.\n",
"\n",
"In order to decrease the number of artists pairs to evaluate in the clustering part and make it manageable on a single machine, it would be better to set a minimum requirement for common users among artists - here I set it to 7, so a pair of artists should have at least 7 users in common in order for it to be assigned a non-zero distance, otherwise there would be artists assigned to the same cluster only because one user heard both - as well as a threshold for the cosine distance, which I set at 4 times the expected value if everything happened at random, in order for a pair to be considered as having a certain similarity."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1 Generating candidate pairs of aritsts to compare\n",
"\n",
"This dataset encompasses 360,000 users, each having a list of 50 top played artists, summing up to 160,000 different artists. Trying to establish similarities between all the artists would imply looping through $160,000 \\times (160,000-1)/2 \\approx 13,000,000,000$ pairs, so it's better to first see which artists have users in common, as of the 13 billion possible pairs, there are very few with at least one user in common. Doing it this way would imply looping over only $360,000 \\times 50 \\times (50-1)/2 \\approx 441,000,000$ pairs, and in the process, it's possible to count how many users do artists have in common to ease further distance calculations."
]
},
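{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough worked check of the two counts above, using the rounded figures from the text (a sketch, not an exact tally of the dataset):\n",
"\n",
"$$\\frac{160,000 \\times 159,999}{2} = 12,799,920,000 \\approx 1.3 \\times 10^{10}, \\qquad 360,000 \\times \\frac{50 \\times 49}{2} = 360,000 \\times 1,225 = 441,000,000$$\n",
"\n",
"The second figure is better read as an upper bound on the work than as a count of distinct pairs: it assumes every user has a full list of 50 artists, and a pair shared by several users is generated once per user before the counts are added up."
]
},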
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"PythonRDD[10] at RDD at PythonRDD.scala:43"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import itertools\n",
"from operator import add\n",
"\n",
"def ordered(pair):\n",
" n1,n2=pair\n",
" if n2>n1:\n",
" return n1,n2\n",
" else:\n",
" return n2,n1\n",
"\n",
"dataset=sc.textFile('processed_data').map(eval).groupByKey().mapValues(list).map(lambda x: x[1])\n",
"dataset=dataset.flatMap(lambda x: [(ordered(i),1) for i in itertools.combinations(x,2)]).reduceByKey(add).map(lambda x: (x[0][0],x[0][1],x[1])).filter(lambda x: x[2]>6)\n",
"dataset.cache()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 Converting to cosine distances"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 7607557 non-zero pairwise distances\n"
]
}
],
"source": [
"from math import sqrt\n",
"import json\n",
"\n",
"#Taken from the previous results\n",
"n_artists=160162\n",
"n_users=359339\n",
"threshold=4* ( ((50.0/n_artists)**2)*n_users/(sqrt(50.0)**2) )\n",
"\n",
"with open('users_per_artist.json') as file:\n",
" users_per_artist=json.load(file)\n",
"users_per_artist={int(k):v for k,v in users_per_artist.items()}\n",
"bc_dic=sc.broadcast(users_per_artist)\n",
"del users_per_artist\n",
"\n",
"dataset=dataset.map(lambda x: (x[0],x[1],x[2]*1.0/(sqrt(bc_dic.value[x[0]])*sqrt(bc_dic.value[x[1]])))).filter(lambda x: x[2]>threshold)\n",
"dataset.cache()\n",
"dataset.saveAsTextFile('sims')\n",
"print('There are ',dataset.count(),' non-zero pairwise distances')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 3 - Clustering Artists\n",
"\n",
"Here I'll produce different clusterings, using 100, 200, 500, 700 and 1000 clusters usign power iteration clustering, which provides similar (though usually slightly inferior) results to spectral clustering but runs faster and is scalable."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"55637 artists were clustered\n"
]
},
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" artist_id | \n",
" cluster100 | \n",
" cluster200 | \n",
" cluster500 | \n",
" cluster700 | \n",
" cluster1000 | \n",
" artist_name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 43120 | \n",
" 54 | \n",
" 56 | \n",
" 328 | \n",
" 29 | \n",
" 976 | \n",
" sy smith | \n",
"
\n",
" \n",
" 1 | \n",
" 90797 | \n",
" 74 | \n",
" 154 | \n",
" 103 | \n",
" 466 | \n",
" 829 | \n",
" phoebe killdeer and the short straws | \n",
"
\n",
" \n",
" 2 | \n",
" 10290 | \n",
" 56 | \n",
" 147 | \n",
" 273 | \n",
" 30 | \n",
" 928 | \n",
" judge jules | \n",
"
\n",
" \n",
" 3 | \n",
" 17934 | \n",
" 85 | \n",
" 78 | \n",
" 275 | \n",
" 683 | \n",
" 853 | \n",
" strip steve | \n",
"
\n",
" \n",
" 4 | \n",
" 30429 | \n",
" 49 | \n",
" 38 | \n",
" 299 | \n",
" 672 | \n",
" 352 | \n",
" marja mattlar | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" artist_id cluster100 cluster200 cluster500 cluster700 cluster1000 \\\n",
"0 43120 54 56 328 29 976 \n",
"1 90797 74 154 103 466 829 \n",
"2 10290 56 147 273 30 928 \n",
"3 17934 85 78 275 683 853 \n",
"4 30429 49 38 299 672 352 \n",
"\n",
" artist_name \n",
"0 sy smith \n",
"1 phoebe killdeer and the short straws \n",
"2 judge jules \n",
"3 strip steve \n",
"4 marja mattlar "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from pyspark.mllib.clustering import PowerIterationClustering as pic\n",
"import pandas as pd\n",
"import json\n",
"\n",
"# dataset=sc.textFile('sims').map(eval)\n",
"# dataset.cache()\n",
"\n",
"n_clusters=100\n",
"clusters=pic.train(dataset,n_clusters)\n",
"clusters.assignments().map(lambda x: str(x[0])+','+str(x[1])).repartition(1).saveAsTextFile('clusts100')\n",
"del clusters\n",
"\n",
"n_clusters=200\n",
"clusters=pic.train(dataset,n_clusters)\n",
"clusters.assignments().map(lambda x: str(x[0])+','+str(x[1])).repartition(1).saveAsTextFile('clusts200')\n",
"del clusters\n",
"\n",
"n_clusters=500\n",
"clusters=pic.train(dataset,n_clusters)\n",
"clusters.assignments().map(lambda x: str(x[0])+','+str(x[1])).repartition(1).saveAsTextFile('clusts500')\n",
"del clusters\n",
"\n",
"n_clusters=700\n",
"clusters=pic.train(dataset,n_clusters)\n",
"clusters.assignments().map(lambda x: str(x[0])+','+str(x[1])).repartition(1).saveAsTextFile('clusts700')\n",
"del clusters\n",
"\n",
"n_clusters=1000\n",
"clusters=pic.train(dataset,n_clusters)\n",
"clusters.assignments().map(lambda x: str(x[0])+','+str(x[1])).repartition(1).saveAsTextFile('clusts1000')\n",
"del clusters\n",
"\n",
"\n",
"dataset=pd.read_csv('clusts100\\part-00000',header=None)\n",
"dataset.columns=['artist_id','cluster100']\n",
"dataset200=pd.read_csv('clusts200\\part-00000',header=None)\n",
"dataset200.columns=['artist_id','cluster200']\n",
"dataset500=pd.read_csv('clusts500\\part-00000',header=None)\n",
"dataset500.columns=['artist_id','cluster500']\n",
"dataset700=pd.read_csv('clusts700\\part-00000',header=None)\n",
"dataset700.columns=['artist_id','cluster700']\n",
"dataset1000=pd.read_csv('clusts1000\\part-00000',header=None)\n",
"dataset1000.columns=['artist_id','cluster1000']\n",
"\n",
"\n",
"dataset=dataset.merge(dataset200,how='outer',on='artist_id').merge(dataset500,how='outer',on='artist_id').merge(dataset700,how='outer',on='artist_id').merge(dataset1000,how='outer',on='artist_id')\n",
"\n",
"with open('artists.json') as art:\n",
" artists_dict=json.load(art)\n",
"artists_dict={int(k):v for k,v in artists_dict.items()}\n",
"dataset['artist_name']=[artists_dict[art] for art in dataset['artist_id']]\n",
"dataset.to_csv('results_all.csv',index=False)\n",
"\n",
"del dataset200,dataset500,dataset700,dataset1000\n",
"print(dataset.shape[0],' artists were clustered')\n",
"dataset.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since most artists are in only one or two user's playlists, it's unreliable (and computationally complex) to cluster them with so few data. That's why only a fraction (around one third) of the artists were considered for the clustering process."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_**Additional:** clustering with scikit-learn (dbscan) and igraph (louvain modularity) (both are non-parallel). I chose these parameters and algorithms after some manual experimentation seeing which ones give a reasonable spread of artists across clusters. These algorithms have the nice property of automatically determining the number of clusters._"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Rearranging the data format\n",
"import re\n",
"\n",
"dataset=sc.textFile('sims')\n",
"dataset=dataset.map(lambda x: re.sub('[\\(\\)\\s]','',x))\n",
"dataset.repartition(1).saveAsTextFile('sims_csv')\n",
"dataset.unpersist()\n",
"del dataset"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" artist_name | \n",
" pic100 | \n",
" pic200 | \n",
" pic500 | \n",
" pic700 | \n",
" pic1000 | \n",
" dbsc | \n",
" louvain | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" sy smith | \n",
" 54 | \n",
" 56 | \n",
" 328 | \n",
" 29 | \n",
" 976 | \n",
" -1 | \n",
" 55 | \n",
"
\n",
" \n",
" 1 | \n",
" phoebe killdeer and the short straws | \n",
" 74 | \n",
" 154 | \n",
" 103 | \n",
" 466 | \n",
" 829 | \n",
" -1 | \n",
" 62 | \n",
"
\n",
" \n",
" 2 | \n",
" judge jules | \n",
" 56 | \n",
" 147 | \n",
" 273 | \n",
" 30 | \n",
" 928 | \n",
" -1 | \n",
" 55 | \n",
"
\n",
" \n",
" 3 | \n",
" strip steve | \n",
" 85 | \n",
" 78 | \n",
" 275 | \n",
" 683 | \n",
" 853 | \n",
" -1 | \n",
" 62 | \n",
"
\n",
" \n",
" 4 | \n",
" marja mattlar | \n",
" 49 | \n",
" 38 | \n",
" 299 | \n",
" 672 | \n",
" 352 | \n",
" -1 | \n",
" 38 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" artist_name pic100 pic200 pic500 pic700 \\\n",
"0 sy smith 54 56 328 29 \n",
"1 phoebe killdeer and the short straws 74 154 103 466 \n",
"2 judge jules 56 147 273 30 \n",
"3 strip steve 85 78 275 683 \n",
"4 marja mattlar 49 38 299 672 \n",
"\n",
" pic1000 dbsc louvain \n",
"0 976 -1 55 \n",
"1 829 -1 62 \n",
"2 928 -1 55 \n",
"3 853 -1 62 \n",
"4 352 -1 38 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sklearn, igraph, scipy, re\n",
"import pandas as pd\n",
"import sklearn.cluster\n",
"\n",
"dataset=pd.read_csv('sims_csv/part-00000',header=None)\n",
"dataset.columns=['art1','art2','sim']\n",
"dataset['dist']=[1-i for i in dataset['sim']]\n",
"present_artists=set(dataset['art1'].append(dataset['art2']).values.tolist())\n",
"new_numer_art_to_int=dict()\n",
"new_numer_int_to_art=dict()\n",
"count=0\n",
"for art in present_artists:\n",
" new_numer_art_to_int[art]=count\n",
" new_numer_int_to_art[count]=art\n",
" count+=1\n",
"del present_artists, count\n",
"dataset['art1']=[new_numer_art_to_int[i] for i in dataset['art1']]\n",
"dataset['art2']=[new_numer_art_to_int[i] for i in dataset['art2']]\n",
"\n",
"I=dataset['art1'].append(dataset['art2'])\n",
"J=dataset['art2'].append(dataset['art1'])\n",
"V=dataset['dist'].append(dataset['dist'])\n",
"\n",
"dataset_matrix=scipy.sparse.csr_matrix((V,(I,J)))\n",
"del I,J,V\n",
"dataset_matrix\n",
"\n",
"dbsc=sklearn.cluster.DBSCAN(eps=0.775,metric='precomputed').fit_predict(dataset_matrix)\n",
"new_res=pd.Series(range(dataset_matrix.shape[0])).to_frame()\n",
"new_res.columns=['artist_id']\n",
"new_res['dbsc']=dbsc\n",
"del dbsc, dataset_matrix\n",
"\n",
"g=igraph.Graph(edges=dataset[['art1','art2']].values.tolist(),directed=False)\n",
"g.es['weight']=dataset['sim'].values.tolist()\n",
"del dataset\n",
"louvain_weighted=g.community_multilevel(weights=g.es['weight'])\n",
"new_res['louvain']=louvain_weighted.membership\n",
"new_res['artist_id']=[new_numer_int_to_art[i] for i in new_res['artist_id']]\n",
"\n",
"results=pd.read_csv('results_all.csv',engine='python')\n",
"results=results.merge(new_res,how='left',on='artist_id')\n",
"new_res=new_res.merge(results[['artist_id','cluster100','cluster200','cluster500','cluster700','cluster1000']],how='left',on='artist_id')\n",
"cols=results.columns.tolist()\n",
"cols=cols[6:7]+cols[1:6]+cols[7:9]\n",
"results=results[cols]\n",
"results.columns=[re.sub('cluster','pic',i) for i in results.columns]\n",
"new_res.columns=[re.sub('cluster','pic',i) for i in new_res.columns]\n",
"\n",
"results.to_csv('results_all.csv',index=False)\n",
"results.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_Note: a cluster assignment of -1 means that the row was not asigned to any cluster. In DBSCAN most of the artists are not assigned to any cluster, thus those clusters should be of better quality._"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 4 - Checking cluster sizes and calculating cluster quality metrics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.1 Checking the sizes of the largest clusters for the different algorithms"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pic100 | \n",
" pic200 | \n",
" pic500 | \n",
" pic700 | \n",
" pic1000 | \n",
" dbsc | \n",
" louvain | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1817 | \n",
" 1013 | \n",
" 414 | \n",
" 287 | \n",
" 287 | \n",
" 2065 | \n",
" 13605 | \n",
"
\n",
" \n",
" 1 | \n",
" 1779 | \n",
" 1005 | \n",
" 361 | \n",
" 287 | \n",
" 196 | \n",
" 724 | \n",
" 6314 | \n",
"
\n",
" \n",
" 2 | \n",
" 1752 | \n",
" 908 | \n",
" 360 | \n",
" 271 | \n",
" 188 | \n",
" 289 | \n",
" 6001 | \n",
"
\n",
" \n",
" 3 | \n",
" 1664 | \n",
" 875 | \n",
" 350 | \n",
" 251 | \n",
" 187 | \n",
" 261 | \n",
" 4418 | \n",
"
\n",
" \n",
" 4 | \n",
" 1547 | \n",
" 762 | \n",
" 349 | \n",
" 246 | \n",
" 183 | \n",
" 217 | \n",
" 2518 | \n",
"
\n",
" \n",
" 5 | \n",
" 1507 | \n",
" 757 | \n",
" 343 | \n",
" 244 | \n",
" 179 | \n",
" 187 | \n",
" 2514 | \n",
"
\n",
" \n",
" 6 | \n",
" 1442 | \n",
" 750 | \n",
" 337 | \n",
" 244 | \n",
" 176 | \n",
" 177 | \n",
" 2470 | \n",
"
\n",
" \n",
" 7 | \n",
" 1329 | \n",
" 749 | \n",
" 332 | \n",
" 241 | \n",
" 174 | \n",
" 176 | \n",
" 2224 | \n",
"
\n",
" \n",
" 8 | \n",
" 1324 | \n",
" 746 | \n",
" 332 | \n",
" 237 | \n",
" 170 | \n",
" 176 | \n",
" 1893 | \n",
"
\n",
" \n",
" 9 | \n",
" 1319 | \n",
" 730 | \n",
" 332 | \n",
" 235 | \n",
" 168 | \n",
" 169 | \n",
" 1633 | \n",
"
\n",
" \n",
" 10 | \n",
" 1292 | \n",
" 724 | \n",
" 329 | \n",
" 234 | \n",
" 168 | \n",
" 162 | \n",
" 1552 | \n",
"
\n",
" \n",
" 11 | \n",
" 1204 | \n",
" 722 | \n",
" 327 | \n",
" 229 | \n",
" 165 | \n",
" 159 | \n",
" 1097 | \n",
"
\n",
" \n",
" 12 | \n",
" 1188 | \n",
" 716 | \n",
" 326 | \n",
" 228 | \n",
" 165 | \n",
" 158 | \n",
" 960 | \n",
"
\n",
" \n",
" 13 | \n",
" 1185 | \n",
" 696 | \n",
" 324 | \n",
" 223 | \n",
" 163 | \n",
" 147 | \n",
" 830 | \n",
"
\n",
" \n",
" 14 | \n",
" 1184 | \n",
" 688 | \n",
" 324 | \n",
" 221 | \n",
" 160 | \n",
" 144 | \n",
" 828 | \n",
"
\n",
" \n",
" 15 | \n",
" 1170 | \n",
" 683 | \n",
" 321 | \n",
" 220 | \n",
" 158 | \n",
" 144 | \n",
" 744 | \n",
"
\n",
" \n",
" 16 | \n",
" 1146 | \n",
" 678 | \n",
" 319 | \n",
" 220 | \n",
" 158 | \n",
" 137 | \n",
" 684 | \n",
"
\n",
" \n",
" 17 | \n",
" 1144 | \n",
" 676 | \n",
" 316 | \n",
" 218 | \n",
" 157 | \n",
" 133 | \n",
" 641 | \n",
"
\n",
" \n",
" 18 | \n",
" 1120 | \n",
" 663 | \n",
" 314 | \n",
" 218 | \n",
" 157 | \n",
" 120 | \n",
" 641 | \n",
"
\n",
" \n",
" 19 | \n",
" 1112 | \n",
" 656 | \n",
" 312 | \n",
" 215 | \n",
" 157 | \n",
" 111 | \n",
" 508 | \n",
"
\n",
" \n",
" 20 | \n",
" 1084 | \n",
" 653 | \n",
" 311 | \n",
" 215 | \n",
" 156 | \n",
" 101 | \n",
" 496 | \n",
"
\n",
" \n",
" 21 | \n",
" 972 | \n",
" 652 | \n",
" 310 | \n",
" 214 | \n",
" 156 | \n",
" 98 | \n",
" 367 | \n",
"
\n",
" \n",
" 22 | \n",
" 950 | \n",
" 645 | \n",
" 305 | \n",
" 214 | \n",
" 155 | \n",
" 85 | \n",
" 350 | \n",
"
\n",
" \n",
" 23 | \n",
" 907 | \n",
" 622 | \n",
" 304 | \n",
" 211 | \n",
" 154 | \n",
" 84 | \n",
" 328 | \n",
"
\n",
" \n",
" 24 | \n",
" 890 | \n",
" 622 | \n",
" 301 | \n",
" 209 | \n",
" 153 | \n",
" 82 | \n",
" 302 | \n",
"
\n",
" \n",
" 25 | \n",
" 888 | \n",
" 615 | \n",
" 300 | \n",
" 209 | \n",
" 149 | \n",
" 80 | \n",
" 237 | \n",
"
\n",
" \n",
" 26 | \n",
" 880 | \n",
" 606 | \n",
" 300 | \n",
" 208 | \n",
" 149 | \n",
" 80 | \n",
" 215 | \n",
"
\n",
" \n",
" 27 | \n",
" 788 | \n",
" 594 | \n",
" 299 | \n",
" 208 | \n",
" 148 | \n",
" 79 | \n",
" 208 | \n",
"
\n",
" \n",
" 28 | \n",
" 753 | \n",
" 592 | \n",
" 299 | \n",
" 207 | \n",
" 148 | \n",
" 75 | \n",
" 157 | \n",
"
\n",
" \n",
" 29 | \n",
" 724 | \n",
" 591 | \n",
" 292 | \n",
" 207 | \n",
" 147 | \n",
" 72 | \n",
" 102 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 970 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 2 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 971 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 2 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 972 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 2 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 973 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 2 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 974 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 2 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 975 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 2 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 976 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 2 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 977 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 2 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 978 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 2 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 979 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 2 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 980 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 2 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 981 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 2 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 982 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 2 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 983 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 984 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 985 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 986 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 987 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 988 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 989 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 990 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 991 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 992 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 993 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 994 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 995 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 996 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 997 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 998 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" 999 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
"
\n",
"
1000 rows × 7 columns
\n",
"
"
],
"text/plain": [
" pic100 pic200 pic500 pic700 pic1000 dbsc louvain\n",
"0 1817 1013 414 287 287 2065 13605\n",
"1 1779 1005 361 287 196 724 6314\n",
"2 1752 908 360 271 188 289 6001\n",
"3 1664 875 350 251 187 261 4418\n",
"4 1547 762 349 246 183 217 2518\n",
"5 1507 757 343 244 179 187 2514\n",
"6 1442 750 337 244 176 177 2470\n",
"7 1329 749 332 241 174 176 2224\n",
"8 1324 746 332 237 170 176 1893\n",
"9 1319 730 332 235 168 169 1633\n",
"10 1292 724 329 234 168 162 1552\n",
"11 1204 722 327 229 165 159 1097\n",
"12 1188 716 326 228 165 158 960\n",
"13 1185 696 324 223 163 147 830\n",
"14 1184 688 324 221 160 144 828\n",
"15 1170 683 321 220 158 144 744\n",
"16 1146 678 319 220 158 137 684\n",
"17 1144 676 316 218 157 133 641\n",
"18 1120 663 314 218 157 120 641\n",
"19 1112 656 312 215 157 111 508\n",
"20 1084 653 311 215 156 101 496\n",
"21 972 652 310 214 156 98 367\n",
"22 950 645 305 214 155 85 350\n",
"23 907 622 304 211 154 84 328\n",
"24 890 622 301 209 153 82 302\n",
"25 888 615 300 209 149 80 237\n",
"26 880 606 300 208 149 80 215\n",
"27 788 594 299 208 148 79 208\n",
"28 753 592 299 207 148 75 157\n",
"29 724 591 292 207 147 72 102\n",
".. ... ... ... ... ... ... ...\n",
"970 2 \n",
"971 2 \n",
"972 2 \n",
"973 2 \n",
"974 2 \n",
"975 2 \n",
"976 2 \n",
"977 2 \n",
"978 2 \n",
"979 2 \n",
"980 2 \n",
"981 2 \n",
"982 2 \n",
"983 1 \n",
"984 1 \n",
"985 1 \n",
"986 1 \n",
"987 1 \n",
"988 1 \n",
"989 1 \n",
"990 1 \n",
"991 1 \n",
"992 1 \n",
"993 1 \n",
"994 1 \n",
"995 1 \n",
"996 1 \n",
"997 1 \n",
"998 1 \n",
"999 1 \n",
"\n",
"[1000 rows x 7 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sizes=[pd.Series(results[i].value_counts()) for i in results.columns[1:]]\n",
"sizes[5]=sizes[5][1:]\n",
"for i in range(len(sizes)):\n",
" sizes[i].index=range(len(sizes[i]))\n",
"sizes=pd.DataFrame(sizes).transpose()\n",
"sizes.columns=results.columns[1:]\n",
"sizes.fillna('')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From these results, it can be seen that 1000 clusters was definitely too much, since many artists ended up in their own cluster."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.2 Calculating cluster quality metrics\n",
"\n",
"Given the size of this dataset, it's not feasible to calculate typical clustering quality metrics such as the silhouette coefficient or the Dunn index, but some metrics for graph cuts can be used. In this case, I'll use modularity, which can be calculated very efficiently for this dataset. This metric is, however, very sensitive to singleton clusters (clusters of size 1) and favors larger clusters, so in this case it might not be the best decision criteria to see which algorithm did better, but it's a good indicator to have some idea of it. Possible values for modularity range from -0.5 to 1, with more being better. For the case of DBSCAN, however, this metric wouldn't be comparable to other algorithms, since most artists are not assigned to any cluster."
]
},
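{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the usual definition of weighted modularity, which I'm assuming is the quantity being computed here, can be written as follows (the symbols are introduced only for this note):\n",
"\n",
"$$Q=\\frac{1}{2m}\\sum_{i,j}\\left[A_{ij}-\\frac{k_i k_j}{2m}\\right]\\delta(c_i,c_j)$$\n",
"\n",
"where $A_{ij}$ is the weight of the edge between artists $i$ and $j$ (the cosine similarity, or 0 if the pair was filtered out), $k_i=\\sum_j A_{ij}$ is the weighted degree of artist $i$, $m=\\frac{1}{2}\\sum_{i,j}A_{ij}$ is the total edge weight, and $\\delta(c_i,c_j)$ is 1 when both artists are in the same cluster and 0 otherwise. A partition made of many tiny clusters keeps very little edge weight inside clusters, which tends to push $Q$ towards 0."
]
},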
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Modularity for Power Iteration Clustering with 100 clusters : 0.037718345357822605\n",
"Modularity for Power Iteration Clustering with 200 clusters : 0.019087964635427532\n",
"Modularity for Power Iteration Clustering with 500 clusters : 0.0073897684741175\n",
"Modularity for Power Iteration Clustering with 700 clusters : 0.005237515826273008\n",
"Modularity for Power Iteration Clustering with 1000 clusters : 0.003573255160752189\n",
"Modularity for Louvain Modularity ( 76 clusters) : 0.6073314825719774\n",
"\n",
"Results for DBSCAN:\n",
"Number of clusters: 343\n",
"Number of artists belonging to a cluster: 11867\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"print('Modularity for Power Iteration Clustering with 100 clusters :',g.modularity(membership=new_res['pic100'],weights=g.es['weight']))\n",
"print('Modularity for Power Iteration Clustering with 200 clusters :',g.modularity(membership=new_res['pic200'],weights=g.es['weight']))\n",
"print('Modularity for Power Iteration Clustering with 500 clusters :',g.modularity(membership=new_res['pic500'],weights=g.es['weight']),)\n",
"print('Modularity for Power Iteration Clustering with 700 clusters :',g.modularity(membership=new_res['pic700'],weights=g.es['weight']))\n",
"print('Modularity for Power Iteration Clustering with 1000 clusters :',g.modularity(membership=new_res['pic1000'],weights=g.es['weight']))\n",
"print('Modularity for Louvain Modularity (',len(set(louvain_weighted.membership)),'clusters) :',louvain_weighted.modularity)\n",
"print()\n",
"print(\"Results for DBSCAN:\")\n",
"print(\"Number of clusters: \",len(set(results['dbsc']))-1)\n",
"print(\"Number of artists belonging to a cluster: \",len(results['dbsc'].loc[results['dbsc']!=-1]))\n",
"del g, louvain_weighted, new_res, new_numer_art_to_int, new_numer_int_to_art"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Graph modularity seems to suggest that, in this case, power iteration clustering did terribly bad compared to louvain modularity, but as mentioned before, this metric might not be the most adequate and it doesn't necessarily mean that the results are invalid. A good knowledge about the music industry and manual examination would be needed to tell this. Morever, the clusters obtained from power iteration clustering also come with different levels of granularity (accordign to the number of clusters)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## Part 5 - Checking a sample of the results\n",
"\n",
"Here I'll check some of the medium-sized clusters obtained from DBSCAN. These should be of better quality than the ones obtained from the other algorithms, since DBSCAN assigned only a fraction of the artists (11,867 of them, in 343 clusters) to a cluster and labeled the rest as noise."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" cluster1 cluster2 \\\n",
"0 edip akbayram shreya ghosal \n",
"1 silahs?z kuvvet juggy d \n",
"2 muharrem ertas himesh reshammiya \n",
"3 ali akbar moradi r.d.burman \n",
"4 rojin b21 \n",
"5 cem karaca suzanne \n",
"6 emre ayd?n call \n",
"7 mozole mirach jagjit singh \n",
"8 fuat saka noor jehan \n",
"9 sabahat akkiraz dr. zeus \n",
"10 emrah vishal bhardwaj \n",
"11 nazan Öncel amrinder gill \n",
"12 agire jiyan farida khanum \n",
"13 ferhat göçer kamal heer \n",
"14 athena spb \n",
"15 grup yorum sujatha \n",
"16 mt bikram singh \n",
"17 haluk levent ghulam ali \n",
"18 mazhar alanson unni menon \n",
"19 sümer ezgü fuzon \n",
"20 ajda pekkan shantanu moitra \n",
"21 fuat kuldip manak \n",
"22 tolga çandar chitra \n",
"23 aynur do?an udit narayan & alka yagnik \n",
"24 esengül srinivas \n",
"25 nilgül salim-sulaiman \n",
"26 bulutsuzluk Özlemi babbu maan \n",
"27 direc-t mohit chauhan \n",
"28 muazzez ersoy bombay jayashree \n",
"29 erol parlak ismail darbar \n",
".. ... ... \n",
"117 aydilge a.r. rahman \n",
"118 bülent ortaçgil mithoon \n",
"119 Üç nokta bir gurdas maan \n",
"120 a?k?n nur yengi shamur \n",
"121 hümeyra shehzad roy \n",
"122 gece yolcular? rajesh roshan \n",
"123 y?ld?z tilbe ali haider \n",
"124 mor ve Ötesi jasbir jassi \n",
"125 yeni türkü wadali brothers \n",
"126 serdar ortaç sunidhi chauhan \n",
"127 ?smail hakk? demircio?lu bally jagpal \n",
"128 nefret karthik \n",
"129 demir demirkan javed ali \n",
"130 leman sam shaan \n",
"131 hayko cepkin abida parveen \n",
"132 yal?n asha bhosle & kishore kumar \n",
"133 aylin asl?m stereo nation \n",
"134 fuchs rishi rich \n",
"135 berdan mardini sukhwinder singh \n",
"136 yüksek sadakat vishal - shekhar \n",
"137 pinhani rdb \n",
"138 Çelik shail hada \n",
"139 hande yener amitabh bachchan \n",
"140 soner arica talat mahmood \n",
"141 emre altu? aziz mian \n",
"142 abluka alarm neeraj shridhar & tulsi kumar \n",
"143 cengiz Özkan ali zafar \n",
"144 bülent ersoy chitra singh \n",
"145 ali ekber cicek bally sagoo \n",
"146 betül demir achanak \n",
"\n",
" cluster3 \\\n",
"0 hans albers \n",
"1 jimmy makulis \n",
"2 heino \n",
"3 michael heck \n",
"4 petra frey \n",
"5 maria & margot hellwig \n",
"6 hansi hinterseer \n",
"7 lale andersen \n",
"8 vikinger \n",
"9 nino de angelo \n",
"10 das stoakogler trio \n",
"11 manuela \n",
"12 lys assia \n",
"13 g. g. anderson \n",
"14 uwe busse \n",
"15 mara kayser \n",
"16 bernd stelter \n",
"17 wencke myhre \n",
"18 markus becker \n",
"19 die stoakogler \n",
"20 klostertaler \n",
"21 ireen sheer \n",
"22 rex gildo \n",
"23 rudi schuricke \n",
"24 uta bresan \n",
"25 oliver thomas \n",
"26 roberto blanco \n",
"27 freddy quinn \n",
"28 alpenrebellen \n",
"29 katja ebstein \n",
".. ... \n",
"117 jonny hill \n",
"118 nik p. \n",
"119 andrea jürgens \n",
"120 andreas martin \n",
"121 maxi arland \n",
"122 axel becker \n",
"123 kastelruther spatzen \n",
"124 die flippers \n",
"125 bata illic \n",
"126 matthias reim \n",
"127 peter maffay \n",
"128 brunner & brunner \n",
"129 slavko avsenik und seine original oberkrainer \n",
"130 andrea berg \n",
"131 die zillertaler \n",
"132 marianne & michael \n",
"133 jürgen \n",
"134 jürgen drews \n",
"135 margot eskens \n",
"136 ernst mosch und seine original egerländer musi... \n",
"137 die jungen zillertaler \n",
"138 angela wiedl \n",
"139 original naabtal duo \n",
"140 vico torriani \n",
"141 hanne haller \n",
"142 gus backus \n",
"143 mickie krause \n",
"144 juliane werding \n",
"145 wenche myhre \n",
"146 truck stop \n",
"\n",
" cluster4 cluster5 \n",
"0 ten typ mes david houston \n",
"1 bora vince gill \n",
"2 pijani powietrzem kitty wells \n",
"3 koniec ?wiata buck owens \n",
"4 pokahontaz trace adkins \n",
"5 lumpex75 eric church \n",
"6 coalition janie fricke \n",
"7 fu brooks & dunn \n",
"8 s?owa we krwi montgomery gentry \n",
"9 pogotowie seksualne toby keith \n",
"10 2cztery7 lee ann womack \n",
"11 strachy na lachy johnny horton \n",
"12 hey connie smith \n",
"13 kaliber 44 mark chesnutt \n",
"14 kazik staszewski donna fargo \n",
"15 pezet pam tillis \n",
"16 dezerter collin raye \n",
"17 pezet-noon eddy arnold \n",
"18 all wheel drive clint black \n",
"19 zeus jessica andrews \n",
"20 kombajn do zbierania kur po wioskach keith whitley \n",
"21 muchy patty loveless \n",
"22 marek grechuta terri clark \n",
"23 cool kids of death bill anderson \n",
"24 infekcja mindy mccready \n",
"25 konkwista 88 charley pride \n",
"26 homomilitia dwight yoakam \n",
"27 akurat rascal flatts \n",
"28 lady pank vern gosdin \n",
"29 east west rockers george jones \n",
".. ... ... \n",
"117 farben lehre porter wagoner \n",
"118 awantura kellie pickler \n",
"119 indios bravos charlie rich \n",
"120 pid?ama porno webb pierce \n",
"121 daab hank thompson \n",
"122 izrael jason aldean \n",
"123 tworzywo sztuczne doug stone \n",
"124 waglewski fisz emade faron young \n",
"125 grzegorz turnau sugarland \n",
"126 peja jim ed brown \n",
"127 happysad tracy byrd \n",
"128 p?omie? 81 jason michael carroll \n",
"129 rahim sonny james \n",
"130 cymeon x chris cagle \n",
"131 april craig morgan \n",
"132 t.love bobby bare \n",
"133 wwo billy currington \n",
"134 btm rodney atkins \n",
"135 bia?a gor?czka claude king \n",
"136 sound pollution jimmy wayne \n",
"137 guernica y luno gene watson \n",
"138 ca?a góra barwinków keith anderson \n",
"139 lilu reba mcentire \n",
"140 maleo reggae rockers johnny lee \n",
"141 maria peszek jim reeves \n",
"142 defekt muzgó jimmy dean \n",
"143 fisz randy travis \n",
"144 pih billie jo spears \n",
"145 dixon37 dan seals \n",
"146 molesta lonestar \n",
"\n",
"[147 rows x 5 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Labels at positions 10-14 when DBSCAN labels are ranked by size (medium-sized clusters)\n",
"cluster_ids = results['dbsc'].value_counts()[10:15].index\n",
"# One Series of artist names per cluster, re-indexed from 0 so they line up side by side\n",
"columns = [results.loc[results['dbsc'] == i, 'artist_name'].reset_index(drop=True) for i in cluster_ids]\n",
"# The inner join truncates every column to the length of the smallest of the five clusters\n",
"clusters = pd.concat(columns, axis=1, join='inner')\n",
"clusters.columns = ['cluster' + str(i) for i in range(1, len(clusters.columns) + 1)]\n",
"clusters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Trying to interpret the clusters:\n",
"* The first cluster seems to contain mostly Turkish musicians of popular and folk music.\n",
"* The second cluster seems to contain mostly Indian-Punjabi musicians, also of popular and folk music.\n",
"* The third cluster seems to contain mostly German-speaking singers of pop and movie-derived songs.\n",
"* The fourth cluster seems to contain mostly smaller Eastern European artists (many of the names are Polish) of alternative rock and indie music.\n",
"* The fifth cluster seems to contain mostly country artists from the US."
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [Root]",
"language": "python",
"name": "Python [Root]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}