{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Clustering Musical Artists\n", "\n", "*Note: if you are viewing this notebook directly from GitHub, some mathematical symbols might display incorrectly or not display at all. This same notebook can be rendered from nbviewer by following [this link.](http://nbviewer.jupyter.org/github/david-cortes/datascienceprojects/blob/master/machine_learning/clustering_fm_artists.ipynb)*\n", "\n", "This project consists of clustering musical artists using a dataset with the top 50 played artists per user for a random sample of ~360,000 users from Last.fm, which can be found [here](http://ocelma.net/MusicRecommendationDataset/lastfm-360K.html), based on the idea that artists who are preferred by the same users tend to be similar." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[1. Loading and cleaning the data](#1)\n", "* [1.1 Downloading the dataset and generating a sample](#1.1)\n", "* [1.2 Loading and formatting the data](#1.2)\n", "\n", "[2. Establishing Artists' pairwise similarities](#2)\n", "* [2.1 Generating candidate pairs of artists to compare](#2.1)\n", "* [2.2 Converting to cosine distances](#2.2)\n", "\n", "[3. Clustering Artists](#3)\n", "* [Additional clustering without Spark](#3.1)\n", "\n", "[4. Checking cluster sizes and calculating cluster quality metrics](#4)\n", "* [4.1 Checking the sizes of the largest clusters for the different algorithms](#4.1)\n", "* [4.2 Calculating cluster quality metrics](#4.2)\n", "\n", "[5. Checking a sample of the results](#5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1 - Loading and cleaning the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 - Downloading the dataset and generating a sample" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os, sys\n", "\n", "def download_data():\n", "    import tarfile\n", "    from urllib.request import urlretrieve #urllib.URLopener is Python 2 only\n", "    urlretrieve(\"http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz\", \"lastfm-dataset-360K.tar.gz\")\n", "    data_file = tarfile.open(\"lastfm-dataset-360K.tar.gz\", 'r:gz')\n", "    data_file.extractall()\n", " \n", "def generate_sample(file_path=os.getcwd()+'/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv',n=10000):\n", "    with open('small_sample.tsv','w') as out:\n", "        with open(file_path) as f:\n", "            for i in range(n):\n", "                out.write(f.readline())\n", " \n", "#download_data()\n", "# generate_sample()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 - Loading and formatting the data" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 359339 different users\n", "There are 160162 valid artists\n" ] }, { "data": { "text/plain": [ "PythonRDD[202] at RDD at PythonRDD.scala:43" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import os, json\n", "\n", "#Here I'm assuming there's a Spark Context already set up under the name 'sc'\n", "\n", "# dataset=sc.textFile(os.getcwd()+'/small_sample.tsv')\n", "dataset=sc.textFile(os.getcwd()+'/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv')\n", "\n",
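"#Each line of the raw file has four tab-separated fields: user_sha1, artist_mbid, artist_name, plays\n", "#(column layout inferred from the file name; only the first three fields are used below)\n",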
"dataset=dataset.map(lambda x: x.split('\\t')).filter(lambda x: len(x[1])>30) #removing invalid artists (e.g. 'bbc radio')\n", "dataset=dataset.filter(lambda x: x[1]!='89ad4ac3-39f7-470e-963a-56509c546377') #removing 'various artists'\n", "\n", "#converting hash values to integers to speed up the analysis\n", "u_list=dataset.map(lambda x: x[0]).distinct().collect()\n", "users=dict()\n", "for v,k in enumerate(u_list):\n", " users[k]=v\n", "a_list=dataset.map(lambda x: x[1]).distinct().collect()\n", "artists=dict()\n", "for v,k in enumerate(a_list):\n", " artists[k]=v\n", "n_users=len(u_list)\n", "n_artists=len(a_list)\n", "\n", "dataset=dataset.map(lambda x: (users[x[0]],artists[x[1]],x[2])) #the number of plays is not relevant here\n", "dataset.cache()\n", "del u_list, a_list, users, artists\n", "\n", "\n", "print(\"There are \", n_users, \" different users\")\n", "print(\"There are \", n_artists, \" valid artists\")\n", "\n", "\n", "#Generating some useful files for later\n", "\n", "#Artists in this dataset can appear under more than one name\n", "def build_arts_dict(dataset):\n", " return dict(dataset.map(lambda x: (x[1],x[2])).groupByKey().mapValues(list).mapValues(lambda x: x[0]).map(lambda x: [x[0],x[1]]).collect())\n", "\n", "arts_dic=build_arts_dict(dataset)\n", "with open('artists.json', 'w') as outfile:\n", " json.dump(arts_dic, outfile)\n", "del arts_dic\n", "dataset.cache()\n", "dataset.map(lambda x: (x[0],x[1])).saveAsTextFile('processed_data')\n", "\n", "users_per_artist=dict(dataset.map(lambda x: (x[1],x[0])).groupByKey().mapValues(len).map(list).collect())\n", "with open('users_per_artist.json', 'w') as outfile:\n", " json.dump(users_per_artist, outfile)\n", "del users_per_artist\n", "dataset.unpersist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2 - Establishing Artists' pairwise similarities\n", "\n", "Calculating cosine similarities among artists (this penalizes according to the frequencies of artists across users, giving less of a penalty for cases of assimetric frequencies than dice similarity, for example), considering only wheter they are in a user's top played list regardless of the number of plays and only taking into account those with a high-enoguh similarity.\n", "\n", "If we assume that each user has roughly 50-top played artists, and each artist has an equal chance of appearing within any given user's playlist, then the expected cosine similarity between two artists, if everything happened by chance, could be approximated like this:\n", "\n", "$$expected.sim=\\frac{(\\frac{50}{n. artists})^2 \\times n.users}{\\sqrt{50}^2} \\approx 0.0007 $$\n", "\n", "However, since pairs of artists with such similarity would likely not be in the same cluster, it could be a better idea to set an arbitrary threshold instead, so as to decrease the number of pairs. 
A cosine similarity of 0.1 would be equivalent to two artists each appearing in the playlists of 100 users and having 10 users in common; or, in a different case, to one artist appearing in 100 users' playlists and another in 50, having 7 users in common.\n", "\n", "In order to decrease the number of artist pairs to evaluate in the clustering part and make it manageable on a single machine, it would be better to set a minimum requirement for common users among artists - here I set it to 7, so a pair of artists needs at least 7 users in common in order to be assigned a non-zero distance (otherwise there would be artists assigned to the same cluster only because one user listened to both) - as well as a threshold for the cosine similarity, which I set at 4 times the expected value if everything happened at random, in order for a pair to be considered as having a certain similarity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Generating candidate pairs of artists to compare\n", "\n", "This dataset encompasses 360,000 users, each having a list of 50 top-played artists, summing up to 160,000 different artists. Trying to establish similarities between all the artists would imply looping through $160,000 \\times (160,000-1)/2 \\approx 13,000,000,000$ pairs, so it's better to first see which artists have users in common, since of the 13 billion possible pairs there are very few with at least one user in common. Doing it this way implies looping over only $360,000 \\times 50 \\times (50-1)/2 \\approx 441,000,000$ pairs, and in the process it's possible to count how many users each pair of artists has in common, which eases the later distance calculations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "PythonRDD[10] at RDD at PythonRDD.scala:43" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import itertools\n", "from operator import add\n", "\n", "def ordered(pair):\n", "    n1,n2=pair\n", "    if n2>n1:\n", "        return n1,n2\n", "    else:\n", "        return n2,n1\n", "\n", "dataset=sc.textFile('processed_data').map(eval).groupByKey().mapValues(list).map(lambda x: x[1])\n", "dataset=dataset.flatMap(lambda x: [(ordered(i),1) for i in itertools.combinations(x,2)]).reduceByKey(add).map(lambda x: (x[0][0],x[0][1],x[1])).filter(lambda x: x[2]>6)\n", "dataset.cache()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Converting to cosine distances" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 7607557 non-zero pairwise distances\n" ] } ], "source": [ "from math import sqrt\n", "import json\n", "\n", "#Taken from the previous results\n", "n_artists=160162\n", "n_users=359339\n", "threshold=4* ( ((50.0/n_artists)**2)*n_users/(sqrt(50.0)**2) )\n", "\n", "with open('users_per_artist.json') as file:\n", "    users_per_artist=json.load(file)\n", "users_per_artist={int(k):v for k,v in users_per_artist.items()}\n", "bc_dic=sc.broadcast(users_per_artist)\n", "del users_per_artist\n", "\n", "dataset=dataset.map(lambda x: (x[0],x[1],x[2]*1.0/(sqrt(bc_dic.value[x[0]])*sqrt(bc_dic.value[x[1]])))).filter(lambda x: x[2]>threshold)\n", "dataset.cache()\n", "dataset.saveAsTextFile('sims')\n", "print('There are ',dataset.count(),' non-zero 
pairwise distances')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3 - Clustering Artists\n", "\n", "Here I'll produce different clusterings with 100, 200, 500, 700 and 1000 clusters, using power iteration clustering, which provides similar (though usually slightly inferior) results to spectral clustering but runs faster and scales better." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "55637 artists were clustered\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " artist_id cluster100 cluster200 cluster500 cluster700 cluster1000 \\\n", "0 43120 54 56 328 29 976 \n", "1 90797 74 154 103 466 829 \n", "2 10290 56 147 273 30 928 \n", "3 17934 85 78 275 683 853 \n", "4 30429 49 38 299 672 352 \n", "\n", " artist_name \n", "0 sy smith \n", "1 phoebe killdeer and the short straws \n", "2 judge jules \n", "3 strip steve \n", "4 marja mattlar " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pyspark.mllib.clustering import PowerIterationClustering as pic\n", "import pandas as pd\n", "import json\n", "\n", "# dataset=sc.textFile('sims').map(eval)\n", "# dataset.cache()\n", "\n", "n_clusters=100\n", "clusters=pic.train(dataset,n_clusters)\n", "clusters.assignments().map(lambda x: str(x[0])+','+str(x[1])).repartition(1).saveAsTextFile('clusts100')\n", "del clusters\n", "\n", "n_clusters=200\n", "clusters=pic.train(dataset,n_clusters)\n", "clusters.assignments().map(lambda x: str(x[0])+','+str(x[1])).repartition(1).saveAsTextFile('clusts200')\n", "del clusters\n", "\n", "n_clusters=500\n", "clusters=pic.train(dataset,n_clusters)\n", "clusters.assignments().map(lambda x: str(x[0])+','+str(x[1])).repartition(1).saveAsTextFile('clusts500')\n", "del clusters\n", "\n", "n_clusters=700\n", "clusters=pic.train(dataset,n_clusters)\n", "clusters.assignments().map(lambda x: str(x[0])+','+str(x[1])).repartition(1).saveAsTextFile('clusts700')\n", "del clusters\n", "\n", "n_clusters=1000\n", "clusters=pic.train(dataset,n_clusters)\n", "clusters.assignments().map(lambda x: str(x[0])+','+str(x[1])).repartition(1).saveAsTextFile('clusts1000')\n", "del clusters\n", "\n", "\n", "dataset=pd.read_csv('clusts100\\part-00000',header=None)\n", "dataset.columns=['artist_id','cluster100']\n", "dataset200=pd.read_csv('clusts200\\part-00000',header=None)\n", "dataset200.columns=['artist_id','cluster200']\n", "dataset500=pd.read_csv('clusts500\\part-00000',header=None)\n", "dataset500.columns=['artist_id','cluster500']\n", "dataset700=pd.read_csv('clusts700\\part-00000',header=None)\n", "dataset700.columns=['artist_id','cluster700']\n", "dataset1000=pd.read_csv('clusts1000\\part-00000',header=None)\n", "dataset1000.columns=['artist_id','cluster1000']\n", "\n", "\n", "dataset=dataset.merge(dataset200,how='outer',on='artist_id').merge(dataset500,how='outer',on='artist_id').merge(dataset700,how='outer',on='artist_id').merge(dataset1000,how='outer',on='artist_id')\n", "\n", "with open('artists.json') as art:\n", " artists_dict=json.load(art)\n", "artists_dict={int(k):v for k,v in artists_dict.items()}\n", "dataset['artist_name']=[artists_dict[art] for art in dataset['artist_id']]\n", "dataset.to_csv('results_all.csv',index=False)\n", "\n", "del dataset200,dataset500,dataset700,dataset1000\n", "print(dataset.shape[0],' artists were clustered')\n", "dataset.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since most artists are in only one or two user's playlists, it's unreliable (and computationally complex) to cluster them with so few data. That's why only a fraction (around one third) of the artists were considered for the clustering process." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_**Additional:** clustering with scikit-learn (dbscan) and igraph (louvain modularity) (both are non-parallel). 
I chose these parameters and algorithms after some manual experimentation, seeing which ones gave a reasonable spread of artists across clusters. These algorithms have the nice property of automatically determining the number of clusters._" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Rearranging the data format\n", "import re\n", "\n", "dataset=sc.textFile('sims')\n", "dataset=dataset.map(lambda x: re.sub('[\\(\\)\\s]','',x))\n", "dataset.repartition(1).saveAsTextFile('sims_csv')\n", "dataset.unpersist()\n", "del dataset" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " artist_name pic100 pic200 pic500 pic700 \\\n", "0 sy smith 54 56 328 29 \n", "1 phoebe killdeer and the short straws 74 154 103 466 \n", "2 judge jules 56 147 273 30 \n", "3 strip steve 85 78 275 683 \n", "4 marja mattlar 49 38 299 672 \n", "\n", " pic1000 dbsc louvain \n", "0 976 -1 55 \n", "1 829 -1 62 \n", "2 928 -1 55 \n", "3 853 -1 62 \n", "4 352 -1 38 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sklearn, igraph, scipy, re\n", "import pandas as pd\n", "import sklearn.cluster\n", "\n", "dataset=pd.read_csv('sims_csv/part-00000',header=None)\n", "dataset.columns=['art1','art2','sim']\n", "dataset['dist']=[1-i for i in dataset['sim']]\n", "present_artists=set(dataset['art1'].append(dataset['art2']).values.tolist())\n", "new_numer_art_to_int=dict()\n", "new_numer_int_to_art=dict()\n", "count=0\n", "for art in present_artists:\n", " new_numer_art_to_int[art]=count\n", " new_numer_int_to_art[count]=art\n", " count+=1\n", "del present_artists, count\n", "dataset['art1']=[new_numer_art_to_int[i] for i in dataset['art1']]\n", "dataset['art2']=[new_numer_art_to_int[i] for i in dataset['art2']]\n", "\n", "I=dataset['art1'].append(dataset['art2'])\n", "J=dataset['art2'].append(dataset['art1'])\n", "V=dataset['dist'].append(dataset['dist'])\n", "\n", "dataset_matrix=scipy.sparse.csr_matrix((V,(I,J)))\n", "del I,J,V\n", "dataset_matrix\n", "\n", "dbsc=sklearn.cluster.DBSCAN(eps=0.775,metric='precomputed').fit_predict(dataset_matrix)\n", "new_res=pd.Series(range(dataset_matrix.shape[0])).to_frame()\n", "new_res.columns=['artist_id']\n", "new_res['dbsc']=dbsc\n", "del dbsc, dataset_matrix\n", "\n", "g=igraph.Graph(edges=dataset[['art1','art2']].values.tolist(),directed=False)\n", "g.es['weight']=dataset['sim'].values.tolist()\n", "del dataset\n", "louvain_weighted=g.community_multilevel(weights=g.es['weight'])\n", "new_res['louvain']=louvain_weighted.membership\n", "new_res['artist_id']=[new_numer_int_to_art[i] for i in new_res['artist_id']]\n", "\n", "results=pd.read_csv('results_all.csv',engine='python')\n", "results=results.merge(new_res,how='left',on='artist_id')\n", "new_res=new_res.merge(results[['artist_id','cluster100','cluster200','cluster500','cluster700','cluster1000']],how='left',on='artist_id')\n", "cols=results.columns.tolist()\n", "cols=cols[6:7]+cols[1:6]+cols[7:9]\n", "results=results[cols]\n", "results.columns=[re.sub('cluster','pic',i) for i in results.columns]\n", "new_res.columns=[re.sub('cluster','pic',i) for i in new_res.columns]\n", "\n", "results.to_csv('results_all.csv',index=False)\n", "results.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Note: a cluster assignment of -1 means that the row was not asigned to any cluster. In DBSCAN most of the artists are not assigned to any cluster, thus those clusters should be of better quality._" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 4 - Checking cluster sizes and calculating cluster quality metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1 Checking the sizes of the largest clusters for the different algorithms" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pic100 pic200 pic500 pic700 pic1000 dbsc louvain\n", "0 1817 1013 414 287 287 2065 13605\n", "1 1779 1005 361 287 196 724 6314\n", "2 1752 908 360 271 188 289 6001\n", "3 1664 875 350 251 187 261 4418\n", "4 1547 762 349 246 183 217 2518\n", "5 1507 757 343 244 179 187 2514\n", "6 1442 750 337 244 176 177 2470\n", "7 1329 749 332 241 174 176 2224\n", "8 1324 746 332 237 170 176 1893\n", "9 1319 730 332 235 168 169 1633\n", "10 1292 724 329 234 168 162 1552\n", "11 1204 722 327 229 165 159 1097\n", "12 1188 716 326 228 165 158 960\n", "13 1185 696 324 223 163 147 830\n", "14 1184 688 324 221 160 144 828\n", "15 1170 683 321 220 158 144 744\n", "16 1146 678 319 220 158 137 684\n", "17 1144 676 316 218 157 133 641\n", "18 1120 663 314 218 157 120 641\n", "19 1112 656 312 215 157 111 508\n", "20 1084 653 311 215 156 101 496\n", "21 972 652 310 214 156 98 367\n", "22 950 645 305 214 155 85 350\n", "23 907 622 304 211 154 84 328\n", "24 890 622 301 209 153 82 302\n", "25 888 615 300 209 149 80 237\n", "26 880 606 300 208 149 80 215\n", "27 788 594 299 208 148 79 208\n", "28 753 592 299 207 148 75 157\n", "29 724 591 292 207 147 72 102\n", ".. ... ... ... ... ... ... ...\n", "970 2 \n", "971 2 \n", "972 2 \n", "973 2 \n", "974 2 \n", "975 2 \n", "976 2 \n", "977 2 \n", "978 2 \n", "979 2 \n", "980 2 \n", "981 2 \n", "982 2 \n", "983 1 \n", "984 1 \n", "985 1 \n", "986 1 \n", "987 1 \n", "988 1 \n", "989 1 \n", "990 1 \n", "991 1 \n", "992 1 \n", "993 1 \n", "994 1 \n", "995 1 \n", "996 1 \n", "997 1 \n", "998 1 \n", "999 1 \n", "\n", "[1000 rows x 7 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sizes=[pd.Series(results[i].value_counts()) for i in results.columns[1:]]\n", "sizes[5]=sizes[5][1:]\n", "for i in range(len(sizes)):\n", " sizes[i].index=range(len(sizes[i]))\n", "sizes=pd.DataFrame(sizes).transpose()\n", "sizes.columns=results.columns[1:]\n", "sizes.fillna('')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From these results, it can be seen that 1000 clusters was definitely too much, since many artists ended up in their own cluster." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 Calculating cluster quality metrics\n", "\n", "Given the size of this dataset, it's not feasible to calculate typical clustering quality metrics such as the silhouette coefficient or the Dunn index, but some metrics for graph cuts can be used. In this case, I'll use modularity, which can be calculated very efficiently for this dataset. This metric is, however, very sensitive to singleton clusters (clusters of size 1) and favors larger clusters, so in this case it might not be the best decision criteria to see which algorithm did better, but it's a good indicator to have some idea of it. Possible values for modularity range from -0.5 to 1, with more being better. For the case of DBSCAN, however, this metric wouldn't be comparable to other algorithms, since most artists are not assigned to any cluster." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Modularity for Power Iteration Clustering with 100 clusters : 0.037718345357822605\n", "Modularity for Power Iteration Clustering with 200 clusters : 0.019087964635427532\n", "Modularity for Power Iteration Clustering with 500 clusters : 0.0073897684741175\n", "Modularity for Power Iteration Clustering with 700 clusters : 0.005237515826273008\n", "Modularity for Power Iteration Clustering with 1000 clusters : 0.003573255160752189\n", "Modularity for Louvain Modularity ( 76 clusters) : 0.6073314825719774\n", "\n", "Results for DBSCAN:\n", "Number of clusters: 343\n", "Number of artists belonging to a cluster: 11867\n" ] } ], "source": [ "import numpy as np\n", "\n", "print('Modularity for Power Iteration Clustering with 100 clusters :',g.modularity(membership=new_res['pic100'],weights=g.es['weight']))\n", "print('Modularity for Power Iteration Clustering with 200 clusters :',g.modularity(membership=new_res['pic200'],weights=g.es['weight']))\n", "print('Modularity for Power Iteration Clustering with 500 clusters :',g.modularity(membership=new_res['pic500'],weights=g.es['weight']))\n", "print('Modularity for Power Iteration Clustering with 700 clusters :',g.modularity(membership=new_res['pic700'],weights=g.es['weight']))\n", "print('Modularity for Power Iteration Clustering with 1000 clusters :',g.modularity(membership=new_res['pic1000'],weights=g.es['weight']))\n", "print('Modularity for Louvain Modularity (',len(set(louvain_weighted.membership)),'clusters) :',louvain_weighted.modularity)\n", "print()\n", "print(\"Results for DBSCAN:\")\n", "print(\"Number of clusters: \",len(set(results['dbsc']))-1)\n", "print(\"Number of artists belonging to a cluster: \",len(results['dbsc'].loc[results['dbsc']!=-1]))\n", "del g, louvain_weighted, new_res, new_numer_art_to_int, new_numer_int_to_art" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Graph modularity seems to suggest that, in this case, power iteration clustering did far worse than Louvain modularity, but as mentioned before, this metric might not be the most adequate one here and it doesn't necessarily mean that the results are invalid. Good knowledge of the music industry and manual examination of the clusters would be needed to tell. Moreover, the clusters obtained from power iteration clustering also come with different levels of granularity (according to the number of clusters)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Part 5 - Checking a sample of the results\n", "\n", "Here I'll check some of the medium-sized clusters obtained from DBSCAN, which should be of better quality than the ones obtained from the other algorithms, since DBSCAN assigned only a fraction of the artists to a cluster." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " cluster1 cluster2 \\\n", "0 edip akbayram shreya ghosal \n", "1 silahs?z kuvvet juggy d \n", "2 muharrem ertas himesh reshammiya \n", "3 ali akbar moradi r.d.burman \n", "4 rojin b21 \n", "5 cem karaca suzanne \n", "6 emre ayd?n call \n", "7 mozole mirach jagjit singh \n", "8 fuat saka noor jehan \n", "9 sabahat akkiraz dr. zeus \n", "10 emrah vishal bhardwaj \n", "11 nazan Öncel amrinder gill \n", "12 agire jiyan farida khanum \n", "13 ferhat göçer kamal heer \n", "14 athena spb \n", "15 grup yorum sujatha \n", "16 mt bikram singh \n", "17 haluk levent ghulam ali \n", "18 mazhar alanson unni menon \n", "19 sümer ezgü fuzon \n", "20 ajda pekkan shantanu moitra \n", "21 fuat kuldip manak \n", "22 tolga çandar chitra \n", "23 aynur do?an udit narayan & alka yagnik \n", "24 esengül srinivas \n", "25 nilgül salim-sulaiman \n", "26 bulutsuzluk Özlemi babbu maan \n", "27 direc-t mohit chauhan \n", "28 muazzez ersoy bombay jayashree \n", "29 erol parlak ismail darbar \n", ".. ... ... \n", "117 aydilge a.r. rahman \n", "118 bülent ortaçgil mithoon \n", "119 Üç nokta bir gurdas maan \n", "120 a?k?n nur yengi shamur \n", "121 hümeyra shehzad roy \n", "122 gece yolcular? rajesh roshan \n", "123 y?ld?z tilbe ali haider \n", "124 mor ve Ötesi jasbir jassi \n", "125 yeni türkü wadali brothers \n", "126 serdar ortaç sunidhi chauhan \n", "127 ?smail hakk? demircio?lu bally jagpal \n", "128 nefret karthik \n", "129 demir demirkan javed ali \n", "130 leman sam shaan \n", "131 hayko cepkin abida parveen \n", "132 yal?n asha bhosle & kishore kumar \n", "133 aylin asl?m stereo nation \n", "134 fuchs rishi rich \n", "135 berdan mardini sukhwinder singh \n", "136 yüksek sadakat vishal - shekhar \n", "137 pinhani rdb \n", "138 Çelik shail hada \n", "139 hande yener amitabh bachchan \n", "140 soner arica talat mahmood \n", "141 emre altu? aziz mian \n", "142 abluka alarm neeraj shridhar & tulsi kumar \n", "143 cengiz Özkan ali zafar \n", "144 bülent ersoy chitra singh \n", "145 ali ekber cicek bally sagoo \n", "146 betül demir achanak \n", "\n", " cluster3 \\\n", "0 hans albers \n", "1 jimmy makulis \n", "2 heino \n", "3 michael heck \n", "4 petra frey \n", "5 maria & margot hellwig \n", "6 hansi hinterseer \n", "7 lale andersen \n", "8 vikinger \n", "9 nino de angelo \n", "10 das stoakogler trio \n", "11 manuela \n", "12 lys assia \n", "13 g. g. anderson \n", "14 uwe busse \n", "15 mara kayser \n", "16 bernd stelter \n", "17 wencke myhre \n", "18 markus becker \n", "19 die stoakogler \n", "20 klostertaler \n", "21 ireen sheer \n", "22 rex gildo \n", "23 rudi schuricke \n", "24 uta bresan \n", "25 oliver thomas \n", "26 roberto blanco \n", "27 freddy quinn \n", "28 alpenrebellen \n", "29 katja ebstein \n", ".. ... \n", "117 jonny hill \n", "118 nik p. \n", "119 andrea jürgens \n", "120 andreas martin \n", "121 maxi arland \n", "122 axel becker \n", "123 kastelruther spatzen \n", "124 die flippers \n", "125 bata illic \n", "126 matthias reim \n", "127 peter maffay \n", "128 brunner & brunner \n", "129 slavko avsenik und seine original oberkrainer \n", "130 andrea berg \n", "131 die zillertaler \n", "132 marianne & michael \n", "133 jürgen \n", "134 jürgen drews \n", "135 margot eskens \n", "136 ernst mosch und seine original egerländer musi... 
\n", "137 die jungen zillertaler \n", "138 angela wiedl \n", "139 original naabtal duo \n", "140 vico torriani \n", "141 hanne haller \n", "142 gus backus \n", "143 mickie krause \n", "144 juliane werding \n", "145 wenche myhre \n", "146 truck stop \n", "\n", " cluster4 cluster5 \n", "0 ten typ mes david houston \n", "1 bora vince gill \n", "2 pijani powietrzem kitty wells \n", "3 koniec ?wiata buck owens \n", "4 pokahontaz trace adkins \n", "5 lumpex75 eric church \n", "6 coalition janie fricke \n", "7 fu brooks & dunn \n", "8 s?owa we krwi montgomery gentry \n", "9 pogotowie seksualne toby keith \n", "10 2cztery7 lee ann womack \n", "11 strachy na lachy johnny horton \n", "12 hey connie smith \n", "13 kaliber 44 mark chesnutt \n", "14 kazik staszewski donna fargo \n", "15 pezet pam tillis \n", "16 dezerter collin raye \n", "17 pezet-noon eddy arnold \n", "18 all wheel drive clint black \n", "19 zeus jessica andrews \n", "20 kombajn do zbierania kur po wioskach keith whitley \n", "21 muchy patty loveless \n", "22 marek grechuta terri clark \n", "23 cool kids of death bill anderson \n", "24 infekcja mindy mccready \n", "25 konkwista 88 charley pride \n", "26 homomilitia dwight yoakam \n", "27 akurat rascal flatts \n", "28 lady pank vern gosdin \n", "29 east west rockers george jones \n", ".. ... ... \n", "117 farben lehre porter wagoner \n", "118 awantura kellie pickler \n", "119 indios bravos charlie rich \n", "120 pid?ama porno webb pierce \n", "121 daab hank thompson \n", "122 izrael jason aldean \n", "123 tworzywo sztuczne doug stone \n", "124 waglewski fisz emade faron young \n", "125 grzegorz turnau sugarland \n", "126 peja jim ed brown \n", "127 happysad tracy byrd \n", "128 p?omie? 81 jason michael carroll \n", "129 rahim sonny james \n", "130 cymeon x chris cagle \n", "131 april craig morgan \n", "132 t.love bobby bare \n", "133 wwo billy currington \n", "134 btm rodney atkins \n", "135 bia?a gor?czka claude king \n", "136 sound pollution jimmy wayne \n", "137 guernica y luno gene watson \n", "138 ca?a góra barwinków keith anderson \n", "139 lilu reba mcentire \n", "140 maleo reggae rockers johnny lee \n", "141 maria peszek jim reeves \n", "142 defekt muzgó jimmy dean \n", "143 fisz randy travis \n", "144 pih billie jo spears \n", "145 dixon37 dan seals \n", "146 molesta lonestar \n", "\n", "[147 rows x 5 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters=results['dbsc'].value_counts()[10:15].index\n", "clusters=[pd.DataFrame(results[['artist_name']].loc[results['dbsc']==i]) for i in clusters]\n", "for i in range(len(clusters)):\n", " clusters[i].index=range(len(clusters[i]))\n", "clusters=clusters[0].merge(clusters[1],left_index=True, right_index=True).merge(clusters[2],left_index=True, right_index=True).merge(clusters[3],left_index=True, right_index=True).merge(clusters[4],left_index=True, right_index=True)\n", "clusters.columns=['cluster'+str(i) for i in range(1,len(clusters.columns)+1)]\n", "clusters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Trying to interpret the clusters:\n", "* The first cluster seems to contain mostly Turkish musicians of popular and folk music.\n", "* The second cluster seems to contain mostly Indian-Punjabi musicians, also of popular and folk music.\n", "* The third cluster seems to contain mostly German-speaking singers of pop and movie-derived songs.\n", "* The fourth cluster seems to contain mostly east European small artists of alternative rock and indie art.\n", "* The 
fifth cluster seems to contain mostly country artists from the US." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }