{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Clustering in Real World\n", "> A Summary of lecture \"Cluster Analysis in Python\", via datacamp\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: images/batman.jpg" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dominant colors in images\n", "- Dominant colors in images\n", "    - All images consist of pixels\n", "    - Each pixel has three values: R, G, B\n", "    - Pixel color: combination of these RGB values\n", "    - Perform k-means on standardized RGB values to find cluster centers\n", "    - Uses: Identifying features in satellite images" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract RGB values from image\n", "There are broadly three steps to find the dominant colors in an image:\n", "\n", "- Extract RGB values into three lists.\n", "- Perform k-means clustering on scaled RGB values.\n", "- Display the colors of cluster centers.\n", "\n", "![batman](dataset/batman.jpg)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(169, 269, 3)\n" ] } ], "source": [ "import matplotlib.image as img\n", "\n", "r = []\n", "g = []\n", "b = []\n", "\n", "# Read batman image and print dimensions\n", "batman_image = img.imread('./dataset/batman.jpg')\n", "print(batman_image.shape)\n", "\n", "# Store RGB values of all pixels in lists r, g, and b\n", "for row in batman_image:\n", "    for temp_r, temp_g, temp_b in row:\n", "        r.append(temp_r)\n", "        g.append(temp_g)\n", "        b.append(temp_b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How many dominant colors?\n", "Construct 
an elbow plot with the data frame. How many dominant colors are present?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Preprocess" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from scipy.cluster.vq import whiten\n", "\n", "batman_df = pd.DataFrame({'red':r, 'blue':b, 'green':g})\n", "batman_df['scaled_red'] = whiten(batman_df['red'])\n", "batman_df['scaled_blue'] = whiten(batman_df['blue'])\n", "batman_df['scaled_green'] = whiten(batman_df['green'])" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from scipy.cluster.vq import kmeans\n", "\n", "distortions = []\n", "num_clusters = range(1, 7)\n", "\n", "# Create a list of distortions from the kmeans function\n", "# (columns are passed in red, green, blue order so each cluster center is an (r, g, b) triple)\n", "for i in num_clusters:\n", "    cluster_centers, distortion = kmeans(batman_df[['scaled_red', 'scaled_green', 'scaled_blue']], i)\n", "    distortions.append(distortion)\n", "\n", "# Create a data frame with two lists, num_clusters and distortions\n", "elbow_plot = pd.DataFrame({'num_clusters': num_clusters, 'distortions': distortions})\n", "\n", "# Create a line plot of num_clusters and distortions\n", "sns.lineplot(x='num_clusters', y='distortions', data=elbow_plot);\n", "plt.xticks(num_clusters);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Display dominant colors\n", "To display the dominant colors, convert the cluster centers from standardized values back to raw RGB values, then rescale them to the range 0-1 using the following formula: \n", "```python\n", "converted_pixel = standardized_pixel * pixel_std / 255\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXwAAABbCAYAAABwOT7wAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAHeUlEQVR4nO3db6jWZx3H8fdnxzNXrpZuukQtV4k0Fqw6CCFEbNlcDR3VhkLDweQQZFv0oJRg1J5kPaie7MnYpL/kwvXnsEmboSO22uZx6ZaazWQwUbBlq86ghuvTg/MbHE6385zz+93nd+5zfV5wc//+XN7X90L8nIvL330d2SYiIma/i9ouICIipkcCPyKiEAn8iIhCJPAjIgqRwI+IKEQCPyKiELUCX9ICSXskvVC9zz9Pu9clHaxeQ3X6jIiIqVGd5/AlfRs4a3u7pK3AfNtf7dBuxPalNeqMiIia6gb+MeBjtk9LWgw8bntlh3YJ/IiIltVdw7/S9mmA6n3RedpdImlY0lOSbq7ZZ0RETMGcCzWQ9BvgnR1ufW0S/bzL9ilJ7wH2Snre9l869DUIDFanH9YkOug1i/rntl1CV7264Hw/+3tf//wFbZfQVSMnjrZdQlet6H9r2yV01ZFXX3nZ9sJO96ZlSWfcn/k+8LDtXW/W7iLJ/XP6plzbTLdl8XvbLqGrDmy8s+0SumbRp29tu4SuenLDqrZL6KrHllzbdglddc2Tvzxge6DTvbpLOkPApup4E/Cr8Q0kzZc0tzq+AlgNHKnZb0RETFLdwN8OrJH0ArCmOkfSgKT7qzbvB4YlHQL2AdttJ/AjIqbZBdfw34ztvwHXd7g+DGyujn8HfKBOPxERUV++aRsRUYgEfkREIRL4ERGFSOBHRBQigR8RUYgEfkREIRL4ERGFSOBHRBQigR8RUYgEfkREIRL4ERGFSOBHRBQigR8RUYgEfkREIRL4ERGFSOBHRBQigR8RUYgEfkREIRL4ERGFaCTwJa2VdEzScUlbO9yfK+nB6v7TkpY30W9ERExc7cCX1AfcC9wIXA1slHT1uGZ3AH+3/T7gu8C36vYbERGT08QMfxVw3PYJ268BO4H149qsB35QHe8CrpekBvqOiIgJaiLwlwAvjTk/WV3r2Mb2OeAfwOUN9B0RERM0p4HP6DRT9xTaIGkQGGygpoiIGKeJGf5JYNmY86XAqfO1kTQHuAw4O/6DbN9ne8D2QNZ7IiKa1UTg7wdWSLpK0sXABmBoXJshYFN1/Flgr+3/m+FHRET31F7SsX1O0hbgUaAP2GH7sKR7gGHbQ8ADwI8kHWd0Zr+hbr8RETE5TazhY3s3sHvctbvHHP8buKWJviIiYmryTduIiEIk8CMiCpHAj4goRAI/IqIQCfyIiEIk8CMiCpHAj4goRAI/IqIQCfyIiEIk8CMiCpHAj4goRAI/IqIQCfyIiEIk8CMiCpHAj4goRAI/IqIQCfyIiEIk8CMiCpHAj4goRCOBL2mtpGOSjkva2uH+7ZL+Kulg9drcRL8RETFxtX+JuaQ+4F5gDXAS2C9pyPaRcU0ftL2lbn8RETE1TczwVwHHbZ+w/RqwE1jfwOdGRESDmgj8JcBLY85PVtfG+4yk5yTtkrSsgX4jImISZLveB0i3ADfY3lyd3wassv3FMW0uB0Zs/0fS54FbbV/X4bMGgcHqdCVwrFZxk3MF8PI09jfdMr7elvH1ruke27ttL+x0o4nA/wjwdds3VOfbAGx/8zzt+4Czti+r1XHDJA3bHmi7jm7J+Hpbxte7ZtLYmljS2Q+skHSVpIuBDcDQ2AaSFo85XQccbaDfiIiYhNpP6dg+J2kL8CjQB+ywfVjSPcCw7SHgTknrgHPAWeD2uv1GRMTk1A58ANu7gd3jrt095ngbsK2JvrrovrYL6LKMr7dlfL1rxoyt9hp+RET0hmytEBFRiAQ+F94aopdJ2iHpjKQ/tl1LN0haJmmfpKO
SDku6q+2amiLpEknPSDpUje0bbdfUDZL6JP1B0sNt19I0SS9Ker7aUma49XpKX9KpHhP9M2O2hgA2dtgaoidJ+igwAvzQ9jVt19O06gmwxbaflfQ24ABw82z4+5MkYJ7tEUn9wBPAXbafarm0Rkn6MjAAvN32TW3X0yRJLwIDtmfEdwwyw5/lW0PY/i2jT0bNSrZP2362Ov4Xo4/8dvqmd8/xqJHqtL96zaoZmqSlwKeA+9uupQQJ/IlvDREznKTlwAeBp9utpDnVcsdB4Aywx/asGVvle8BXgP+2XUiXGHhM0oFqJ4FWJfBBHa7NqllUCSRdCjwEfMn2P9uupym2X7d9LbAUWCVp1izLSboJOGP7QNu1dNFq2x8CbgS+UC2xtiaBPzqjH7uZ21LgVEu1xBRU69sPAT+x/fO26+kG268AjwNrWy6lSauBddU6907gOkk/brekZtk+Vb2fAX7B6BJyaxL4E9gaImau6j82HwCO2v5O2/U0SdJCSe+ojt8CfBz4U7tVNcf2NttLbS9n9N/dXtufa7msxkiaVz1IgKR5wCeAVp+WKz7wbZ8D3tga4ijwM9uH262qOZJ+CvweWCnppKQ72q6pYauB2xidHb7xG9U+2XZRDVkM7JP0HKMTkz22Z92ji7PYlcATkg4BzwCP2P51mwUV/1hmREQpip/hR0SUIoEfEVGIBH5ERCES+BERhUjgR0QUIoEfEVGIBH5ERCES+BERhfgfDcE5kV85BhcAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "colors = []\n", "\n", "# Get standard deviations of each color\n", "r_std, g_std, b_std = batman_df[['red', 'green', 'blue']].std()\n", "\n", "for cluster_center in cluster_centers:\n", " scaled_r, scaled_g, scaled_b = cluster_center\n", " # Convert each standardized value to scaled value\n", " colors.append((\n", " scaled_r * r_std / 255.0,\n", " scaled_g * g_std / 255.0,\n", " scaled_b * b_std / 255.0\n", " )\n", " )\n", " \n", "# Display colors of cluster centers\n", "plt.imshow([colors])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Document clustering\n", "- Document clustering: concepts\n", " - 1. Clean data before processing\n", " - 2. Determine the importance of the terms in a document (in tf-idf matrix)\n", " - 3. Cluster the tf-idf matrix\n", " - 4. Find top terms, documents in each cluster\n", "- TF-IDF (Term Frequency - Inverse Document Frequency)\n", " - A weighted measure: evaluate how important a word is to a document in a collection\n", "- Top terms per cluster\n", " - Cluster centers: lists with a size equal to the number of terms\n", " - Each value in the cluster center is its importance\n", "- More considerations\n", " - Work with hyperlinks, emoticons etc.\n", " - Normalize words (e.g. run, ran, running -> run)\n", " - ```.todense()``` may not work with large datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TF-IDF of movie plots\n", "Let us use the plots of randomly selected movies to perform document clustering on. Before performing clustering on documents, they need to be cleaned of any unwanted noise (such as special characters and stop words) and converted into a sparse matrix through TF-IDF of the documents.\n", "\n", "Use the ```TfidfVectorizer``` class to perform the TF-IDF of movie plots stored in the list ```plots```. 
The ```remove_noise()``` function is available to use as a ```tokenizer``` in the ```TfidfVectorizer``` class. The ```.fit_transform()``` method fits the ```TfidfVectorizer``` object to the data and then generates the TF-IDF sparse matrix.\n", "\n", "**Note: It takes a few seconds to run the ```.fit_transform()``` method.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Preprocess" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>Title</th><th>Plot</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>The Ballad of Cable Hogue</td><td>Cable Hogue is isolated in the desert, awaitin...</td></tr>\n", "    <tr><th>1</th><td>Monsters vs. Aliens</td><td>In the far reaches of space, a planet explodes...</td></tr>\n", "    <tr><th>2</th><td>The Bandit Queen</td><td>Zarra Montalvo is the daughter of an American ...</td></tr>\n", "    <tr><th>3</th><td>Broken Arrow</td><td>Major Vic Deakins (John Travolta) and Captain ...</td></tr>\n", "    <tr><th>4</th><td>Dolemite</td><td>Dolemite is a pimp and nightclub owner who is ...</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Title \\\n", "0 The Ballad of Cable Hogue \n", "1 Monsters vs. Aliens \n", "2 The Bandit Queen \n", "3 Broken Arrow \n", "4 Dolemite \n", "\n", " Plot \n", "0 Cable Hogue is isolated in the desert, awaitin... \n", "1 In the far reaches of space, a planet explodes... \n", "2 Zarra Montalvo is the daughter of an American ... \n", "3 Major Vic Deakins (John Travolta) and Captain ... \n", "4 Dolemite is a pimp and nightclub owner who is ... " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie = pd.read_csv('./dataset/movies_plot.csv')\n", "movie.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "plots = movie['Plot'].values" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package punkt to\n", "[nltk_data] C:\\Users\\kcsgo\\AppData\\Roaming\\nltk_data...\n", "[nltk_data] Package punkt is already up-to-date!\n" ] } ], "source": [ "from nltk.tokenize import word_tokenize\n", "import re\n", "\n", "import nltk\n", "nltk.download('punkt')\n", "\n", "def remove_noise(text, stop_words = []):\n", "    tokens = word_tokenize(text)\n", "    cleaned_tokens = []\n", "    for token in tokens:\n", "        # Strip every non-alphanumeric character from the token\n", "        token = re.sub('[^A-Za-z0-9]+', '', token)\n", "        # Keep tokens longer than one character that are not stop words\n", "        if len(token) > 1 and token.lower() not in stop_words:\n", "            # Get lowercase\n", "            cleaned_tokens.append(token.lower())\n", "    return cleaned_tokens" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "# Initialize TfidfVectorizer\n", "tfidf_vectorizer = TfidfVectorizer(min_df=0.1, max_df=0.75, max_features=50, tokenizer=remove_noise)\n", "\n", "# Use the .fit_transform() on the list plots\n", "tfidf_matrix = tfidf_vectorizer.fit_transform(plots)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 
Top terms in movie clusters\n", "Now that you have created a sparse matrix, generate cluster centers and print the top three terms in each cluster. Use the ```.todense()``` method to convert the sparse matrix, ```tfidf_matrix```, to a dense matrix for the ```kmeans()``` function to process. Then, use the ```.get_feature_names()``` method to get a list of terms in the ```tfidf_vectorizer``` object. The ```zip()``` function in Python pairs up the elements of two lists.\n", "\n", "With more data points, the clusters would be more clearly defined. However, that requires more computational power than is practical for an exercise here.\n", "\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['her', 'she', 'him']\n", "['him', 'they', 'who']\n" ] } ], "source": [ "num_clusters = 2\n", "\n", "# Generate cluster centers through the kmeans function\n", "cluster_centers, distortion = kmeans(tfidf_matrix.todense(), num_clusters)\n", "\n", "# Generate terms from the tfidf_vectorizer object\n", "# (scikit-learn 1.2+ removed get_feature_names(); use get_feature_names_out() there)\n", "terms = tfidf_vectorizer.get_feature_names()\n", "\n", "for i in range(num_clusters):\n", "    # Map each term to its weight in this cluster center, then print the top 3 terms\n", "    center_terms = dict(zip(terms, list(cluster_centers[i])))\n", "    sorted_terms = sorted(center_terms, key=center_terms.get, reverse=True)\n", "    print(sorted_terms[:3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clustering with multiple features\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic checks on clusters\n", "In the FIFA 18 dataset, we have concentrated on defenders in previous exercises. Let us now focus on the attacking attributes of a player. Pace (```pac```), Dribbling (```dri```) and Shooting (```sho```) are features that are present in attack-minded players. In this exercise, k-means clustering has already been applied to the data using the scaled values of these three attributes. 
Try some basic checks on the clusters so formed.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Preprocess" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>ID</th><th>name</th><th>full_name</th><th>club</th><th>club_logo</th><th>special</th><th>age</th><th>league</th><th>birth_date</th><th>height_cm</th><th>...</th><th>prefers_cb</th><th>prefers_lb</th><th>prefers_lwb</th><th>prefers_ls</th><th>prefers_lf</th><th>prefers_lam</th><th>prefers_lcm</th><th>prefers_ldm</th><th>prefers_lcb</th><th>prefers_gk</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>20801</td><td>Cristiano Ronaldo</td><td>C. Ronaldo dos Santos Aveiro</td><td>Real Madrid CF</td><td>https://cdn.sofifa.org/18/teams/243.png</td><td>2228</td><td>32</td><td>Spanish Primera División</td><td>1985-02-05</td><td>185.0</td><td>...</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td></tr>\n", "    <tr><th>1</th><td>158023</td><td>L. Messi</td><td>Lionel Messi</td><td>FC Barcelona</td><td>https://cdn.sofifa.org/18/teams/241.png</td><td>2158</td><td>30</td><td>Spanish Primera División</td><td>1987-06-24</td><td>170.0</td><td>...</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td></tr>\n", "    <tr><th>2</th><td>190871</td><td>Neymar</td><td>Neymar da Silva Santos Jr.</td><td>Paris Saint-Germain</td><td>https://cdn.sofifa.org/18/teams/73.png</td><td>2100</td><td>25</td><td>French Ligue 1</td><td>1992-02-05</td><td>175.0</td><td>...</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td></tr>\n", "    <tr><th>3</th><td>176580</td><td>L. Suárez</td><td>Luis Suárez</td><td>FC Barcelona</td><td>https://cdn.sofifa.org/18/teams/241.png</td><td>2291</td><td>30</td><td>Spanish Primera División</td><td>1987-01-24</td><td>182.0</td><td>...</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td></tr>\n", "    <tr><th>4</th><td>167495</td><td>M. Neuer</td><td>Manuel Neuer</td><td>FC Bayern Munich</td><td>https://cdn.sofifa.org/18/teams/21.png</td><td>1493</td><td>31</td><td>German Bundesliga</td><td>1986-03-27</td><td>193.0</td><td>...</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>True</td></tr>\n", "  </tbody>\n", "</table>\n", "<p>5 rows × 185 columns</p>\n", "</div>
" ], "text/plain": [ " ID name full_name \\\n", "0 20801 Cristiano Ronaldo C. Ronaldo dos Santos Aveiro \n", "1 158023 L. Messi Lionel Messi \n", "2 190871 Neymar Neymar da Silva Santos Jr. \n", "3 176580 L. Suárez Luis Suárez \n", "4 167495 M. Neuer Manuel Neuer \n", "\n", " club club_logo special age \\\n", "0 Real Madrid CF https://cdn.sofifa.org/18/teams/243.png 2228 32 \n", "1 FC Barcelona https://cdn.sofifa.org/18/teams/241.png 2158 30 \n", "2 Paris Saint-Germain https://cdn.sofifa.org/18/teams/73.png 2100 25 \n", "3 FC Barcelona https://cdn.sofifa.org/18/teams/241.png 2291 30 \n", "4 FC Bayern Munich https://cdn.sofifa.org/18/teams/21.png 1493 31 \n", "\n", " league birth_date height_cm ... prefers_cb \\\n", "0 Spanish Primera División 1985-02-05 185.0 ... False \n", "1 Spanish Primera División 1987-06-24 170.0 ... False \n", "2 French Ligue 1 1992-02-05 175.0 ... False \n", "3 Spanish Primera División 1987-01-24 182.0 ... False \n", "4 German Bundesliga 1986-03-27 193.0 ... False \n", "\n", " prefers_lb prefers_lwb prefers_ls prefers_lf prefers_lam prefers_lcm \\\n", "0 False False False False False False \n", "1 False False False False False False \n", "2 False False False False False False \n", "3 False False False False False False \n", "4 False False False False False False \n", "\n", " prefers_ldm prefers_lcb prefers_gk \n", "0 False False False \n", "1 False False False \n", "2 False False False \n", "3 False False False \n", "4 False False True \n", "\n", "[5 rows x 185 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fifa = pd.read_csv('./dataset/fifa_18_sample_data.csv')\n", "fifa.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "fifa['scaled_pac'] = whiten(fifa['pac'])\n", "fifa['scaled_dri'] = whiten(fifa['dri'])\n", "fifa['scaled_sho'] = whiten(fifa['sho'])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], 
"source": [ "from scipy.cluster.vq import vq\n", "\n", "cluster_centers, _ = kmeans(fifa[['scaled_pac', 'scaled_dri', 'scaled_sho']], 3)\n", "\n", "fifa['cluster_labels'], _ = vq(fifa[['scaled_pac', 'scaled_dri', 'scaled_sho']], cluster_centers)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "cluster_labels\n", "0 182\n", "1 457\n", "2 361\n", "Name: ID, dtype: int64\n", "cluster_labels\n", "0 63225.274725\n", "1 77297.592998\n", "2 62603.878116\n", "Name: eur_wage, dtype: float64\n" ] } ], "source": [ "# Print the size of the clusters\n", "print(fifa.groupby('cluster_labels')['ID'].count())\n", "\n", "# Print the mean value of wages in each cluster\n", "print(fifa.groupby('cluster_labels')['eur_wage'].mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### FIFA 18: what makes a complete player?\n", "The overall level of a player in FIFA 18 is defined by six characteristics: pace (```pac```), shooting (```sho```), passing (```pas```), dribbling (```dri```), defending (```def```), physical (```phy```)." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "fifa['scaled_def'] = whiten(fifa['def'])\n", "fifa['scaled_phy'] = whiten(fifa['phy'])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "scaled_features = ['scaled_pac', 'scaled_sho', 'scaled_pac', 'scaled_dri', 'scaled_def', 'scaled_phy']" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " scaled_pac scaled_sho scaled_pac scaled_dri scaled_def \\\n", "cluster_labels \n", "0 6.828114 5.475576 6.828114 8.579753 2.366369 \n", "1 5.491062 3.998012 5.491062 7.059324 3.858428 \n", "\n", " scaled_phy \n", "cluster_labels \n", "0 8.274933 \n", "1 9.109248 \n", "0 ['Cristiano Ronaldo' 'L. Messi' 'Neymar' 'L. Suárez' 'M. Neuer']\n", "1 ['T. 
Kroos' 'Sergio Ramos' 'G. Chiellini' 'L. Bonucci' 'J. Boateng']\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Create centroids with kmeans for 2 clusters\n", "cluster_centers, _ = kmeans(fifa[scaled_features], 2)\n", "\n", "# Assign cluster labels and print cluster centers\n", "fifa['cluster_labels'], _ = vq(fifa[scaled_features], cluster_centers)\n", "print(fifa.groupby('cluster_labels')[scaled_features].mean())\n", "\n", "# Plot cluster centers to visualize clusters\n", "fifa.groupby('cluster_labels')[scaled_features].mean().plot(legend=True, kind='bar')\n", "\n", "# Get the name column of first 5 players in each cluster\n", "for cluster in fifa['cluster_labels'].unique():\n", "    print(cluster, fifa[fifa['cluster_labels'] == cluster]['name'].values[:5])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }