{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "

Machine Learning Methods

\n", "

Clustering

" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/andrey.shestakov/anaconda3/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88\n", " return f(*args, **kwds)\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "plt.style.use('ggplot')\n", "plt.rcParams['figure.figsize'] = (12,8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Applying K-means" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the [data](https://github.com/brenden17/sklearnlab/blob/master/facebook/snsdata.csv), which describe the interests listed in the profiles of US high school students." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " gradyear gender age friends basketball football soccer softball \\\n", "0 2006 M 18.982 7 0 0 0 0 \n", "1 2006 F 18.801 0 0 1 0 0 \n", "2 2006 M 18.335 69 0 1 0 0 \n", "3 2006 F 18.875 0 0 0 0 0 \n", "4 2006 NaN 18.995 10 0 0 0 0 \n", "\n", " volleyball swimming ... blonde mall shopping clothes hollister \\\n", "0 0 0 ... 0 0 0 0 0 \n", "1 0 0 ... 0 1 0 0 0 \n", "2 0 0 ... 0 0 0 0 0 \n", "3 0 0 ... 0 0 0 0 0 \n", "4 0 0 ... 0 0 2 0 0 \n", "\n", " abercrombie die death drunk drugs \n", "0 0 0 0 0 0 \n", "1 0 0 0 0 0 \n", "2 0 0 1 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 0 1 1 \n", "\n", "[5 rows x 40 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_sns = pd.read_csv('data/snsdata.csv', sep=',')\n", "df_sns.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data are organized as follows: \n", "* Graduation year\n", "* Gender\n", "* Age\n", "* Number of friends\n", "* 36 keywords that appear in the Facebook profile (interests, communities, meetups)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['basketball', 'football', 'soccer', 'softball', 'volleyball',\n", " 'swimming', 'cheerleading', 'baseball', 'tennis', 'sports', 'cute',\n", " 'sex', 'sexy', 'hot', 'kissed', 'dance', 'band', 'marching',\n", " 'music', 'rock', 'god', 'church', 'jesus', 'bible', 'hair',\n", " 'dress', 'blonde', 'mall', 'shopping', 'clothes', 'hollister',\n", " 'abercrombie', 'die', 'death', 'drunk', 'drugs'], dtype=object)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_sns.columns[4:].values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Task\n", "\n", "* Drop all features except the 36 keywords.\n", "* Normalize the data: subtract each column's mean and divide by its standard deviation.\n", "* Use the k-means method to identify 9 
clusters\n", "* Try to interpret each cluster by analyzing the resulting centroids (some clusters may turn out very large or very small and are hard to interpret)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "from sklearn.preprocessing import StandardScaler" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "X = df_sns.iloc[:, 4:].values" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/andrey.shestakov/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py:590: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.\n", " warnings.warn(msg, DataConversionWarning)\n", "/Users/andrey.shestakov/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py:590: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.\n", " warnings.warn(msg, DataConversionWarning)\n" ] } ], "source": [ "# standardize the data (zero mean, unit variance per column)\n", "scaler = StandardScaler()\n", "X_ = scaler.fit_transform(X)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,\n", " n_clusters=9, n_init=10, n_jobs=None, precompute_distances='auto',\n", " random_state=123, tol=0.0001, verbose=0)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# apply k-means with k=9\n", "kmeans = KMeans(n_clusters=9, random_state=123)\n", "kmeans.fit(X_)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "labels = kmeans.labels_ # cluster labels for the objects in X" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "centroids = 
kmeans.cluster_centers_ # centroid coordinates\n", "criterion = kmeans.inertia_ # value of the criterion (within-cluster sum of squared distances) for this partition" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "861745.6454158238" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "criterion" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(9, 36)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "centroids.shape" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "df_sns.loc[:, 'cluster_label'] = labels" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "4 20024\n", "0 5036\n", "1 1337\n", "6 846\n", "8 841\n", "2 752\n", "3 697\n", "7 466\n", "5 1\n", "Name: cluster_label, dtype: int64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_sns.cluster_label.value_counts()" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==========\n", "Cluster 0\n", "music 1.066521\n", "dance 1.051033\n", "shopping 0.890191\n", "cute 0.828435\n", "basketball 0.722597\n", "hair 0.691223\n", "mall 0.652502\n", "football 0.617752\n", "god 0.573272\n", "church 0.496426\n", "dtype: float64\n", "==========\n", "Cluster 1\n", "drunk 1.409873\n", "music 0.707554\n", "hair 0.629020\n", "god 0.522064\n", "dance 0.439043\n", "cute 0.384443\n", "sex 0.380703\n", "shopping 0.326103\n", "mall 0.287210\n", "die 0.275991\n", "dtype: float64\n", "==========\n", "Cluster 2\n", "band 4.105053\n", "marching 1.418883\n", "music 1.215426\n", "god 0.505319\n", "dance 0.464096\n", "hair 0.371011\n", "rock 0.344415\n", "shopping 0.289894\n", "football 0.275266\n", "cute 0.275266\n", "dtype: float64\n", 
"==========\n", "Cluster 3\n", "soccer 4.901004\n", "music 0.773314\n", "shopping 0.499283\n", "god 0.469154\n", "hair 0.440459\n", "basketball 0.428981\n", "dance 0.398852\n", "football 0.397418\n", "cute 0.337159\n", "church 0.321377\n", "dtype: float64\n", "==========\n", "Cluster 4\n", "music 0.554035\n", "god 0.311626\n", "dance 0.230423\n", "hair 0.192419\n", "shopping 0.181632\n", "cute 0.162855\n", "band 0.156962\n", "rock 0.152867\n", "football 0.136187\n", "church 0.135238\n", "dtype: float64\n", "==========\n", "Cluster 5\n", "blonde 327.0\n", "sex 22.0\n", "hair 12.0\n", "god 10.0\n", "death 6.0\n", "die 6.0\n", "drunk 6.0\n", "football 2.0\n", "dress 2.0\n", "sexy 1.0\n", "dtype: float64\n", "==========\n", "Cluster 6\n", "hair 3.475177\n", "sex 2.760047\n", "music 2.374704\n", "kissed 1.874704\n", "die 1.269504\n", "rock 1.257683\n", "drugs 1.076832\n", "dance 1.005910\n", "god 0.964539\n", "clothes 0.812057\n", "dtype: float64\n", "==========\n", "Cluster 7\n", "god 4.725322\n", "church 2.180258\n", "jesus 2.049356\n", "music 1.066524\n", "bible 0.972103\n", "hair 0.448498\n", "dance 0.427039\n", "band 0.407725\n", "shopping 0.396996\n", "die 0.371245\n", "dtype: float64\n", "==========\n", "Cluster 8\n", "hollister 1.512485\n", "abercrombie 1.173603\n", "shopping 0.932224\n", "music 0.909631\n", "hair 0.897741\n", "dance 0.693222\n", "mall 0.673008\n", "cute 0.612366\n", "god 0.474435\n", "clothes 0.424495\n", "dtype: float64\n" ] } ], "source": [ "for k, group in df_sns.groupby('cluster_label'):\n", " print('='*10)\n", " print('Cluster {}'.format(k))\n", " \n", " top_words = group.iloc[:, 4:-1].mean()\\\n", " .sort_values(ascending=False)\\\n", " .head(10)\n", " print(top_words)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Nutritional value of foods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the file `food.txt`. 
It contains information on the nutritional value of various foods" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "# \"Name\" is the name of the item.\n", "#\n", "# \"Energy\" is the number of calories.\n", "#\n", "# \"Protein\" is the amount of protein in grams.\n", "#\n", "# \"Fat\" is the amount of fat in grams.\n", "#\n", "# \"Calcium\" is the amount of calcium in milligrams.\n", "#\n", "# \"Iron\" is the amount of iron in milligrams." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Prepare the data for clustering and perform hierarchical clustering on this dataset.\n", "* Plot the dendrogram\n", "* Choose the number of clusters and interpret them\n", "\n", "Why do the features need to be normalized before clustering?" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name Energy Protein Fat Calcium Iron\n", "0 Braised beef 340 20 28 9 2.6\n", "1 Hamburger 245 21 17 9 2.7\n", "2 Roast beef 420 15 39 7 2.0\n", "3 Beefsteak 375 19 32 9 2.6\n", "4 Canned beef 180 22 10 17 3.7" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('data/food.txt', sep=' ')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "from scipy.cluster.hierarchy import linkage, dendrogram, fcluster" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "X = df.iloc[:, 1:].values" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "scaler = StandardScaler()\n", "X_ = scaler.fit_transform(X)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(27, 5)" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_.shape" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "Z = linkage(X_, method='average', metric='euclidean')" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "names = df.Name.values\n", "dend = dendrogram(Z, color_threshold=0, labels=names, \n", " orientation='left')" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "t = 2.3\n", "labels = fcluster(Z, t, criterion='distance')\n", "# labels = fcluster(Z, t, criterion='maxclust')" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "df.loc[:, 'label'] = labels" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name Energy Protein Fat Calcium Iron label\n", "0 Braised beef 340 20 28 9 2.6 2\n", "1 Hamburger 245 21 17 9 2.7 4\n", "2 Roast beef 420 15 39 7 2.0 2\n", "3 Beefsteak 375 19 32 9 2.6 2\n", "4 Canned beef 180 22 10 17 3.7 4" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==========\n", "Cluster 1\n", " Name Energy Protein Fat Calcium Iron label\n", "16 Raw clams 70 11 1 82 6.0 1\n", "17 Canned clams 45 7 1 74 5.4 1\n", "==========\n", "Cluster 2\n", " Name Energy Protein Fat Calcium Iron label\n", "0 Braised beef 340 20 28 9 2.6 2\n", "2 Roast beef 420 15 39 7 2.0 2\n", "3 Beefsteak 375 19 32 9 2.6 2\n", "9 Roast lamb shoulder 300 18 25 9 2.3 2\n", "10 Smoked ham 340 20 28 9 2.5 2\n", "11 Pork roast 340 19 29 9 2.5 2\n", "12 Pork simmered 355 19 30 9 2.4 2\n", "==========\n", "Cluster 3\n", " Name Energy Protein Fat Calcium Iron label\n", "21 Canned mackerel 155 16 9 157 1.8 3\n", "23 Canned salmon 120 17 5 159 0.7 3\n", "==========\n", "Cluster 4\n", " Name Energy Protein Fat Calcium Iron label\n", "1 Hamburger 245 21 17 9 2.7 4\n", "4 Canned beef 180 22 10 17 3.7 4\n", "5 Broiled chicken 115 20 3 8 1.4 4\n", "6 Canned chicken 170 25 7 12 1.5 4\n", "8 Roast lamb leg 265 20 20 9 2.6 4\n", "13 Beef tongue 205 18 14 7 2.5 4\n", "14 Veal cutlet 185 23 9 9 2.7 4\n", "15 Baked bluefish 135 22 4 25 0.6 4\n", "18 Canned crabmeat 90 14 2 38 0.8 4\n", "19 Fried haddock 135 16 5 15 0.5 4\n", "20 Broiled mackerel 200 19 13 5 1.0 4\n", "22 Fried perch 195 16 11 14 1.3 4\n", "25 Canned tuna 170 25 7 7 1.2 4\n", "26 Canned shrimp 110 23 1 98 2.6 4\n", "==========\n", "Cluster 5\n", " Name Energy Protein Fat Calcium Iron label\n", "7 Beef heart 160 26 5 14 5.9 5\n", "==========\n", "Cluster 6\n", " Name Energy Protein Fat Calcium Iron label\n", "24 Canned sardines 
180 22 9 367 2.5 6\n" ] } ], "source": [ "for k, group in df.groupby('label'):\n", " print('='*10)\n", " print('Cluster {}'.format(k))\n", " print(group)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" }, "toc": { "base_numbering": 1, "nav_menu": { "height": "142px", "width": "252px" }, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }