{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n", "\n", "Author: [Yury Kashnitskiy](https://yorko.github.io). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#
 Topic 7. Unsupervised learning\n", "## Part 2. Clustering. k-Means\n", " \n", "**This notebook mostly demonstrates some applications of k-Means; for the theory, see [topic 7](https://mlcourse.ai/notebooks/blob/master/jupyter_english/topic07_unsupervised/topic7_pca_clustering.ipynb?flush_cache=true) of our course.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clustering NBA players" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's load the 2013 NBA players' statistics and look at the available features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "\n", "nba = pd.read_csv(\"../../data/nba_2013.csv\")\n", "nba.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "from sklearn.decomposition import PCA\n", "\n", "# Cluster players by all of their numeric season statistics\n", "kmeans = KMeans(n_clusters=5, random_state=1)\n", "numeric_cols = nba.select_dtypes(include=[np.number]).dropna(axis=1)\n", "kmeans.fit(numeric_cols)\n", "\n", "\n", "# Visualizing using PCA\n", "pca = PCA(n_components=2)\n", "res = pca.fit_transform(numeric_cols)\n", "plt.figure(figsize=(12,8))\n", "plt.scatter(res[:,0], res[:,1], c=kmeans.labels_, s=50, cmap='viridis')\n", "plt.title('PCA')\n", "\n", "# Visualizing using 2 features: Total points vs. Total assists\n", "plt.figure(figsize=(12,8))\n", "plt.scatter(nba['pts'], nba['ast'], \n", "            c=kmeans.labels_, s=50, cmap='viridis')\n", "plt.xlabel('Total points')\n", "plt.ylabel('Total assists')\n", "plt.title('Some interpretation');" ] },
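{ "cell_type": "markdown", "metadata": {}, "source": [ "We picked `n_clusters=5` above rather arbitrarily. A quick sanity check is the elbow heuristic: fit k-Means for several values of k and look at how the inertia (within-cluster sum of squared distances) decreases. The cell below is only a rough sketch of this idea on the same `numeric_cols` data; the range of k values tried is arbitrary." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A rough sketch of the elbow heuristic: a reasonable k is often taken\n", "# near the point where the inertia curve stops dropping sharply\n", "inertias = []\n", "ks = range(1, 11)\n", "for k in ks:\n", "    inertias.append(KMeans(n_clusters=k, random_state=1).fit(numeric_cols).inertia_)\n", "\n", "plt.figure(figsize=(12, 8))\n", "plt.plot(ks, inertias, marker='o')\n", "plt.xlabel('Number of clusters k')\n", "plt.ylabel('Inertia')\n", "plt.title('Elbow heuristic');" ] },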
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Compressing images with k-Means \n", "*not a popular technique*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.image as mpimg\n", "\n", "# Keep a single color channel so that we work with a 2D grayscale image\n", "img = mpimg.imread('../../img/woman.jpg')[..., 1]\n", "plt.figure(figsize = (20, 12))\n", "plt.axis('off')\n", "plt.imshow(img, cmap='gray');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import MiniBatchKMeans\n", "\n", "# Cluster pixel intensities into 3 groups and replace every pixel\n", "# with the center of its cluster\n", "X = img.reshape((-1, 1))\n", "k_means = MiniBatchKMeans(n_clusters=3)\n", "k_means.fit(X)\n", "values = k_means.cluster_centers_\n", "labels = k_means.labels_\n", "img_compressed = values[labels].reshape(img.shape)\n", "plt.figure(figsize = (20, 12))\n", "plt.axis('off')\n", "plt.imshow(img_compressed, cmap = 'gray');" ] },
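{ "cell_type": "markdown", "metadata": {}, "source": [ "The number of clusters controls the trade-off between visual quality and how much the palette shrinks: the compressed image keeps only `n_clusters` distinct gray levels. The cell below is just an illustrative sketch comparing a few arbitrary values of `n_clusters` on the same image." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch: compress the same image with different numbers of\n", "# clusters to see the quality / palette-size trade-off\n", "plt.figure(figsize=(20, 12))\n", "for i, k in enumerate([2, 3, 8], start=1):\n", "    mbk = MiniBatchKMeans(n_clusters=k, random_state=1).fit(X)\n", "    compressed = mbk.cluster_centers_[mbk.labels_].reshape(img.shape)\n", "    plt.subplot(1, 3, i)\n", "    plt.axis('off')\n", "    plt.title('%d clusters' % k)\n", "    plt.imshow(compressed, cmap='gray')" ] },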
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Finding latent topics in texts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**We'll apply k-Means to cluster texts from 4 newsgroup categories.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import time\n", "\n", "from sklearn import metrics\n", "from sklearn.datasets import fetch_20newsgroups\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "categories = [\n", "    'alt.atheism',\n", "    'talk.religion.misc',\n", "    'comp.graphics',\n", "    'sci.space']\n", "\n", "print(\"Loading 20 newsgroups dataset for categories:\")\n", "print(categories)\n", "\n", "dataset = fetch_20newsgroups(subset='all', categories=categories,\n", "                             shuffle=True, random_state=42)\n", "\n", "print(\"%d documents\" % len(dataset.data))\n", "print(\"%d categories\" % len(dataset.target_names))\n", "\n", "labels = dataset.target\n", "true_k = np.unique(labels).shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Build Tf-Idf features for the texts**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Extracting features from the training dataset using a sparse vectorizer\")\n", "vectorizer = TfidfVectorizer(max_df=0.5, max_features=1000,\n", "                             min_df=2, stop_words='english')\n", "\n", "X = vectorizer.fit_transform(dataset.data)\n", "print(\"n_samples: %d, n_features: %d\" % X.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Apply k-Means to the vectors that we've got and calculate clustering metrics.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "km = KMeans(n_clusters=true_k, init='k-means++', \n", "            max_iter=100, n_init=1)\n", "\n", "print(\"Clustering sparse data with %s\" % km)\n", "t0 = time()\n", "km.fit(X)\n", "print(\"done in %0.3fs\" % (time() - t0))\n", "\n", "print(\"Homogeneity: %0.3f\" % metrics.homogeneity_score(labels, km.labels_))\n", "print(\"Completeness: %0.3f\" % metrics.completeness_score(labels, km.labels_))\n", "print(\"V-measure: %0.3f\" % metrics.v_measure_score(labels, km.labels_))\n", "print(\"Adjusted Rand-Index: %.3f\"\n", "      % metrics.adjusted_rand_score(labels, km.labels_))\n", "print(\"Silhouette Coefficient: %0.3f\"\n", "      % metrics.silhouette_score(X, km.labels_, sample_size=1000))\n", "\n", "order_centroids = km.cluster_centers_.argsort()[:, ::-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Output the words that are closest to each cluster center**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Top 10 terms per cluster, sorted by their weight in the cluster center\n", "terms = vectorizer.get_feature_names()\n", "for i in range(true_k):\n", "    print(\"Cluster %d:\" % (i + 1), end='')\n", "    for ind in order_centroids[i, :10]:\n", "        print(' %s' % terms[ind], end='')\n", "    print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clustering handwritten digits" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_digits\n", "\n", "digits = load_digits()\n", "\n", "X, y = digits.data, digits.target" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kmeans = KMeans(n_clusters=10, random_state=1)\n", "kmeans.fit(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import adjusted_rand_score\n", "\n", "# Compare cluster assignments with the true digit labels\n", "adjusted_rand_score(y, kmeans.predict(X))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Show the 10 cluster centers as 8x8 images\n", "_, axes = plt.subplots(2, 5)\n", "for ax, center in zip(axes.ravel(), kmeans.cluster_centers_):\n", "    ax.matshow(center.reshape(8, 8), cmap=plt.cm.gray)\n", "    ax.set_xticks(())\n", "    ax.set_yticks(())" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" }, "name": "lesson8_part1_kmeans.ipynb" }, "nbformat": 4, "nbformat_minor": 1 }