{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Identifying household power consumption usage profile\n",
    "\n",
    "### Objective\n",
    "\n",
    "Extract patterns in the daily load profiles of a single-household using the [k-means clustering](https://en.wikipedia.org/wiki/K-means_clustering) algorithm.\n",
    "\n",
    "### Learning objective\n",
    "\n",
    "After finished this notebook, you should be able to explain **k-means clustering** algorithm, including how to use the scikit-learn implementation. \n",
    "\n",
    "### Individual household electric power consumption data set\n",
    "\n",
    "\n",
    "**Description**: \n",
    "\n",
    "Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.\n",
    "\n",
    "**Data set information**:\n",
    "\n",
    "* It contains 2075259 measurements gathered in a house located in Sceaux (7km of Paris, France) between December 2006 and November 2010 (47 months).\n",
    "\n",
    "**Notes**:\n",
    "\n",
    "1. (global_active_power*1000/60 - sub_metering_1 - sub_metering_2 - sub_metering_3) represents the active energy consumed every minute (in watt hour) in the household by electrical equipment not measured in sub-meterings 1, 2 and 3.\n",
    "\n",
    "2. The dataset contains some missing values in the measurements (nearly 1,25% of the rows). All calendar timestamps are present in the dataset but for some timestamps, the measurement values are missing: a missing value is represented by the absence of value between two consecutive semi-colon attribute separators. For instance, the dataset shows missing values on April 28, 2007.\n",
    "\n",
    "\n",
    "**Attribute information:**\n",
    "\n",
    "  1. **date**: date in format dd/mm/yyyy\n",
    "  2. **time**: time in format hh:mm:ss\n",
    "  3. **global_active_power**: household global minute-averaged active power (in kilowatt)\n",
    "  4. **global_reactive_power**: household global minute-averaged reactive power (in kilowatt)\n",
    "  5. **voltage**: minute averaged voltage (in volt)\n",
    "  6. **global_intensity**: household global minute-averaged current intensity (in ampere)\n",
    "  7. **sub_metering_1**: energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).\n",
    "  8. **sub_metering_2**: energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light.\n",
    "  9. **sub_metering_3**: energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner.\n",
    "\n",
    "\n",
    "**Source**: https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "assert sys.version_info >= (3, 6)\n",
    "\n",
    "import numpy\n",
    "assert numpy.__version__ >=\"1.17.3\" \n",
    "import numpy as np\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "import pandas\n",
    "assert pandas.__version__ >= \"0.25.1\"\n",
    "import pandas as pd\n",
    "\n",
    "import sklearn\n",
    "assert sklearn.__version__ >= \"0.21.3\"\n",
    "\n",
    "from sklearn import datasets\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Load the data set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "household_pc = None\n",
    "household_pc.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "household_pc.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "household_power_consumption = household_pc.iloc[0:, 2:9].dropna()\n",
    "household_power_consumption.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X = household_power_consumption.values\n",
    "X_train, X_test = train_test_split(X, train_size=.01, random_state = 42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Reduce the number of dimensions using the principal components analysis (PCA)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import decomposition\n",
    "\n",
    "# compute the two principal components\n",
    "pca = None\n",
    "\n",
    "pca.fit(X_train)\n",
    "\n",
    "X_projected = None \n",
    "\n",
    "print(pca.explained_variance_ratio_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Compute the number of clusters through k-means algorithm.\n",
    "\n",
    "In scikit-learn provides a k-means implementation through the `sklearn.cluster.KMeans`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import cluster\n",
    "\n",
    "kmeans = None\n",
    "kmeans.fit(X_projected)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_clusters_map(X, cluster_model):\n",
    "    \n",
    "    x_min, x_max = X[:, 0].min() - 5, X[:, 0].max() - 1\n",
    "    y_min, y_max = X[:, 1].min(), X[:, 1].max() + 5\n",
    "    \n",
    "    xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))\n",
    "    Z = cluster_model.predict(np.c_[xx.ravel(), yy.ravel()])\n",
    "    Z = Z.reshape(xx.shape)\n",
    "    \n",
    "    plt.figure(1)\n",
    "    plt.clf()\n",
    "    plt.imshow(Z, \n",
    "               interpolation='nearest',\n",
    "               extent=(xx.min(), xx.max(), yy.min(), yy.max()),\n",
    "               cmap=plt.cm.Paired,\n",
    "               aspect='auto', origin='lower')\n",
    "    \n",
    "    plt.plot(X[:, 0], X[:, 1], 'k.', markersize=4)\n",
    "    centroids = cluster_model.cluster_centers_\n",
    "    inert = cluster_model.inertia_\n",
    "    plt.scatter(centroids[:, 0], \n",
    "                centroids[:, 1],\n",
    "                marker='x', s=169, linewidths=3,\n",
    "                color='w', \n",
    "                zorder=8)\n",
    "    \n",
    "    plt.xlim(x_min, x_max)\n",
    "    plt.ylim(y_min, y_max)\n",
    "    plt.xticks(())\n",
    "    plt.yticks(());"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_clusters_map(X_projected, kmeans)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Visualizing the **variance explained** in function of the number of clusters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.spatial.distance import cdist, pdist\n",
    "\n",
    "# Create a set of clusters\n",
    "k_range = range(1, 14)\n",
    "\n",
    "# Fit the kmeans clustering model for each number of cluster.\n",
    "kmeans_var = [None.fit(X_projected) for k in k_range] \n",
    "\n",
    "# Get the centers for each cluster model\n",
    "centroids = [X.cluster_centers_ for X in kmeans_var]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calculate the Euclidean distance from each point to each cluster center"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "k_euclidean = [cdist(X_projected, cent, 'euclidean') for cent in centroids]\n",
    "\n",
    "distances = [np.min(ke, axis=1) for ke in k_euclidean]\n",
    "\n",
    "# Total within-cluster sum of squares\n",
    "wcss = [sum(d**2) for d in distances]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute the total sum of squares\n",
    "tss = np.sum(pdist(X_projected)**2) / X_projected.shape[0]\n",
    "\n",
    "# Compute the sum of squares difference between the clusters\n",
    "bss = tss - wcss"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot the curve of the variance explained in function of the number of clusters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure()\n",
    "ax = fig.add_subplot(111)\n",
    "ax.plot(k_range, bss/tss*100, 'b*-')\n",
    "ax.set_ylim((0,100))\n",
    "plt.grid(True)\n",
    "plt.xlabel('n_clusters')\n",
    "plt.ylabel('Percentage of variance explained')\n",
    "plt.title('Variance Explained vs. # of cluster (k)');"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}