{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Identifying household power consumption usage profile\n", "\n", "### Objective\n", "\n", "Extract patterns in the daily load profiles of a single household using the [k-means clustering](https://en.wikipedia.org/wiki/K-means_clustering) algorithm.\n", "\n", "### Learning objective\n", "\n", "After finishing this notebook, you should be able to explain the **k-means clustering** algorithm, including how to use the scikit-learn implementation. \n", "\n", "### Individual household electric power consumption data set\n", "\n", "\n", "**Description**: \n", "\n", "Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.\n", "\n", "**Data set information**:\n", "\n", "* It contains 2075259 measurements gathered in a house located in Sceaux (7 km from Paris, France) between December 2006 and November 2010 (47 months).\n", "\n", "**Notes**:\n", "\n", "1. `global_active_power*1000/60 - sub_metering_1 - sub_metering_2 - sub_metering_3` represents the active energy consumed every minute (in watt-hour) in the household by electrical equipment not measured in sub-meterings 1, 2 and 3.\n", "\n", "2. The dataset contains some missing values in the measurements (nearly 1.25% of the rows). All calendar timestamps are present in the dataset, but for some timestamps the measurement values are missing: a missing value is represented by the absence of a value between two consecutive semicolon attribute separators. For instance, the dataset shows missing values on April 28, 2007.\n", "\n", "\n", "**Attribute information:**\n", "\n", " 1. **date**: date in format dd/mm/yyyy\n", " 2. **time**: time in format hh:mm:ss\n", " 3. **global_active_power**: household global minute-averaged active power (in kilowatt)\n", " 4. 
**global_reactive_power**: household global minute-averaged reactive power (in kilowatt)\n", " 5. **voltage**: minute-averaged voltage (in volt)\n", " 6. **global_intensity**: household global minute-averaged current intensity (in ampere)\n", " 7. **sub_metering_1**: energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).\n", " 8. **sub_metering_2**: energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light.\n", " 9. **sub_metering_3**: energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner.\n", "\n", "\n", "**Source**: https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "assert sys.version_info >= (3, 6)\n", "\n", "import numpy\n", "assert numpy.__version__ >= \"1.17.3\"\n", "import numpy as np\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "import pandas\n", "assert pandas.__version__ >= \"0.25.1\"\n", "import pandas as pd\n", "\n", "import sklearn\n", "assert sklearn.__version__ >= \"0.21.3\"\n", "\n", "from sklearn import datasets\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. 
Load the data set" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "household_pc = None\n", "household_pc.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "household_pc.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "household_power_consumption = household_pc.iloc[:, 2:9].dropna()\n", "household_power_consumption.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X = household_power_consumption.values\n", "X_train, X_test = train_test_split(X, train_size=0.01, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Reduce the number of dimensions using principal component analysis (PCA)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import decomposition\n", "\n", "# compute the two principal components\n", "pca = None\n", "\n", "pca.fit(X_train)\n", "\n", "X_projected = None \n", "\n", "print(pca.explained_variance_ratio_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. 
Compute the clusters using the k-means algorithm\n", "\n", "Scikit-learn provides a k-means implementation through the `sklearn.cluster.KMeans` class." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import cluster\n", "\n", "kmeans = None\n", "kmeans.fit(X_projected)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_clusters_map(X, cluster_model):\n", " \n", " x_min, x_max = X[:, 0].min() - 5, X[:, 0].max() - 1\n", " y_min, y_max = X[:, 1].min(), X[:, 1].max() + 5\n", " \n", " xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))\n", " Z = cluster_model.predict(np.c_[xx.ravel(), yy.ravel()])\n", " Z = Z.reshape(xx.shape)\n", " \n", " plt.figure(1)\n", " plt.clf()\n", " plt.imshow(Z, \n", " interpolation='nearest',\n", " extent=(xx.min(), xx.max(), yy.min(), yy.max()),\n", " cmap=plt.cm.Paired,\n", " aspect='auto', origin='lower')\n", " \n", " plt.plot(X[:, 0], X[:, 1], 'k.', markersize=4)\n", " centroids = cluster_model.cluster_centers_\n", " inert = cluster_model.inertia_\n", " plt.scatter(centroids[:, 0], \n", " centroids[:, 1],\n", " marker='x', s=169, linewidths=3,\n", " color='w', \n", " zorder=8)\n", " \n", " plt.xlim(x_min, x_max)\n", " plt.ylim(y_min, y_max)\n", " plt.xticks(())\n", " plt.yticks(());" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_clusters_map(X_projected, kmeans)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. 
Visualizing the **variance explained** as a function of the number of clusters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.spatial.distance import cdist, pdist\n", "\n", "# Candidate numbers of clusters\n", "k_range = range(1, 14)\n", "\n", "# Fit the k-means clustering model for each number of clusters.\n", "kmeans_var = [None.fit(X_projected) for k in k_range] \n", "\n", "# Get the centers for each cluster model\n", "centroids = [model.cluster_centers_ for model in kmeans_var]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculate the Euclidean distance from each point to each cluster center." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "k_euclidean = [cdist(X_projected, cent, 'euclidean') for cent in centroids]\n", "\n", "# Distance from each point to its nearest cluster center\n", "distances = [np.min(ke, axis=1) for ke in k_euclidean]\n", "\n", "# Total within-cluster sum of squares\n", "wcss = [sum(d**2) for d in distances]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute the total sum of squares\n", "tss = np.sum(pdist(X_projected)**2) / X_projected.shape[0]\n", "\n", "# Compute the between-cluster sum of squares\n", "bss = tss - wcss" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the curve of the variance explained as a function of the number of clusters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "ax.plot(k_range, bss/tss*100, 'b*-')\n", "ax.set_ylim((0,100))\n", "plt.grid(True)\n", "plt.xlabel('n_clusters')\n", "plt.ylabel('Percentage of variance explained')\n", "plt.title('Variance Explained vs. 
# of clusters (k)');" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 }