{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Unsupervised Learning Part 2 -- Clustering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Clustering is the task of gathering samples into groups of similar\n",
"samples according to some predefined similarity or distance (dissimilarity)\n",
"measure, such as the Euclidean distance.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section we will explore a basic clustering task on some synthetic and real-world datasets.\n",
"\n",
"Here are some common applications of clustering algorithms:\n",
"\n",
"- Compression for data reduction\n",
"- Summarizing data as a preprocessing step for recommender systems\n",
"- Similarly:\n",
"    - grouping related web news (e.g. Google News) and web search results\n",
"    - grouping related stock quotes for investment portfolio management\n",
"    - building customer profiles for market analysis\n",
"- Building a code book of prototype samples for unsupervised feature extraction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start by creating a simple, 2-dimensional, synthetic dataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_blobs\n",
"\n",
"X, y = make_blobs(random_state=42)\n",
"X.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(8, 8))\n",
"plt.scatter(X[:, 0], X[:, 1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the scatter plot above, we can see three separate groups of data points and we would like to recover them using clustering -- think of \"discovering\" the class labels that we already take for granted in a classification task.\n",
"\n",
"Groups that are obvious in a two-dimensional plot like this one are much harder to find when the data lives in a high-dimensional space, which we can't visualize in a single histogram or scatter plot."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we will use one of the simplest clustering algorithms, K-means.\n",
"This is an iterative algorithm that searches for a fixed number of cluster\n",
"centers (three in our example) such that the distance from each point to its\n",
"nearest cluster center is minimized. The standard implementation of K-means uses the Euclidean distance, which is why we want to make sure that all our variables are measured on the same scale when we work with real-world datasets. In the previous notebook, we talked about one technique to achieve this, namely, standardization.\n",
"\n",
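"The two steps that K-means alternates between can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: `kmeans_sketch`, its parameters, and the `X_demo` toy data are hypothetical names introduced here, and starting from randomly chosen samples is an assumption (library implementations typically use more careful seeding).\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"\n",
"def kmeans_sketch(X, n_clusters=3, n_iter=100, seed=0):\n",
"    # Hypothetical helper for illustration only.\n",
"    rng = np.random.default_rng(seed)\n",
"    # Assumption: start from n_clusters distinct samples chosen at random.\n",
"    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]\n",
"    for _ in range(n_iter):\n",
"        # Assignment step: Euclidean distance from every sample to every center.\n",
"        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)\n",
"        labels = dists.argmin(axis=1)\n",
"        # Update step: move each center to the mean of its assigned samples\n",
"        # (a center without any samples stays where it is).\n",
"        new_centers = np.array([\n",
"            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]\n",
"            for k in range(n_clusters)\n",
"        ])\n",
"        if np.allclose(new_centers, centers):\n",
"            break  # centers stopped moving\n",
"        centers = new_centers\n",
"    # Final assignment so that the labels match the returned centers.\n",
"    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)\n",
"    return centers, dists.argmin(axis=1)\n",
"\n",
"\n",
"# Toy data (assumed for this sketch): three Gaussian blobs in 2D.\n",
"rng = np.random.default_rng(1)\n",
"X_demo = np.vstack([rng.normal(loc, 0.5, size=(50, 2))\n",
"                    for loc in ([0, 0], [5, 5], [0, 5])])\n",
"centers, labels = kmeans_sketch(X_demo, n_clusters=3)\n",
"print(centers.shape, labels.shape)\n",
"```\n",
"\n",
"Calling `kmeans_sketch(X)` on the blobs dataset created above should work the same way. Because the result depends on the random starting centers, different seeds can produce different clusterings.\n",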
"\n",
"