{
"metadata": {
"name": "",
"signature": "sha256:1cefe61c75c7b2978f2dc0da0eb34fe73f53ce7d9aa412cde12cefd0b8ffaf13"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Chapter 10 - Learning without Supervision"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-- *A Python Course for the Humanities*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"In the previous chapter we have developed a system that on the basis of examples attempts to learn a function to classify new, unseen examples. Not always do we have the luxury of a labeled data set. In fact, most of the time only unlabeled data is available. In unsupervized machine learning, or learning without supervision, we attempt the create systems that detect patterns in our data, such as groupings or clusters. Given a collection of texts, we could for example try to measure the pairwise distances between all texts and given these distances construct a grouping of the texts. Another example of unsupervized learning is the popular method of *Topic Modeling* in which we attempt to find clusters of semantically coherent words that together form a topic. \n",
"\n",
"In this chapter we will introduce you to some of the techniques to cluster you data without supervision. As is the case with supervized learning, there are many different approaches to clustering. We will discuss one of the most popular ones: hierachical agglomerative clustering. We will develop a general hierarchical cluster module and implement a number of different cluster procedures. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Hierarchical Clustering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All cluster techniques follow a similar procedure. We start with a data set consisting of $n$ different data points. The end state will be a data set with $k$ clusters each consisting of a number of original data points. The cluster procedure iteratively moves through all data points and assigns each data point to a particular cluster. Cluster techniques differ with respect to how the merging of data points happens. In this section we will look at hierarchical clustering, which is a clustering method that builds hierarchies of clusters. Typically hierarchical clustering techniques construct a so-called dendrogram, which has a top or root node covering all other data points. The leaf nodes of the dendrogram, or cluster tree, consists of all original data points. If we think of the original datapoints as singleton clusters, hierarchical clustering is an iterative procedure in which in each iteration two clusters are merged into a new cluster. This process is repeated until we arrive at the root node.\n",
"\n",
"At each iteration, the two clusters with the highest similarity combined. The definition of what counts as being similar is what differentiates between the different hierarchical clustering methods. One popular clustering method is **single-linkage clustering**. Mathematically, the single linkage function \u2013 the distance $D(X,Y)$ between clusters $X$ and $Y$ \u2013 is described by the expression\n",
"\n",
"$$D(X,Y)=\\min_{x\\in X, y\\in Y} d(x,y),$$\n",
"\n",
"where $X$ and $Y$ are any two clusters, and $d(x,y)$ represents the distance between the two data points $x$ and $y$. The two clusters $A$ and $B$ of which $D(A,B)$ is the smallest are merged into a new cluster $C$.\n",
"\n",
"Let's have a look at a small example to make this all a little more concrete. The following table presents a data set of apples. Each apple is described according to a number of different features:\n",
"\n",
"\n",
"
\n", " | quality | \n", "color | \n", "firmness | \n", "taste | \n", "
---|---|---|---|---|
0 | \n", "bad | \n", "red | \n", "firm | \n", "sweet | \n", "
1 | \n", "bad | \n", "red | \n", "firm | \n", "sweet | \n", "
2 | \n", "bad | \n", "yellow | \n", "firm | \n", "sour | \n", "
3 | \n", "good | \n", "red | \n", "soft | \n", "sour | \n", "
4 | \n", "good | \n", "yellow | \n", "soft | \n", "sweet | \n", "
5 | \n", "bad | \n", "yellow | \n", "firm | \n", "sour | \n", "
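{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the single-linkage function concrete, the following cell sketches how $D(X,Y)$ could be computed for the apples above. The tuple encoding of the features and the mismatch-count distance $d(x,y)$ are our own illustrative assumptions here, not the chapter's final implementation."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Each apple as a tuple of categorical features: (quality, color, firmness, taste).\n",
"apples = [\n",
"    ('bad', 'red', 'firm', 'sweet'),      # 0\n",
"    ('bad', 'red', 'firm', 'sweet'),      # 1\n",
"    ('bad', 'yellow', 'firm', 'sour'),    # 2\n",
"    ('good', 'red', 'soft', 'sour'),      # 3\n",
"    ('good', 'yellow', 'soft', 'sweet'),  # 4\n",
"    ('bad', 'yellow', 'firm', 'sour')]    # 5\n",
"\n",
"def distance(x, y):\n",
"    # d(x, y): the number of features on which two apples disagree.\n",
"    return sum(1 for a, b in zip(x, y) if a != b)\n",
"\n",
"def single_linkage(X, Y):\n",
"    # D(X, Y): the smallest pairwise distance between members of X and Y.\n",
"    return min(distance(x, y) for x in X for y in Y)\n",
"\n",
"print(single_linkage([apples[0], apples[1]], [apples[3], apples[4]]))  # prints 3"
],
"language": "python",
"metadata": {},
"outputs": []
},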
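{
"cell_type": "markdown",
"metadata": {},
"source": [
"The iterative merging procedure itself can be sketched as a loop that repeatedly merges the two closest clusters until only the root cluster is left. This is a simplified, self-contained sketch using a mismatch-count distance of our own choosing; the general cluster module developed later in this chapter is more flexible."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Self-contained sketch of single-linkage agglomerative clustering.\n",
"apples = [\n",
"    ('bad', 'red', 'firm', 'sweet'), ('bad', 'red', 'firm', 'sweet'),\n",
"    ('bad', 'yellow', 'firm', 'sour'), ('good', 'red', 'soft', 'sour'),\n",
"    ('good', 'yellow', 'soft', 'sweet'), ('bad', 'yellow', 'firm', 'sour')]\n",
"\n",
"def distance(x, y):\n",
"    # d(x, y): the number of features on which two data points disagree.\n",
"    return sum(1 for a, b in zip(x, y) if a != b)\n",
"\n",
"def single_linkage(X, Y):\n",
"    # D(X, Y): the smallest pairwise distance between members of X and Y.\n",
"    return min(distance(x, y) for x in X for y in Y)\n",
"\n",
"# Start with every data point in its own (singleton) cluster.\n",
"clusters = [[apple] for apple in apples]\n",
"while len(clusters) > 1:\n",
"    # Find the pair of clusters with the smallest single-linkage distance.\n",
"    d, i, j = min((single_linkage(clusters[i], clusters[j]), i, j)\n",
"                  for i in range(len(clusters))\n",
"                  for j in range(i + 1, len(clusters)))\n",
"    print('merging clusters', i, 'and', j, 'at distance', d)\n",
"    # Merge the two closest clusters into a new cluster.\n",
"    merged = clusters[i] + clusters[j]\n",
"    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]"
],
"language": "python",
"metadata": {},
"outputs": []
},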
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python Programming for the Humanities by http://fbkarsdorp.github.io/python-course is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at https://github.com/fbkarsdorp/python-course."
]
}
]
}
]
}