{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "About the Author: Some of Sebastian Raschka's greatest passions are \"Data Science\" and machine learning. Sebastian enjoys everything that involves working with data: The discovery of interesting patterns and coming up with insightful conclusions using techniques from the fields of data mining and machine learning for predictive modeling.\n", "\n", "Currently, Sebastian is sharpening his analytical skills as a PhD candidate at Michigan State University where he is working on highly efficient virtual screening software for computer-aided drug-discovery and a novel approach to protein ligand docking (among other projects). Basically, it is about the screening of a database of millions of 3-dimensional structures of chemical compounds in order to identify the ones that could potentially bind to specific protein receptors in order to trigger a biological response.\n", "\n", "You can follow Sebastian on Twitter ([@rasbt](https://twitter.com/rasbt)) or read more about his favorite projects on [his blog](http://sebastianraschka.com/articles.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Principal Component Analysis in 3 Simple Steps\n", "Principal Component Analysis (PCA) is a simple yet popular and useful linear transformation technique that is used in numerous applications, such as stock market predictions, the analysis of gene expression data, and many more. In this tutorial, we will see that PCA is not just a \"black box\", and we are going to unravel its internals in 3 basic steps." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Introduction\n", "The sheer size of data in the modern age is not only a challenge for computer hardware but also a main bottleneck for the performance of many machine learning algorithms. The main goal of PCA is to identify patterns in data; PCA aims to detect the correlation between variables. 
Reducing the dimensionality only makes sense if a strong correlation between variables exists. In a nutshell, this is what PCA is all about: finding the directions of maximum variance in high-dimensional data and projecting the data onto a lower-dimensional subspace while retaining most of the information." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PCA Vs. LDA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both Linear Discriminant Analysis (LDA) and PCA are linear transformation methods. PCA yields the directions (principal components) that maximize the variance of the data, whereas LDA also aims to find the directions that maximize the separation (or discrimination) between different classes, which can be useful in pattern classification problems (PCA \"ignores\" class labels). \n", "***In other words, PCA projects the entire dataset onto a different feature (sub)space, and LDA tries to determine a suitable feature (sub)space in order to distinguish between patterns that belong to different classes.*** " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PCA and Dimensionality Reduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often, the desired goal is to reduce the dimensions of a $d$-dimensional dataset by projecting it onto a $k$-dimensional subspace (where $k\\;<\\;d$) in order to increase the computational efficiency while retaining most of the information. An important question is \"what is the size of $k$ that represents the data 'well'?\"\n", "\n", "Later, we will compute eigenvectors (the principal components) of a dataset and collect them in a projection matrix. Each of those eigenvectors is associated with an eigenvalue which can be interpreted as the \"length\" or \"magnitude\" of the corresponding eigenvector. 
If some eigenvalues have a significantly larger magnitude than others, then reducing the dataset via PCA to a lower-dimensional subspace by dropping the \"less informative\" eigenpairs is reasonable.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A Summary of the PCA Approach" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Standardize the data.\n", "- Obtain the eigenvectors and eigenvalues from the covariance matrix or correlation matrix, or perform Singular Value Decomposition.\n", "- Sort eigenvalues in descending order and choose the $k$ eigenvectors that correspond to the $k$ largest eigenvalues, where $k$ is the number of dimensions of the new feature subspace ($k \\le d$).\n", "- Construct the projection matrix $\\mathbf{W}$ from the selected $k$ eigenvectors.\n", "- Transform the original dataset $\\mathbf{X}$ via $\\mathbf{W}$ to obtain a $k$-dimensional feature subspace $\\mathbf{Y}$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing the Iris Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the following tutorial, we will be working with the famous \"Iris\" dataset that has been deposited on the UCI machine learning repository \n", "([https://archive.ics.uci.edu/ml/datasets/Iris](https://archive.ics.uci.edu/ml/datasets/Iris)).\n", "\n", "The Iris dataset contains measurements for 150 iris flowers from three different species.\n", "\n", "The three classes in the Iris dataset are:\n", "\n", "1. Iris-setosa (n=50)\n", "2. Iris-versicolor (n=50)\n", "3. Iris-virginica (n=50)\n", "\n", "And the four features of the Iris dataset are:\n", "\n", "1. sepal length in cm\n", "2. sepal width in cm\n", "3. petal length in cm\n", "4. 
petal width in cm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to load the Iris data directly from the UCI repository, we are going to use the superb [pandas](http://pandas.pydata.org) library. If you haven't used pandas yet, I want to encourage you to check out the [pandas tutorials](http://pandas.pydata.org/pandas-docs/stable/tutorials.html). If I had to name one Python library that makes working with data a wonderfully simple task, this would definitely be pandas!" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | sepal_len | sepal_wid | petal_len | petal_wid | class |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |