{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dimensionality Reduction\n", "\n", "Datasets are sometimes very large, containing potentially millions of data points across a large numbers of features. \n", "\n", "Each feature can also be thought of as a 'dimension'. In some cases, for high-dimensional data, we may want or need to try to reduce the number of dimensions. Reducing the number of dimensions (or reducing the number of features in a dataset) is called 'dimensionality reduction'. \n", "\n", "The simplest way to do so could simply be to drop some dimensions, and we could even choose to drop the dimensions that seem likely to be the least useful. This would be a simple method of dimensionality reduction. However, this approach is likely to throw away a lot of information, and we wouldn't necessarily know which features to keep. Typically we want to try to reduce the number of dimensions while still preserving the most information we can from the dataset. \n", "\n", "As we saw before, one way we could try and do something like this is by doing clustering. When we run a clustering analysis on high dimensional data, we can try and re-code data to store each point by it's cluster label, potentially maintaining more information in a smaller number of dimensions.\n", "\n", "Here we will introduce and explore a different approach to dimensionality reduction. Instead of dropping or clustering our features, we are going to try and learn a new representation of our data, choosing a set of feature dimensions that capture the most variance of our data. This allows us to drop low information dimensions, meaning we can reduce the dimensionality of our data, while preserving the most information. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "