{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 11 - Dimensions and visualisation\n", "\n", "> Introduction to dimensionality reduction and visualisation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[](https://mybinder.org/v2/gh/lewtun/dslectures/master?urlpath=lab/tree/notebooks%2Flesson11_visualisation.ipynb) [](https://drive.google.com/open?id=1zEGTza7zgBY6bDUxW6spcgr5InYY3Fub)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning objectives\n", "In this lecture we look at dimensionality reduction and data visualisation. The feature vector of a dataset can contain many features, and each feature corresponds to its own dimension. In the first part of the notebook we explore a method to reduce the dimensionality of a dataset. In the second part we look at an algorithm that extracts information from high-dimensional data and displays its structure in two dimensions. Both methods fall into the category of unsupervised algorithms. The learning objectives are to be able to answer the following questions:\n", "\n", "* What is dimensionality reduction?\n", "* When is dimensionality reduction useful?\n", "* What is topology and how is it connected to Mapper?\n", "* What are the steps involved in Mapper?\n", "\n", "## References\n", "\n", "* Chapter 8: Dimensionality Reduction, Section PCA of _Hands-On Machine Learning with Scikit-Learn and TensorFlow_ by Aurélien Géron\n", "* [Santa2Graph: visualize high dimensional data with Giotto Mapper](https://towardsdatascience.com/visualising-high-dimensional-data-with-giotto-mapper-897fcdb575d7)\n", "\n", "## Homework\n", "As homework, read the references, work carefully through the notebook and solve the exercises. This lecture covers several complex topics, so it is important that you familiarise yourself with them by experimenting with the notebook."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Principal component analysis\n", "\n", "First we look at a method to reduce the dimensions of a dataset. There are two main reasons why we would like to reduce the dimensionality of a dataset:\n", "\n", "1. Some machine learning algorithms struggle with high-dimensional data.\n", "2. High-dimensional data is hard to visualise. In practice, it is difficult to plot datasets that have more than two or three dimensions.\n", "\n", "One very common approach to reducing the dimensionality of data is principal component analysis (PCA). The idea is that not all axes carry the same amount of variance, and that certain combinations of axes capture most of it. PCA seeks new coordinate axes such that the variance along each axis is maximised and the axes are ordered by variance (the first principal component contains the most variance).\n", "\n", "
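The idea described above can be sketched with plain NumPy. This is a minimal illustration on hypothetical toy data (the array `X` and its three columns are assumptions made up for this example, not the housing dataset used in the notebook), and it computes PCA through a singular value decomposition rather than through any particular library class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 points in 3 dimensions, where two columns
# share a common latent factor and the third is almost pure noise.
latent = rng.normal(size=(200, 1))
X = np.hstack([
    3.0 * latent,
    1.5 * latent + 0.5 * rng.normal(size=(200, 1)),
    0.1 * rng.normal(size=(200, 1)),
])

# 1. Centre each feature so that variance is measured around the mean.
X_centred = X - X.mean(axis=0)

# 2. The rows of Vt are the new coordinate axes (principal components),
#    returned in order of decreasing singular value.
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

# 3. Variance along each new axis, and the fraction of the total it explains.
explained_variance = S**2 / (len(X) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# 4. Keep only the first two axes to obtain a 2D representation.
X_2d = X_centred @ Vt[:2].T

print(explained_variance_ratio)
print(X_2d.shape)
```

Because the components are ordered by variance, dropping the trailing ones discards the directions in which the data varies least.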
\n",
" Figure: PCA can look intimidating at the beginning.
\n",
"| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | city | postal_code | rooms_per_household | bedrooms_per_household | bedrooms_per_room | population_per_household | ocean_proximity_INLAND | ocean_proximity_<1H OCEAN | ocean_proximity_NEAR BAY | ocean_proximity_NEAR OCEAN | ocean_proximity_ISLAND |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | 69 | 94705 | 6.984127 | 1.023810 | 0.146591 | 2.555556 | 0 | 0 | 1 | 0 | 0 |\n",
"| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | 620 | 94611 | 6.238137 | 0.971880 | 0.155797 | 2.109842 | 0 | 0 | 1 | 0 | 0 |\n",
"| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | 620 | 94618 | 8.288136 | 1.073446 | 0.129516 | 2.802260 | 0 | 0 | 1 | 0 | 0 |\n",
"| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | 620 | 94618 | 5.817352 | 1.073059 | 0.184458 | 2.547945 | 0 | 0 | 1 | 0 | 0 |\n",
"| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | 620 | 94618 | 6.281853 | 1.081081 | 0.172096 | 2.181467 | 0 | 0 | 1 | 0 | 0 |\n",
"
| \n", " | longitude | \n", "latitude | \n", "housing_median_age | \n", "total_rooms | \n", "total_bedrooms | \n", "population | \n", "households | \n", "median_income | \n", "median_house_value | \n", "city | \n", "postal_code | \n", "rooms_per_household | \n", "bedrooms_per_household | \n", "bedrooms_per_room | \n", "population_per_household | \n", "ocean_proximity_INLAND | \n", "ocean_proximity_<1H OCEAN | \n", "ocean_proximity_NEAR BAY | \n", "ocean_proximity_NEAR OCEAN | \n", "ocean_proximity_ISLAND | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | \n", "19443.000000 | \n", "19443.000000 | \n", "19443.000000 | \n", "19443.000000 | \n", "19443.000000 | \n", "19443.000000 | \n", "19443.000000 | \n", "19443.000000 | \n", "19443.000000 | \n", "19443.000000 | \n", "19443.000000 | \n", "19443.000000 | \n", "19443.000000 | \n", "19443.000000 | \n", "19443.000000 | \n", "19443.000000 | \n", "19443.000000 | \n", "19443.000000 | \n", "19443.000000 | \n", "19443.000000 | \n", "
| mean | \n", "-119.560363 | \n", "35.646739 | \n", "28.435118 | \n", "2617.678548 | \n", "538.136964 | \n", "1442.129970 | \n", "501.427352 | \n", "3.675099 | \n", "191793.406162 | \n", "541.629224 | \n", "93030.145605 | \n", "5.340245 | \n", "1.091741 | \n", "0.214812 | \n", "3.095953 | \n", "0.331482 | \n", "0.439953 | \n", "0.106774 | \n", "0.121535 | \n", "0.000257 | \n", "
| std | \n", "2.002697 | \n", "2.145335 | \n", "12.504584 | \n", "2179.553070 | \n", "420.168532 | \n", "1140.254218 | \n", "383.064222 | \n", "1.569687 | \n", "96775.724042 | \n", "260.704512 | \n", "1853.684352 | \n", "2.190405 | \n", "0.429728 | \n", "0.056667 | \n", "10.679036 | \n", "0.470758 | \n", "0.496394 | \n", "0.308833 | \n", "0.326756 | \n", "0.016035 | \n", "
| min | \n", "-124.350000 | \n", "32.550000 | \n", "1.000000 | \n", "2.000000 | \n", "2.000000 | \n", "3.000000 | \n", "2.000000 | \n", "0.499900 | \n", "14999.000000 | \n", "1.000000 | \n", "85344.000000 | \n", "0.846154 | \n", "0.333333 | \n", "0.100000 | \n", "0.750000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "
| 25% | \n", "-121.760000 | \n", "33.930000 | \n", "18.000000 | \n", "1438.500000 | \n", "299.000000 | \n", "799.000000 | \n", "282.000000 | \n", "2.526500 | \n", "116700.000000 | \n", "328.000000 | \n", "91706.000000 | \n", "4.412378 | \n", "1.006140 | \n", "0.177906 | \n", "2.449692 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "
| 50% | \n", "-118.490000 | \n", "34.260000 | \n", "29.000000 | \n", "2111.000000 | \n", "436.000000 | \n", "1181.000000 | \n", "411.000000 | \n", "3.446400 | \n", "173400.000000 | \n", "545.000000 | \n", "92860.000000 | \n", "5.180451 | \n", "1.048276 | \n", "0.204545 | \n", "2.841155 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "
| 75% | \n", "-117.990000 | \n", "37.730000 | \n", "37.000000 | \n", "3119.000000 | \n", "644.000000 | \n", "1746.500000 | \n", "606.000000 | \n", "4.579750 | \n", "247100.000000 | \n", "770.000000 | \n", "94606.000000 | \n", "5.963796 | \n", "1.097701 | \n", "0.240414 | \n", "3.308208 | \n", "1.000000 | \n", "1.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "
| max | \n", "-114.310000 | \n", "41.950000 | \n", "52.000000 | \n", "39320.000000 | \n", "6445.000000 | \n", "35682.000000 | \n", "6082.000000 | \n", "15.000100 | \n", "499100.000000 | \n", "977.000000 | \n", "96161.000000 | \n", "132.533333 | \n", "34.066667 | \n", "1.000000 | \n", "1243.333333 | \n", "1.000000 | \n", "1.000000 | \n", "1.000000 | \n", "1.000000 | \n", "1.000000 | \n", "
\n",
"| | 0 | 1 | 2 |\n",
"|---|---|---|---|\n",
"| longitude | 0.096240 | -0.440492 | 0.279899 |\n",
"| latitude | -0.093538 | 0.511034 | -0.231581 |\n",
"| housing_median_age | -0.228990 | -0.102574 | -0.203651 |\n",
"| total_rooms | 0.480823 | 0.110303 | 0.003569 |\n",
"| total_bedrooms | 0.481068 | 0.046098 | -0.114091 |\n",
"| population | 0.466057 | -0.002745 | -0.121771 |\n",
"| households | 0.482901 | 0.023411 | -0.149648 |\n",
"| median_income | 0.083829 | 0.066170 | 0.381905 |\n",
"| rooms_per_household | 0.029542 | 0.299368 | 0.519937 |\n",
"| bedrooms_per_household | 0.006354 | 0.204508 | 0.343374 |\n",
"| bedrooms_per_room | -0.029537 | -0.230823 | -0.379778 |\n",
"| population_per_household | -0.002497 | -0.003664 | 0.003476 |\n",
"| ocean_proximity_INLAND | -0.010714 | 0.326173 | 0.037222 |\n",
"| ocean_proximity_<1H OCEAN | 0.054867 | -0.404136 | 0.152350 |\n",
"| ocean_proximity_NEAR BAY | -0.067325 | 0.233822 | -0.269103 |\n",
"| ocean_proximity_NEAR OCEAN | -0.003977 | -0.076523 | -0.030817 |\n",
"| ocean_proximity_ISLAND | -0.006251 | -0.009071 | 0.001835 |\n",
"
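The table above appears to list, for each feature, its weight in the first three principal components. One way to read such a table is to rank the features by the absolute value of their weight in a given column. The sketch below does this for column `0`, with the values copied from the table; the variable names (`component_0`, `ranked`, `top4`) are chosen for illustration only.

```python
# Weights of each feature in column 0, copied from the table above.
component_0 = {
    "longitude": 0.096240,
    "latitude": -0.093538,
    "housing_median_age": -0.228990,
    "total_rooms": 0.480823,
    "total_bedrooms": 0.481068,
    "population": 0.466057,
    "households": 0.482901,
    "median_income": 0.083829,
    "rooms_per_household": 0.029542,
    "bedrooms_per_household": 0.006354,
    "bedrooms_per_room": -0.029537,
    "population_per_household": -0.002497,
    "ocean_proximity_INLAND": -0.010714,
    "ocean_proximity_<1H OCEAN": 0.054867,
    "ocean_proximity_NEAR BAY": -0.067325,
    "ocean_proximity_NEAR OCEAN": -0.003977,
    "ocean_proximity_ISLAND": -0.006251,
}

# Rank features by how strongly they contribute, ignoring the sign.
ranked = sorted(component_0, key=lambda name: abs(component_0[name]), reverse=True)
top4 = ranked[:4]

# A principal axis is a unit vector, so the squared weights should sum to ~1.
squared_norm = sum(w**2 for w in component_0.values())

# Share of the squared weight carried by the four strongest features.
top4_share = sum(component_0[name] ** 2 for name in top4) / squared_norm

print(top4)
print(round(squared_norm, 3), round(top4_share, 3))
```

For this column the four strongest features are the size-related counts (`households`, `total_bedrooms`, `total_rooms`, `population`), which suggests that the first component mostly measures how large a district is.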
\n",
"| | longitude | latitude | rooms_per_household | bedrooms_per_household | bedrooms_per_room | ocean_proximity_INLAND | ocean_proximity_<1H OCEAN | ocean_proximity_NEAR BAY |\n",
"|---|---|---|---|---|---|---|---|---|\n",
"| count | 8602.000000 | 8602.000000 | 8602.000000 | 8602.000000 | 8602.000000 | 8602.000000 | 8602.000000 | 8602.000000 |\n",
"| mean | 0.767273 | -0.858163 | -0.236650 | -0.099011 | 0.289184 | -0.630324 | 0.729185 | -0.345365 |\n",
"| std | 0.374053 | 0.273084 | 0.523087 | 0.205740 | 1.184709 | 0.389128 | 0.802984 | 0.034913 |\n",
"| min | -1.577731 | -1.443513 | -2.051770 | -1.764901 | -1.787272 | -0.704163 | -0.886320 | -0.345741 |\n",
"| 25% | 0.619362 | -0.902791 | -0.611887 | -0.208570 | -0.519333 | -0.704163 | 1.128260 | -0.345741 |\n",
"| 50% | 0.709243 | -0.804902 | -0.279346 | -0.111908 | 0.105280 | -0.704163 | 1.128260 | -0.345741 |\n",
"| 75% | 0.874025 | -0.734981 | 0.105771 | -0.012303 | 0.870705 | -0.704163 | 1.128260 | -0.345741 |\n",
"| max | 2.501871 | 1.367309 | 1.901482 | 2.484505 | 13.856510 | 1.420126 | 1.128260 | 2.892336 |\n",
"
\n",
"| | longitude | latitude | rooms_per_household | bedrooms_per_household | bedrooms_per_room | ocean_proximity_INLAND | ocean_proximity_<1H OCEAN | ocean_proximity_NEAR BAY |\n",
"|---|---|---|---|---|---|---|---|---|\n",
"| count | 10841.000000 | 10841.000000 | 10841.000000 | 10841.000000 | 10841.000000 | 10841.000000 | 10841.000000 | 10841.000000 |\n",
"| mean | -0.608807 | 0.680926 | 0.187775 | 0.078563 | -0.229459 | 0.500143 | -0.578586 | 0.274036 |\n",
"| std | 0.919116 | 0.828470 | 1.223449 | 1.321400 | 0.748986 | 1.052636 | 0.724778 | 1.273936 |\n",
"| min | -2.391655 | -1.406222 | -1.655426 | -1.667938 | -2.026123 | -0.704163 | -0.886320 | -0.345741 |\n",
"| 25% | -1.298101 | 0.299849 | -0.246291 | -0.191329 | -0.724091 | -0.704163 | -0.886320 | -0.345741 |\n",
"| 50% | -0.943570 | 0.929138 | 0.056233 | -0.092209 | -0.309306 | 1.420126 | -0.886320 | -0.345741 |\n",
"| 75% | -0.034773 | 1.143562 | 0.386952 | 0.040083 | 0.146278 | 1.420126 | -0.886320 | -0.345741 |\n",
"| max | 2.621713 | 2.938200 | 58.069790 | 76.736388 | 6.038055 | 1.420126 | 1.128260 | 2.892336 |\n",
"
\n",
" Figure: Satellite image of California from Google Maps.
\n", "
\n",
" Figure: A coffee mug is a donut.
\n", "
\n",
" Figure: Schematic overview of Mapper [1].
\n", "
\n",
" Figure reference: https://github.com/NSHipster/DBSCAN
\n", "
\n",
" Figure: Mapper applied to breast cancer data.
\n", "