{ "cells": [ { "cell_type": "code", "execution_count": 2, "id": "clean-convention", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from matplotlib import gridspec\n", "from sklearn.decomposition import PCA\n", "m = ['o', 's', '^', 'x']\n", "countries = ['Argentina', 'Australia', 'Chile', 'South Africa']" ] }, { "cell_type": "markdown", "id": "catholic-watson", "metadata": {}, "source": [ "# PCA in action\n", "\n", "Previously, we saw a simple conceptual introduction to PCA: it is a method for finding the vectors that describe the directions in feature space along which the data has the largest spread. \n", "However, it is not yet clear how it can be used for dimensionality reduction. \n", "To show this, we will use a [real dataset](./example), which contains information about 44 different wines (the data file is available [here](https://github.com/arm61/trad_ml_methods/raw/main/wine_data.csv)). \n", "The features we are investigating are the columns in the data below, while the rows are the different wines. " ] }, { "cell_type": "code", "execution_count": 3, "id": "warming-transparency", "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv('wine_data.csv')" ] }, { "cell_type": "code", "execution_count": 4, "id": "departmental-abraham", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"|   | Ethanol | TotalAcid | VolatileA | MalicAcid | pH | LacticAcid | ReSugar | CitricAcid | CO2 | Density | FolinC | Glycerol | Methanol | TartaricA | Country |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 13.62 | 3.54 | 0.29 | 0.89 | 3.71 | 0.78 | 1.46 | 0.31 | 85.610001 | 0.99 | 60.919998 | 9.72 | 0.16 | 1.74 | Argentina |\n",
"| 1 | 14.06 | 3.74 | 0.59 | 0.24 | 3.73 | 1.25 | 2.42 | 0.18 | 175.199997 | 1.00 | 70.639999 | 10.05 | 0.20 | 1.58 | Argentina |\n",
"| 2 | 13.74 | 3.27 | 0.47 | -0.07 | 3.87 | 1.13 | 1.52 | 0.39 | 513.739990 | 0.99 | 63.590000 | 10.92 | 0.18 | 1.24 | Argentina |\n",
"| 3 | 13.95 | 3.66 | 0.47 | 0.09 | 3.79 | 1.00 | 4.17 | 0.41 | 379.399994 | 1.00 | 73.300003 | 9.69 | 0.23 | 2.26 | Argentina |\n",
"| 4 | 14.47 | 3.66 | 0.38 | 0.61 | 3.70 | 0.81 | 1.25 | 0.14 | 154.880005 | 0.99 | 71.690002 | 10.81 | 0.20 | 1.22 | Argentina |\n", "