{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Decorrelating your data and dimension reduction\n", "> A Summary of lecture \"Unsupervised Learning with scikit-learn\", via datacamp\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: images/pca-arrow.png" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualizing the PCA transformation\n", "- Dimension reduction\n", " - More efficient storage and computation\n", " - Remove less-informative \"noise\" features, which cause problems for prediction tasks, e.g. classification, regression.\n", "- Principal Component Analysis (PCA)\n", " - Fundamental dimension reduction technique\n", " - \"Decorrelation\"\n", " - Reduce dimension\n", "- PCA aligns data with axes\n", " - Rotates data samples to be aligned with axes\n", " - Shifts data samples so they have mean 0\n", " - No information is lost\n", "- PCA features\n", " - Rows : samples\n", " - Columns : PCA features\n", " - Row gives PCA feature values of corresponding sample\n", "- Pearson Correlation\n", " - Measures linear correlation of features\n", " - Value between -1 and 1\n", " - Value of 0 means no linear correlation\n", "- Principal components\n", " - directions of variance\n", " - PCA aligns principal components with the axes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Correlated data in nature\n", "You are given an array ```grains``` giving the width and length of samples of grain. You suspect that width and length will be correlated. To confirm this, make a scatter plot of width vs length and measure their Pearson correlation." 
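, "\n",
"\n",
"As a rough sketch of what the Pearson correlation measures, the coefficient can be computed by hand with NumPy: center each feature, then divide their dot product by the product of their norms. The helper name ```pearson_r``` is hypothetical (it is not part of the lecture), and the five sample values are the first rows printed by ```df.head()``` in the Preprocess step, so the result will differ from the correlation of the full dataset.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def pearson_r(x, y):\n",
"    # Hypothetical helper: Pearson correlation of two 1D sequences\n",
"    x = np.asarray(x, dtype=float)\n",
"    y = np.asarray(y, dtype=float)\n",
"    # Center both features so that each has mean 0\n",
"    xc = x - x.mean()\n",
"    yc = y - y.mean()\n",
"    # Dot product of the centered features divided by the product of their norms\n",
"    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))\n",
"\n",
"# First five (width, length) rows of the seeds data, used only for illustration\n",
"width = [3.312, 3.333, 3.337, 3.379, 3.562]\n",
"length = [5.763, 5.554, 5.291, 5.324, 5.658]\n",
"print(pearson_r(width, length))\n",
"```\n",
"\n",
"In practice ```scipy.stats.pearsonr``` is the more convenient choice, since it also returns a p-value."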
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Preprocess" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
       0      1
0  3.312  5.763
1  3.333  5.554
2  3.337  5.291
3  3.379  5.324
4  3.562  5.658
\n", "
" ], "text/plain": [ " 0 1\n", "0 3.312 5.763\n", "1 3.333 5.554\n", "2 3.337 5.291\n", "3 3.379 5.324\n", "4 3.562 5.658" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./dataset/seeds-width-vs-length.csv', header=None)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "grains = df.values" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.8604149377143466\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from scipy.stats import pearsonr\n", "\n", "# Assign the 0th column of grains: width\n", "width = grains[:, 0]\n", "\n", "# Assign the 1st column of grains: length\n", "length = grains[:, 1]\n", "\n", "# Scatter plot width vs length\n", "plt.scatter(width, length)\n", "plt.axis('equal');\n", "\n", "# Calculate the Pearson correlation\n", "correlation, pvalue = pearsonr(width, length)\n", "\n", "# Display the correlation\n", "print(correlation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Decorrelating the grain measurements with PCA\n", "You observed in the previous exercise that the width and length measurements of the grain are correlated. Now, you'll use PCA to decorrelate these measurements, then plot the decorrelated points and measure their Pearson correlation.\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4.85722573273506e-17\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.decomposition import PCA\n", "\n", "# Create a PCA instance: model\n", "model = PCA()\n", "\n", "# Apply the fit_transform method of model to grains: pca_features\n", "pca_features = model.fit_transform(grains)\n", "\n", "# Assign 0th column of pca_features: xs\n", "xs = pca_features[:, 0]\n", "\n", "# Assign 1st column of pca_features: ys\n", "ys = pca_features[:, 1]\n", "\n", "# Scatter plot xs vs ys\n", "plt.scatter(xs, ys)\n", "plt.axis('equal');\n", "\n", "# Calculate the Pearson correlation of xs and ys\n", "correlation, pvalue = pearsonr(xs, ys)\n", "\n", "# Display the correlation\n", "print(correlation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Intrinsic dimension\n", "- Intrinsic dimension\n", " - Intrinsic dimension = number of features needed to approximate the dataset\n", " - Essential idea behind dimension reduction\n", " - What is the most compact representation of the samples?\n", " - Can be detected with PCA\n", "- PCA identifies intrinsic dimension\n", " - Scatter plots work only if samples have 2 or 3 features\n", " - PCA identifies intrinsic dimension when samples have any number of features\n", " - Intrinsic dimension = **number of PCA features with signficant variance**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The first principal component\n", "The first principal component of the data is the direction in which the data varies the most. In this exercise, your job is to use PCA to find the first principal component of the length and width measurements of the grain samples, and represent it as an arrow on the scatter plot." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Make a scatter plot of the untransformed points\n", "plt.scatter(grains[:, 0], grains[:, 1])\n", "\n", "# Create a PCA instance: model\n", "model = PCA()\n", "\n", "# Fit model to points\n", "model.fit(grains)\n", "\n", "# Get the mean of the grain samples: mean\n", "mean = model.mean_\n", "\n", "# Get the first principal component: first_pc\n", "first_pc = model.components_[0, :]\n", "\n", "# Plot first_pc as an arrow, starting at mean\n", "plt.arrow(mean[0], mean[1], first_pc[0], first_pc[1], color='red', width=0.01)\n", "\n", "# keep axes on same scale\n", "plt.axis('equal');\n", "plt.savefig('../images/pca-arrow.png')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Variance of the PCA features\n", "The fish dataset is 6-dimensional. But what is its intrinsic dimension? Make a plot of the variances of the PCA features to find out. As before, ```samples``` is a 2D array, where each row represents a fish. You'll need to standardize the features first." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Preprocess" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
       0      1     2     3     4     5     6
0  Bream  242.0  23.2  25.4  30.0  38.4  13.4
1  Bream  290.0  24.0  26.3  31.2  40.0  13.8
2  Bream  340.0  23.9  26.5  31.1  39.8  15.1
3  Bream  363.0  26.3  29.0  33.5  38.0  13.3
4  Bream  430.0  26.5  29.0  34.0  36.6  15.1
\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6\n", "0 Bream 242.0 23.2 25.4 30.0 38.4 13.4\n", "1 Bream 290.0 24.0 26.3 31.2 40.0 13.8\n", "2 Bream 340.0 23.9 26.5 31.1 39.8 15.1\n", "3 Bream 363.0 26.3 29.0 33.5 38.0 13.3\n", "4 Bream 430.0 26.5 29.0 34.0 36.6 15.1" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./dataset/fish.csv', header=None)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "samples = df.loc[:, 1:].values" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYIAAAEGCAYAAABo25JHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAU2klEQVR4nO3df5BlZX3n8ffHYQQVhI3TRsIMtAksKbQMYIO4VCUImEJhIZRYDrUxkqgTLSlxk2wWrC2ibG1WN1WaNRqpSaCChgVcEDNB1OACAYz86MHhxzCYTFhYJrA1HUBgDKgD3/3jHpa25/b0ne4599J93q+qU3N+PPfc7xmY/vQ55znPSVUhSequl426AEnSaBkEktRxBoEkdZxBIEkdZxBIUsftMeoCdtWKFStqfHx81GVI0qKyfv36f66qsX7bFl0QjI+PMzk5OeoyJGlRSfLQbNu8NCRJHWcQSFLHGQSS1HGtB0GSZUm+l+SaPtv2THJFks1Jbksy3nY9kqSfNowzgnOATbNsez/wRFUdDHwW+PQQ6pEkTdNqECRZCZwM/PksTU4DLmnmrwROSJI2a5Ik/bS2zwj+GPh94PlZth8APAxQVduBJ4HXzGyUZE2SySSTU1NTbdUqSZ3UWhAkOQXYWlXrd9asz7odxsWuqrVVNVFVE2NjfZ+HkCTNU5tnBMcCpyZ5ELgcOD7JX85oswVYBZBkD2Bf4PEWa5IkzdDak8VVdR5wHkCS44Dfq6pfn9FsHfA+4LvAGcD11eKbcsbP/Xpbu96tHvzUyaMuQVKHDH2IiSQXAJNVtQ64CPhyks30zgRWD7seSeq6oQRBVd0I3NjMnz9t/bPAu4dRgySpP58slqSOMwgkqeMMAknqOINAkjrOIJCkjjMIJKnjDAJJ6jiDQJI6ziCQpI4zCCSp4wwCSeo4g0CSOs4gkKSOMwgkqeMMAknqOINAkjrOIJCkjmstCJLsleT2JHcl2Zjkk33anJVkKsmGZvpAW/VIkvpr81WVPwKOr6ptSZYDtyT5RlXdOqPdFVV1dot1SJJ2orUgqKoCtjWLy5up2vo+SdL8tHqPIMmyJBuArcB1VXVbn2bvSnJ3kiuTrJplP2uSTCaZnJqaarNkSeqcVoOgqp6rqsOBlcDRSd44o8lfA+NV9Sbg28Als+xnbVVNVNXE2NhYmyVLUucMpddQVf0AuBE4acb6x6rqR83inwFvHkY9kqQXtdlraCzJfs38K4ATgftntNl/2uKpwKa26pEk9ddmr6H9gUuSLKMXOF+pqmuSXABMVtU64KNJTgW2A48DZ7VYjySpjzZ7Dd0NHNFn/fn
T5s8DzmurBknS3HyyWJI6ziCQpI4zCCSp4wwCSeo4g0CSOs4gkKSOMwgkqeMMAknqOINAkjrOIJCkjjMIJKnjDAJJ6jiDQJI6ziCQpI4zCCSp4wwCSeo4g0CSOq7NdxbvleT2JHcl2Zjkk33a7JnkiiSbk9yWZLyteiRJ/bV5RvAj4Piq+iXgcOCkJMfMaPN+4ImqOhj4LPDpFuuRJPXRWhBUz7ZmcXkz1YxmpwGXNPNXAickSVs1SZJ21Oo9giTLkmwAtgLXVdVtM5ocADwMUFXbgSeB1/TZz5okk0kmp6am2ixZkjqn1SCoqueq6nBgJXB0kjfOaNLvt/+ZZw1U1dqqmqiqibGxsTZKlaTOGkqvoar6AXAjcNKMTVuAVQBJ9gD2BR4fRk2SpJ42ew2NJdmvmX8FcCJw/4xm64D3NfNnANdX1Q5nBJKk9uzR4r73By5Jsoxe4Hylqq5JcgEwWVXrgIuALyfZTO9MYHWL9UiS+mgtCKrqbuCIPuvPnzb/LPDutmqQJM3NJ4slqeMMAknqOINAkjrOIJCkjjMIJKnjDAJJ6jiDQJI6ziCQpI4zCCSp4wwCSeo4g0CSOs4gkKSOMwgkqeMMAknqOINAkjrOIJCkjjMIJKnj2nxn8aokNyTZlGRjknP6tDkuyZNJNjTT+f32JUlqT5vvLN4O/G5V3ZlkH2B9kuuq6r4Z7W6uqlNarEOStBOtnRFU1aNVdWcz/zSwCTigre+TJM3PUO4RJBmn9yL72/psfmuSu5J8I8kbhlGPJOlFbV4aAiDJ3sBVwMeq6qkZm+8EDqqqbUneCXwNOKTPPtYAawAOPPDAliuWpG5p9YwgyXJ6IXBpVX115vaqeqqqtjXz1wLLk6zo025tVU1U1cTY2FibJUtS57TZayjARcCmqvrMLG1e17QjydFNPY+1VZMkaUdzXhpqflD/O+Dnq+qCJAcCr6uq2+f46LHAe4F7kmxo1n0cOBCgqi4EzgA+nGQ78AywuqpqfociSZqPQe4R/CnwPHA8cAHwNL3LPUft7ENVdQuQOdp8Hvj8QJVKkloxSBC8paqOTPI9gKp6IsnLW65LkjQkg9wj+EmSZUABJBmjd4YgSVoCBgmCzwFXA69N8l+AW4A/bLUqSdLQzHlpqKouTbIeOIHeNf9fq6pNrVcmSRqKQXoNHQNsrKovNMv7JHlLVfV7SliStMgMcmnoi8C2acs/bNZJkpaAQYIg0/v2V9XzDGFoCknScAwSBA8k+WiS5c10DvBA24VJkoZjkCD4EPBvgH8CtgBvoRkATpK0+A3Sa2grsHoItUiSRmCQXkNjwAeB8entq+q32itLkjQsg9z0/SvgZuDbwHPtliNJGrZBguCVVfUfW69EkjQSg9wsvqZ5e5gkaQkaJAjOoRcGzyR5KsnTSWa+clKStEgN0mton2EUIkkajYGeEE7yr+i9VH6vF9ZV1U1tFSVJGp5Buo9+gN7loZXABuAY4Lv03lgmSVrkBr1HcBTwUFW9DTgCmJrrQ0lWJbkhyaYkG5uhKWa2SZLPJdmc5O4kR+7yEUiSFmSQS0PPVtWzSUiyZ1Xdn+TQAT63HfjdqrozyT7A+iTXVdV909q8g94lp0PoDV3xxeZPSdKQDBIEW5LsB3wNuC7JE8Ajc32oqh4FHm3mn06yCTgAmB4EpwFfakY3vTXJfkn2bz4rSRqCQXoNnd7MfiLJDcC+wDd35UuSjNO7pDTzZTYHAA9PW97SrPupIEiyhmaguwMPPHBXvlqSNIdZ7xEkeXXz58+8MAH30Htn8d6DfkGSvYGrgI9V1cznD9LnI7XDiqq1VTVRVRNjY2ODfrUkaQA7OyP4H8ApwHp6P5wz48+fn2vnSZbTC4FLq+qrfZpsAVZNW17JAJedJEm7z6xBUFWnJAnwK1X1f3Z1x81nLwI2VdVnZmm2Djg7yeX0bhI/6f0BSRqund4jqKpKcjXw5nns+1jgvcA9STY06z4OHNjs+0LgWuCdwGb
gX4DfnMf3SJIWYJBeQ7cmOaqq7tiVHVfVLfS/BzC9TQEf2ZX9SpJ2r0GC4G3Abyd5CPghzT2CqnpTq5VJkoZikCB4R+tVSJJGZpDnCB4CSPJapg06J0laGuYcayjJqUn+AfjfwN8CDwLfaLkuSdKQDDLo3H+mN+Lo31fV64ETgO+0WpUkaWgGuUfwk6p6LMnLkrysqm5I8unWK9NAxs/9+qhLGMiDnzp51CVImsUgQfCDZpiIm4FLk2ylN7KoJGkJGOTS0E3AfvTeS/BN4B+Bf9tmUZKk4RkkCAJ8C7iR3mBzV1TVY20WJUkanjmDoKo+WVVvoPcE8M8Bf5vk261XJkkaikHOCF6wFfi/wGPAa9spR5I0bIM8R/DhJDcC/wtYAXzQ4SUkaekYpNfQQfReKrNhzpaSpEVnkCEmzh1GIZKk0diVewSSpCXIIJCkjjMIJKnjWguCJBcn2Zrk3lm2H5fkySQbmun8tmqRJM1ukF5D8/UXwOeBL+2kzc1VdUqLNUiS5tDaGUFV3QQ83tb+JUm7x6jvEbw1yV1JvpHkDbM1SrImyWSSyampqWHWJ0lL3iiD4E7goKr6JeBPgK/N1rCq1lbVRFVNjI2NDa1ASeqCkQVBVT1VVdua+WuB5UlWjKoeSeqqkQVBktclSTN/dFOLw1tL0pC11msoyWXAccCKJFuAPwCWA1TVhcAZwIeTbAeeAVZXVbVVjySpv9aCoKrOnGP75+l1L5UkjdCoew1JkkbMIJCkjjMIJKnjDAJJ6jiDQJI6ziCQpI4zCCSp4wwCSeo4g0CSOs4gkKSOMwgkqeMMAknqOINAkjrOIJCkjjMIJKnjDAJJ6jiDQJI6rrUgSHJxkq1J7p1le5J8LsnmJHcnObKtWiRJs2vzjOAvgJN2sv0dwCHNtAb4You1SJJm0VoQVNVNwOM7aXIa8KXquRXYL8n+bdUjSepvlPcIDgAenra8pVm3gyRrkkwmmZyamhpKcZLUFaMMgvRZV/0aVtXaqpqoqomxsbGWy5KkbhllEGwBVk1bXgk8MqJaJKmzRhkE64DfaHoPHQM8WVWPjrAeSeqkPdracZLLgOOAFUm2AH8ALAeoqguBa4F3ApuBfwF+s61aJEmzay0IqurMObYX8JG2vl+SNBifLJakjjMIJKnjDAJJ6jiDQJI6ziCQpI4zCCSp4wwCSeo4g0CSOs4gkKSOMwgkqeMMAknqOINAkjrOIJCkjjMIJKnjDAJJ6jiDQJI6ziCQpI5rNQiSnJTk+0k2Jzm3z/azkkwl2dBMH2izHknSjtp8Z/Ey4AvA24EtwB1J1lXVfTOaXlFVZ7dVhyRp59o8Izga2FxVD1TVj4HLgdNa/D5J0jy0GQQHAA9PW97SrJvpXUnuTnJlklUt1iNJ6qPNIEifdTVj+a+B8ap6E/Bt4JK+O0rWJJlMMjk1NbWby5SkbmszCLYA03/DXwk8Mr1BVT1WVT9qFv8MeHO/HVXV2qqaqKqJsbGxVoqVpK5qMwjuAA5J8vokLwdWA+umN0iy/7TFU4FNLdYjSeqjtV5DVbU9ydnAt4BlwMVVtTHJBcBkVa0DPprkVGA78DhwVlv1SJL6ay0IAKrqWuDaGevOnzZ/HnBemzVIknbOJ4slqeMMAknqOINAkjrOIJCkjjMIJKnjDAJJ6jiDQJI6ziCQpI5r9YEyaVeNn/v1UZcwkAc/dfKoS5B2G88IJKnjDAJJ6jiDQJI6ziCQpI4zCCSp4wwCSeo4g0CSOs7nCKSW+WyEXuo8I5Ckjms1CJKclOT7STYnObfP9j2TXNFsvy3JeJv1SJJ21FoQJFkGfAF4B3AYcGaSw2Y0ez/wRFUdDHwW+HRb9UiS+mvzjOBoYHNVPVBVPwYuB06b0eY04JJm/krghCRpsSZJ0gxt3iw+AHh42vIW4C2ztamq7UmeBF4D/PP0RknWAGuaxW1Jvt9KxfOzghn1LlRGf1601I5pqR0PLL1j2u3H8xLwUjumg2bb0GY
Q9PvNvubRhqpaC6zdHUXtbkkmq2pi1HXsTkvtmJba8cDSO6aldjywuI6pzUtDW4BV05ZXAo/M1ibJHsC+wOMt1iRJmqHNILgDOCTJ65O8HFgNrJvRZh3wvmb+DOD6qtrhjECS1J7WLg011/zPBr4FLAMurqqNSS4AJqtqHXAR8OUkm+mdCaxuq54WvSQvWS3QUjumpXY8sPSOaakdDyyiY4q/gEtSt/lksSR1nEEgSR1nECzAXENoLDZJLk6yNcm9o65ld0iyKskNSTYl2ZjknFHXtBBJ9kpye5K7muP55Khr2l2SLEvyvSTXjLqWhUryYJJ7kmxIMjnqegbhPYJ5aobQ+Hvg7fS6wd4BnFlV9420sAVI8svANuBLVfXGUdezUEn2B/avqjuT7AOsB35tsf43ap66f1VVbUuyHLgFOKeqbh1xaQuW5HeACeDVVXXKqOtZiCQPAhNV9VJ6mGynPCOYv0GG0FhUquomltBzHFX1aFXd2cw/DWyi9zT7olQ925rF5c206H+TS7ISOBn481HX0lUGwfz1G0Jj0f6QWeqakW2PAG4bbSUL01xC2QBsBa6rqkV9PI0/Bn4feH7UhewmBfxNkvXN8DgveQbB/A00PIZGL8newFXAx6rqqVHXsxBV9VxVHU7vSf2jkyzqS3hJTgG2VtX6UdeyGx1bVUfSG3n5I80l15c0g2D+BhlCQyPWXEu/Cri0qr466np2l6r6AXAjcNKIS1moY4FTm+vqlwPHJ/nL0Za0MFX1SPPnVuBqepeRX9IMgvkbZAgNjVBzc/UiYFNVfWbU9SxUkrEk+zXzrwBOBO4fbVULU1XnVdXKqhqn92/o+qr69RGXNW9JXtV0TCDJq4BfBV7yvfAMgnmqqu3AC0NobAK+UlUbR1vVwiS5DPgucGiSLUneP+qaFuhY4L30fsvc0EzvHHVRC7A/cEOSu+n9InJdVS367pZLzM8CtyS5C7gd+HpVfXPENc3J7qOS1HGeEUhSxxkEktRxBoEkdZxBIEkdZxBIUscZBFqykjzXdBm9N8n/TPLKZv3rklye5B+T3Jfk2iT/etrn/n2SZ5Psu5N9/1EzAugfzaOuwxd5N1YtMQaBlrJnqurwZiTVHwMfah4yuxq4sap+oaoOAz5Or//3C86k10//9J3s+7eBI6vqP8yjrsOBXQqC9PjvVa3wfyx1xc3AwcDbgJ9U1YUvbKiqDVV1M0CSXwD2Bv4TvUDYQZJ1wKuA25K8p3ni96okdzTTsU27o5P8XTPO/t8lObR5Cv0C4D3N2cp7knwiye9N2/+9ScabaVOSPwXuBFYl+dUk301yZ3OWs3cbf1nqFoNAS16SPegNAHYP8EZ67yWYzZnAZfSC49Akr53ZoKpO5cWzjSuA/w58tqqOAt7Fi8Mp3w/8clUdAZwP/GEzZPn5wBXTPr8zh9J7P8QRwA/pBdSJzaBmk8DvzP03IO3cHqMuQGrRK5ohm6H3g/0i4ENzfGY1cHpVPZ/kq8C7gS/M8ZkTgcN6V50AeHUz3sy+wCVJDqE3Mu3yeRzDQ9NePHMMcBjwnea7Xk5vSBBpQQwCLWXPNEM2/39JNgJn9Guc5E3AIcB1037QPsDcQfAy4K1V9cyM/f0JcENVnd68D+HGWT6/nZ8+O99r2vwPp++S3vhCfS9ZSfPlpSF1zfXAnkk++MKKJEcl+RV6l4U+UVXjzfRzwAFJDppjn39DbwDCF/b3QvjsC/xTM3/WtPZPA/tMW34QOLL57JHA62f5nluBY5Mc3LR95fTeTtJ8GQTqlOqNsng68Pam++hG4BP03iWxml6PoumubtbvzEeBiSR3J7mPFy8//Tfgvyb5DrBsWvsb6F1K2pDkPfTel/AzzWWsD9N7F3a/2qfoBcplzQiktwK/OPdRSzvn6KOS1HGeEUhSxxkEktRxBoEkdZxBIEkdZxBIUscZBJLUcQaBJHXc/wNk/czBO90vPgAAAABJRU5ErkJggg==\n", 
"text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.preprocessing import StandardScaler\n", "from sklearn.pipeline import make_pipeline\n", "\n", "# Create scaler: scaler\n", "scaler = StandardScaler()\n", "\n", "# Create a PCA instance: pca\n", "pca = PCA()\n", "\n", "# Create pipeline: pipeline\n", "pipeline = make_pipeline(scaler, pca)\n", "\n", "# Fit the pipeline to 'samples'\n", "pipeline.fit(samples)\n", "\n", "# Plot the explained variances\n", "features = range(pca.n_components_)\n", "plt.bar(features, pca.explained_variance_)\n", "plt.xlabel('PCA feature')\n", "plt.ylabel('variance')\n", "plt.xticks(features);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dimension reduction with PCA\n", "- Dimension reduction\n", " - Represent same data, using less features\n", " - Important part of machine-learning pipelines\n", " - Can be performed using PCA\n", "- Dimension reduction with PCA\n", " - PCA features are in decreasing order of variance\n", " - Assumes the low variance features are \"noise\", and high variance features are informative\n", " - Specify how many features to keep\n", " - Intrinsic dimension is a good choice\n", "- Word frequency arrays\n", " - Rows represent documents, columns represent words\n", " - Entries measure presence of each word in each document, measure using \"tf-idf\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dimension reduction of the fish measurements\n", "In a previous exercise, you saw that 2 was a reasonable choice for the \"intrinsic dimension\" of the fish measurements. Now use PCA for dimensionality reduction of the fish measurements, retaining only the 2 most important components." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Preprocess" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "scaler = StandardScaler()\n", "scaled_samples = scaler.fit_transform(samples)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(85, 2)\n" ] } ], "source": [ "# Create a PCA model with 2 components: pca\n", "pca = PCA(n_components=2)\n", "\n", "# Fit the PCA instance to the scaled samples\n", "pca.fit(scaled_samples)\n", "\n", "# Transform the scaled samples: pca_features\n", "pca_features = pca.transform(scaled_samples)\n", "\n", "# Print the shape of pca_features\n", "print(pca_features.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A tf-idf word-frequency array\n", "In this exercise, you'll create a tf-idf word frequency array for a toy collection of documents. For this, use the ```TfidfVectorizer``` from sklearn. It transforms a list of documents into a word frequency array, which it outputs as a csr_matrix. It has ```fit()``` and ```transform()``` methods like other sklearn objects." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "documents = ['cats say meow', 'dogs say woof', 'dogs chase cats']" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0.51785612 0. 0. 0.68091856 0.51785612 0. ]\n", " [0. 0. 0.51785612 0. 0.51785612 0.68091856]\n", " [0.51785612 0.68091856 0.51785612 0. 0. 0. 
]]\n", "['cats', 'chase', 'dogs', 'meow', 'say', 'woof']\n" ] } ], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "# Create a TfidfVectorizer: tfidf\n", "tfidf = TfidfVectorizer()\n", "\n", "# Apply fit_transform to documents: csr_mat\n", "csr_mat = tfidf.fit_transform(documents)\n", "\n", "# Print result of toarray() method\n", "print(csr_mat.toarray())\n", "\n", "# Get the words: words\n", "# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0 or newer\n", "words = tfidf.get_feature_names_out().tolist()\n", "\n", "# Print words\n", "print(words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clustering Wikipedia part I\n", "You saw in the video that ```TruncatedSVD``` is able to perform PCA on sparse arrays in csr_matrix format, such as word-frequency arrays. Combine your knowledge of TruncatedSVD and k-means to cluster some popular pages from Wikipedia. In this exercise, build the pipeline. In the next exercise, you'll apply it to the word-frequency array of some Wikipedia articles.\n", "\n", "Create a Pipeline object consisting of a TruncatedSVD followed by KMeans. (This time, we've precomputed the word-frequency matrix for you, so there's no need for a TfidfVectorizer).\n", "\n", "The Wikipedia dataset you will be working with was obtained from [here](https://blog.lateral.io/2015/06/the-unknown-perils-of-mining-wikipedia/)." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import TruncatedSVD\n", "from sklearn.cluster import KMeans\n", "from sklearn.pipeline import make_pipeline\n", "\n", "# Create a TruncatedSVD instance: svd\n", "svd = TruncatedSVD(n_components=50)\n", "\n", "# Create a KMeans instance: kmeans\n", "kmeans = KMeans(n_clusters=6)\n", "\n", "# Create a pipeline: pipeline\n", "pipeline = make_pipeline(svd, kmeans)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clustering Wikipedia part II\n", "It is now time to put your pipeline from the previous exercise to work! 
You are given an array articles of tf-idf word-frequencies of some popular Wikipedia articles, and a list titles of their titles. Use your pipeline to cluster the Wikipedia articles.\n", "\n", "A solution to the previous exercise has been pre-loaded for you, so a Pipeline pipeline chaining TruncatedSVD with KMeans is available." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Preprocess" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "from scipy.sparse import csc_matrix\n", "\n", "documents = pd.read_csv('./dataset/wikipedia-vectors.csv', index_col=0)\n", "titles = documents.columns\n", "articles = csc_matrix(documents.values).T" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "scipy.sparse.csr.csr_matrix" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(articles)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(13125, 60)" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "articles.T.shape" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " label article\n", "38 0 Neymar\n", "36 0 2014 FIFA World Cup qualification\n", "35 0 Colombia national football team\n", "34 0 Zlatan Ibrahimović\n", "33 0 Radamel Falcao\n", "32 0 Arsenal F.C.\n", "31 0 Cristiano Ronaldo\n", "30 0 France national football team\n", "39 0 Franck Ribéry\n", "37 0 Football\n", "18 1 2010 United Nations Climate Change Conference\n", "17 1 Greenhouse gas emissions by the United States\n", "16 1 350.org\n", "15 1 Kyoto Protocol\n", "19 1 2007 United Nations Climate Change Conference\n", "13 1 Connie Hedegaard\n", "12 1 Nigel Lawson\n", "11 1 Nationally Appropriate Mitigation Action\n", "10 1 Global warming\n", "47 1 Fever\n", "14 1 Climate 
change\n", "28 2 Anne Hathaway\n", "27 2 Dakota Fanning\n", "25 2 Russell Crowe\n", "26 2 Mila Kunis\n", "23 2 Catherine Zeta-Jones\n", "22 2 Denzel Washington\n", "21 2 Michael Fassbender\n", "20 2 Angelina Jolie\n", "24 2 Jessica Biel\n", "29 2 Jennifer Aniston\n", "50 3 Chad Kroeger\n", "51 3 Nate Ruess\n", "52 3 The Wanted\n", "53 3 Stevie Nicks\n", "54 3 Arctic Monkeys\n", "55 3 Black Sabbath\n", "56 3 Skrillex\n", "57 3 Red Hot Chili Peppers\n", "59 3 Adam Levine\n", "58 3 Sepsis\n", "40 4 Tonsillitis\n", "48 4 Gabapentin\n", "46 4 Prednisone\n", "45 4 Hepatitis C\n", "49 4 Lymphoma\n", "43 4 Leukemia\n", "42 4 Doxycycline\n", "41 4 Hepatitis B\n", "44 4 Gout\n", "9 5 LinkedIn\n", "8 5 Firefox\n", "7 5 Social search\n", "6 5 Hypertext Transfer Protocol\n", "5 5 Tumblr\n", "4 5 Google Search\n", "3 5 HTTP cookie\n", "2 5 Internet Explorer\n", "1 5 Alexa Internet\n", "0 5 HTTP 404\n" ] } ], "source": [ "# Fit the pipeline to articles\n", "pipeline.fit(articles)\n", "\n", "# Calculate the cluster labels: labels\n", "labels = pipeline.predict(articles)\n", "\n", "# Create a DataFrame aligning labels and titles: df\n", "df = pd.DataFrame({'label': labels, 'article': titles})\n", "\n", "# Display df sorted by cluster label\n", "print(df.sort_values('label'))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }