{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Discovering interpretable features\n", "> A Summary of lecture \"Unsupervised Learning with scikit-learn\", via datacamp\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: images/cosine.png" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Non-Negative matrix factorization (NMF)\n", "- NMF = Non-negative matrix factorization\n", " - Dimension reduction technique\n", " - NMF models are interpretable (unlike PCA)\n", " - Easy to interpret means easy to explain\n", " - However, all sample features must be non-negative ($\\ge 0$)\n", "- NMF components\n", " - Just like PCA has principal components\n", " - Dimension of components = dimension of samples\n", " - Entries are non-negative\n", " - Can be used to reconstruct the samples\n", " - Combine feature values with components\n", "- Sample reconstruction\n", " - Multiply components by feature values, and add up\n", " - Can also be expressed as a product of matrices\n", " - This is the \"Matrix Factorization\" in \"NMF\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NMF applied to Wikipedia articles\n", "In the video, you saw NMF applied to transform a toy word-frequency array. Now it's your turn to apply NMF, this time using the tf-idf word-frequency array of Wikipedia articles, given as a csr matrix ```articles```. Here, fit the model and transform the articles. In the next exercise, you'll explore the result.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Preprocess" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from scipy.sparse import csr_matrix\n", "\n", "documents = pd.read_csv('./dataset/wikipedia-vectors.csv', index_col=0)\n", "titles = documents.columns\n", "articles = csr_matrix(documents.values).T" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " 0.00000000e+00 4.40447144e-01]\n", " [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " 0.00000000e+00 5.66581665e-01]\n", " [3.82052712e-03 0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " 0.00000000e+00 3.98630002e-01]\n", " [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " 0.00000000e+00 3.81723960e-01]\n", " [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " 0.00000000e+00 4.85497565e-01]\n", " [1.29288170e-02 1.37900639e-02 7.76326408e-03 3.34365996e-02\n", " 0.00000000e+00 3.34508155e-01]\n", " [0.00000000e+00 0.00000000e+00 2.06741971e-02 0.00000000e+00\n", " 6.04540794e-03 3.59046120e-01]\n", " [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " 0.00000000e+00 4.90956931e-01]\n", " [1.54271421e-02 1.42828947e-02 3.76635009e-03 2.37026001e-02\n", " 2.62642981e-02 4.80754528e-01]\n", " [1.11736323e-02 3.13702678e-02 3.09484990e-02 6.56762061e-02\n", " 1.96694618e-02 3.38274818e-01]\n", " [0.00000000e+00 0.00000000e+00 5.30717612e-01 0.00000000e+00\n", " 2.83704029e-02 0.00000000e+00]\n", " [0.00000000e+00 0.00000000e+00 3.56508094e-01 0.00000000e+00\n", " 0.00000000e+00 
0.00000000e+00]\n", " [1.20125112e-02 6.50087569e-03 3.12244190e-01 6.09549744e-02\n", " 1.13871286e-02 1.92593939e-02]\n", " [3.93478571e-03 6.24483457e-03 3.42372089e-01 1.10728765e-02\n", " 0.00000000e+00 0.00000000e+00]\n", " [4.63812699e-03 0.00000000e+00 4.34913555e-01 0.00000000e+00\n", " 3.84308261e-02 3.08119905e-03]\n", " [0.00000000e+00 0.00000000e+00 4.83287460e-01 0.00000000e+00\n", " 0.00000000e+00 0.00000000e+00]\n", " [5.65006510e-03 1.83547516e-02 3.76531712e-01 3.25342948e-02\n", " 0.00000000e+00 1.13329771e-02]\n", " [0.00000000e+00 0.00000000e+00 4.80912131e-01 0.00000000e+00\n", " 0.00000000e+00 0.00000000e+00]\n", " [0.00000000e+00 9.01923006e-03 5.51006051e-01 0.00000000e+00\n", " 0.00000000e+00 0.00000000e+00]\n", " [0.00000000e+00 0.00000000e+00 4.65968041e-01 0.00000000e+00\n", " 0.00000000e+00 0.00000000e+00]\n", " [0.00000000e+00 1.14088418e-02 2.08654946e-02 5.17579649e-01\n", " 5.81458673e-02 1.37848139e-02]\n", " [0.00000000e+00 0.00000000e+00 0.00000000e+00 5.10290254e-01\n", " 0.00000000e+00 0.00000000e+00]\n", " [0.00000000e+00 5.60141699e-03 0.00000000e+00 4.22226760e-01\n", " 0.00000000e+00 0.00000000e+00]\n", " [0.00000000e+00 0.00000000e+00 0.00000000e+00 4.36592958e-01\n", " 0.00000000e+00 0.00000000e+00]\n", " [0.00000000e+00 0.00000000e+00 0.00000000e+00 4.97911506e-01\n", " 0.00000000e+00 0.00000000e+00]\n", " [9.88376115e-02 8.60100028e-02 3.91034522e-03 3.80879401e-01\n", " 4.39283084e-04 5.22130114e-03]\n", " [0.00000000e+00 0.00000000e+00 0.00000000e+00 5.71962504e-01\n", " 0.00000000e+00 7.13513359e-03]\n", " [1.31466473e-02 1.04860275e-02 0.00000000e+00 4.68736079e-01\n", " 0.00000000e+00 1.16305318e-02]\n", " [3.84543550e-03 0.00000000e+00 0.00000000e+00 5.75501882e-01\n", " 0.00000000e+00 0.00000000e+00]\n", " [2.25241869e-03 1.38746694e-03 0.00000000e+00 5.27754407e-01\n", " 1.20275139e-02 1.49477806e-02]\n", " [0.00000000e+00 4.07574382e-01 1.85713967e-03 0.00000000e+00\n", " 2.96635743e-03 4.52315589e-04]\n", " [1.53418232e-03 6.08212140e-01 5.22275466e-04 6.24626335e-03\n", " 1.18454877e-03 4.40049387e-04]\n", " [5.38809700e-03 2.65034105e-01 5.38508926e-04 1.86857967e-02\n", " 6.38706684e-03 2.90092523e-03]\n", " [0.00000000e+00 6.44957364e-01 0.00000000e+00 0.00000000e+00\n", " 0.00000000e+00 0.00000000e+00]\n", " [0.00000000e+00 6.08946122e-01 0.00000000e+00 0.00000000e+00\n", " 0.00000000e+00 0.00000000e+00]\n", " [0.00000000e+00 3.43707347e-01 0.00000000e+00 0.00000000e+00\n", " 3.97828600e-03 0.00000000e+00]\n", " [6.10497459e-03 3.15333091e-01 1.54879481e-02 0.00000000e+00\n", " 5.06288085e-03 4.74315077e-03]\n", " [6.47362189e-03 2.13342287e-01 9.49492529e-03 4.56815320e-02\n", " 1.71929395e-02 9.52023189e-03]\n", " [7.99132601e-03 4.67625236e-01 0.00000000e+00 2.43337052e-02\n", " 0.00000000e+00 0.00000000e+00]\n", " [0.00000000e+00 6.42861446e-01 0.00000000e+00 2.35768628e-03\n", " 0.00000000e+00 0.00000000e+00]\n", " [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " 4.77121003e-01 0.00000000e+00]\n", " [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " 4.94295496e-01 0.00000000e+00]\n", " [0.00000000e+00 2.99081204e-04 2.14485182e-03 0.00000000e+00\n", " 3.81809252e-01 5.83752705e-03]\n", " [0.00000000e+00 0.00000000e+00 0.00000000e+00 5.64485513e-03\n", " 5.42284829e-01 0.00000000e+00]\n", " [1.78055699e-03 7.84461186e-04 1.41627290e-02 4.59634651e-04\n", " 4.24336362e-01 0.00000000e+00]\n", " [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " 5.11432598e-01 
0.00000000e+00]\n", " [0.00000000e+00 0.00000000e+00 3.28382958e-03 0.00000000e+00\n", " 3.72916714e-01 0.00000000e+00]\n", " [0.00000000e+00 2.62099570e-04 3.61103149e-02 2.32246874e-04\n", " 2.30529171e-01 0.00000000e+00]\n", " [1.12515562e-02 2.12341198e-03 1.60971826e-02 1.02447544e-02\n", " 3.25487703e-01 3.75864568e-02]\n", " [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " 4.18991783e-01 3.57664717e-04]\n", " [3.08367803e-01 0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " 0.00000000e+00 0.00000000e+00]\n", " [3.68174824e-01 0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " 0.00000000e+00 0.00000000e+00]\n", " [3.97945914e-01 2.81721215e-02 3.67011224e-03 1.70005030e-02\n", " 1.95983506e-03 2.11635763e-02]\n", " [3.75795603e-01 2.07534002e-03 0.00000000e+00 3.72019376e-02\n", " 0.00000000e+00 5.85903599e-03]\n", " [4.38029361e-01 0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " 0.00000000e+00 0.00000000e+00]\n", " [4.57882228e-01 0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " 0.00000000e+00 0.00000000e+00]\n", " [2.75477966e-01 4.46985638e-03 0.00000000e+00 5.29463349e-02\n", " 0.00000000e+00 1.90989751e-02]\n", " [4.45195103e-01 0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " 5.48742823e-03 0.00000000e+00]\n", " [2.92741164e-01 1.33673384e-02 1.14263020e-02 1.05161816e-02\n", " 1.87711505e-01 9.23926402e-03]\n", " [3.78267498e-01 1.43979557e-02 0.00000000e+00 9.84882180e-02\n", " 1.35911385e-02 0.00000000e+00]]\n" ] } ], "source": [ "from sklearn.decomposition import NMF\n", "\n", "# Create an NMF instance: model\n", "model = NMF(n_components=6)\n", "\n", "# Fit the model to articles\n", "model.fit(articles)\n", "\n", "# Transform the articles: nmf_features\n", "nmf_features = model.transform(articles)\n", "\n", "# Print the NMF features\n", "print(nmf_features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NMF features of the Wikipedia articles\n", "Now you will explore the NMF features you created in the previous exercise. A solution to the previous exercise has been pre-loaded, so the array ```nmf_features``` is available. Also available is a list ```titles``` giving the title of each Wikipedia article.\n", "\n", "When investigating the features, notice that for both actors, NMF feature 3 has by far the highest value. This means that both articles are reconstructed using mainly the 3rd NMF component. In the next video, you'll see why: NMF components represent topics (for instance, acting!)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0    0.003845\n", "1    0.000000\n", "2    0.000000\n", "3    0.575502\n", "4    0.000000\n", "5    0.000000\n", "Name: Anne Hathaway, dtype: float64\n", "0    0.000000\n", "1    0.005601\n", "2    0.000000\n", "3    0.422227\n", "4    0.000000\n", "5    0.000000\n", "Name: Denzel Washington, dtype: float64\n" ] } ], "source": [ "# Create a pandas DataFrame: df\n", "df = pd.DataFrame(nmf_features, index=titles)\n", "\n", "# Print the row for 'Anne Hathaway'\n", "print(df.loc['Anne Hathaway'])\n", "\n", "# Print the row for 'Denzel Washington'\n", "print(df.loc['Denzel Washington'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NMF reconstructs samples\n", "In this exercise, you'll check your understanding of how NMF reconstructs samples from its components using the NMF feature values. Below are the components of an NMF model.
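Recall from the introduction that reconstruction can be expressed as a product of matrices: writing the feature values as a row vector $w$ and the components as a matrix $H$, the reconstructed sample is the product $wH$.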
If the NMF feature values of a sample are ```[2, 1]```, then which of the following is most likely to represent the original sample? A pen and paper will help here! You have to apply the same technique Ben used in the video to reconstruct the sample ```[0.1203 0.1764 0.3195 0.141]```. Working it out by hand: $2 \\times [1, 0.5, 0] + 1 \\times [0.2, 0.1, 2.1] = [2.2, 1.1, 2.1]$, which the code below confirms." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "sample_feature = np.array([2, 1])\n", "components = np.array([[1. , 0.5, 0. ],\n", "                       [0.2, 0.1, 2.1]])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2.2, 1.1, 2.1])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reconstruct the sample: multiply the feature values by the components and sum\n", "np.matmul(sample_feature.T, components)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NMF learns interpretable parts\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NMF learns topics of documents\n", "In the video, you learned that when NMF is applied to documents, the components correspond to topics of documents, and the NMF features reconstruct the documents from the topics. Verify this for yourself for the NMF model that you built earlier using the Wikipedia articles. Previously, you saw that the 3rd NMF feature value was high for the articles about actors Anne Hathaway and Denzel Washington. In this exercise, identify the topic of the corresponding NMF component.\n", "\n", "The NMF model you built earlier is available as ```model```, while ```words``` is a list of the words that label the columns of the word-frequency array.\n", "\n", "After you are done, take a moment to recognise the topic that the articles about Anne Hathaway and Denzel Washington have in common!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Preprocess" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "words = []\n", "with open('./dataset/wikipedia-vocabulary-utf8.txt') as f:\n", "    words = f.read().splitlines()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(6, 13125)\n", "film       0.628104\n", "award      0.253223\n", "starred    0.245373\n", "role       0.211528\n", "actress    0.186465\n", "Name: 3, dtype: float64\n" ] } ], "source": [ "# Create a DataFrame: components_df\n", "components_df = pd.DataFrame(model.components_, columns=words)\n", "\n", "# Print the shape of the DataFrame\n", "print(components_df.shape)\n", "\n", "# Select row 3: component\n", "component = components_df.iloc[3]\n", "\n", "# Print result of nlargest\n", "print(component.nlargest())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explore the LED digits dataset\n", "In the following exercises, you'll use NMF to decompose grayscale images into their commonly occurring patterns. First, explore the image dataset and see how it is encoded as an array. You are given 100 images as a 2D array ```samples```, where each row represents a single 13x8 image. The images in your dataset are pictures of an LED digital display." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Preprocess" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>
\n", "<p>(DataFrame preview: 5 rows × 104 columns, one column per pixel since 13×8 = 104; every value visible in this head is 0.0)</p>\n", "</div>