{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "pyJyEQhIFShS"
},
"source": [
"**Name:** \\_\\_\\_\\_\\_\n",
"\n",
"**EID:** \\_\\_\\_\\_\\_"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "I7SLallYFShU"
},
"source": [
"# Tutorial 6: Linear Dimensionality Reduction and Face Recognition\n",
"\n",
"In this tutorial, you will apply linear dimensionality reduction to face images and then train a classifier for face recognition.\n",
"\n",
"First, we need to initialize Python. Run the cell below."
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"id": "5NSOh-p-FShV"
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib_inline.backend_inline\n",
"# set up the output image format (Chrome works best)\n",
"matplotlib_inline.backend_inline.set_matplotlib_formats(\"svg\")\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib\n",
"from joblib import *\n",
"from numpy import *\n",
"from sklearn import *\n",
"import glob\n",
"import os\n",
"random.seed(100)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Vgbb5uKSFShW"
},
"source": [
"## 1. Loading Data and Pre-processing\n",
"We first need to load the images. Download `olivetti_py3.pkz` from Canvas, and place it in the same directory as this ipynb file. _DO NOT UNZIP IT_. Then run the following cell to load the images."
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"id": "RdITDZGdFShW"
},
"outputs": [],
"source": [
"oli = datasets.fetch_olivetti_faces(data_home=\"./\")\n",
"X = oli.data\n",
"Y = oli.target\n",
"img = oli.images\n",
"imgsize = oli.images[0].shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "iXkZx3S4FShW"
},
"source": [
"Each image is a 64x64 array of pixel values, which flattens to a 4096-dimensional vector. Run the code below to show an example:"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"id": "h3xSb8eKFShX",
"outputId": "f2506bb5-50cd-413e-a41f-10fbd3bfa5d1",
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(64, 64)\n"
]
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"print(img[0].shape)\n",
"plt.imshow(img[0], cmap='gray', interpolation='nearest')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LB7HmYLFFShX"
},
"source": [
"Run the code below to show all the images."
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"id": "qWhbQxA7FShY",
"outputId": "e5d8cb1b-4d3b-45a6-9bb9-0afac3c12df4",
"scrolled": false
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"tmp = []\n",
"for i in range(0,400,20):\n",
" tmp.append( hstack(img[i:i+20]) )\n",
"allimg = vstack(tmp)\n",
"plt.figure(figsize=(9,9))\n",
"plt.imshow(allimg, cmap='gray', interpolation='nearest')\n",
"plt.gca().xaxis.set_ticklabels([])\n",
"plt.gca().yaxis.set_ticklabels([])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HGnO1FJRFShY"
},
"source": [
"Each person is treated as one class, and there are 10 images per class, for 40 classes (people) in total. The data is already vectorized and stored in the matrix `X`, and the class labels are in the vector `Y`. Now we split the data into training and test sets."
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"id": "eDrbZifwFShY",
"outputId": "72c4a286-9aad-4dde-c2a6-988feaf4392a"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(320, 4096)\n",
"(80, 4096)\n"
]
}
],
"source": [
"# randomly split data into 80% train and 20% test set\n",
"trainX, testX, trainY, testY = \\\n",
" model_selection.train_test_split(X, Y,\n",
" train_size=0.80, test_size=0.20, random_state=100)\n",
"\n",
"print(trainX.shape)\n",
"print(testX.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7HSm_jeOFShY"
},
"source": [
"## 2. Principal Component Analysis - PCA\n",
"The data has a large dimension (4096), so training classifiers directly on the pixels would take a long time. Instead, our strategy is to use PCA to reduce the dimension first, and then use the PCA weights as the representation of each image. Run PCA on the data using 9 principal components."
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {
"id": "abyFP8L9FShZ"
},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE\n",
"## HINT\n",
"# 1. decomposition.PCA(n_components=9)"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {
"id": "b7CWQ-hXFShZ"
},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE\n",
"pca = decomposition.PCA(n_components=9)\n",
"trainW = pca.fit_transform(trainX) # fit the training set\n",
"testW = pca.transform(testX) # use the pca model to transform the test set"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {
"id": "ySRMxr7VFShZ",
"outputId": "79d38f00-f5ef-45f1-ff53-7518fc18400b"
},
"outputs": [
{
"data": {
"text/plain": [
"dtype('float32')"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainX.dtype"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Av9H5ZW3FShZ"
},
"source": [
"The function below plots the basis vectors of PCA. Run the next 2 cells to view the PCs."
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {
"id": "kXlrpFxnFShZ"
},
"outputs": [],
"source": [
"def plot_basis(model, imgsize):\n",
" cname = model.__class__.__name__\n",
" if cname == 'LDA':\n",
" KK = model.n_components\n",
" comps = model.coef_\n",
" mn = None\n",
" elif cname == 'PCA':\n",
" KK = model.n_components_\n",
" comps = model.components_\n",
" mn = model.mean_\n",
" elif cname == 'NMF':\n",
" KK = model.n_components_\n",
" comps = model.components_\n",
" mn = None \n",
" elif cname == 'TruncatedSVD':\n",
" KK = model.components_.shape[0]\n",
" comps = model.components_\n",
" mn = None\n",
" K = KK\n",
" if mn is not None:\n",
" K += 1\n",
" nr = ceil(K/5.0)\n",
" sind = 1\n",
"\n",
" #vmin = comps.flatten().min()\n",
" #vmax = comps.flatten().max()\n",
"\n",
" # plot the mean\n",
" pcfig = plt.figure(figsize=(8,nr*2))\n",
" if mn is not None:\n",
" plt.subplot(int(nr),5,sind)\n",
" plt.imshow(mn.reshape(imgsize), interpolation='nearest')\n",
" plt.title(\"mean\")\n",
" plt.gray()\n",
" plt.gca().xaxis.set_ticklabels([])\n",
" plt.gca().yaxis.set_ticklabels([])\n",
" sind += 1\n",
" # plot the components\n",
" for j in range(0,KK):\n",
" plt.subplot(int(nr),5,sind)\n",
" v = comps[j,:]\n",
" I = v.reshape(imgsize)\n",
" plt.imshow(I, interpolation='nearest')\n",
" plt.gray()\n",
" plt.title(\"basis \" + str(j+1))\n",
" plt.gca().xaxis.set_ticklabels([])\n",
" plt.gca().yaxis.set_ticklabels([])\n",
" sind += 1"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {
"id": "-6qU4AZ4FShZ",
"outputId": "cbf2c939-7da2-4f1d-a79a-953f557b1910"
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# run the function\n",
"plot_basis(pca, imgsize)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zTF5f1f0FSha"
},
"source": [
"_What do the basis images look like? Do some basis images correspond to particular facial features?_\n",
"- **INSERT YOUR ANSWER HERE**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QOU0A9inFSha"
},
"source": [
"- **INSERT YOUR ANSWER HERE**\n",
"- The mean is the average face.\n",
"- Bases 7, 8, and 9 correspond to different glasses frames.\n",
"- Basis 3 captures the eyebrows.\n",
"- Basis 4 captures the eyes."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ATVKJQHyFSha"
},
"source": [
"### Face Recognition\n",
"Now train a logistic regression classifier for face recognition. Use the PCA representation computed above as the new set of inputs. Use cross-validation to set the hyperparameters of the classifier. You do not need to do cross-validation for the number of components. Calculate the average training and test accuracies. Remember to transform the test data into the PCA representation too!\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {
"id": "X26KPGqqFSha"
},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE\n",
"## HINT\n",
"# 1. linear_model.LogisticRegressionCV(Cs=logspace(-4,4,20), cv=5, n_jobs=-1)\n",
"# 2. calculate accuracy: metrics.accuracy_score"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {
"id": "XsCQdpp4FSha",
"outputId": "25b5fae4-2e77-45cb-f79f-f69769176215"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"train accuracy = \n",
"test accuracy = 0.7625\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\haoychen3\\Anaconda3\\envs\\test\\lib\\site-packages\\sklearn\\linear_model\\_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
"STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
"\n",
"Increase the number of iterations (max_iter) or scale the data as shown in:\n",
" https://scikit-learn.org/stable/modules/preprocessing.html\n",
"Please also refer to the documentation for alternative solver options:\n",
" https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
" n_iter_i = _check_optimize_result(\n"
]
}
],
"source": [
"### INSERT YOUR CODE HERE\n",
"logreg = linear_model.LogisticRegressionCV(Cs=logspace(-4,4,20), cv=5, n_jobs=-1)\n",
"logreg.fit(trainW, trainY)\n",
"\n",
"# predict from the model\n",
"predYtrain = logreg.predict(trainW)\n",
"predYtest = logreg.predict(testW)\n",
"\n",
"# calculate training accuracy\n",
"acc = metrics.accuracy_score(trainY, predYtrain)\n",
"print(\"train accuracy =\", acc)\n",
"\n",
"# calculate test accuracy\n",
"acc = metrics.accuracy_score(testY,predYtest)\n",
"print(\"test accuracy =\", acc)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "h9u3KCi2FSha"
},
"source": [
"### Finding the Best Number of Components\n",
"Now try a range of values for the number of PCA components. Train a classifier for each value and see which dimension gives the best test accuracy. Make a plot of PCA dimension vs. test accuracy."
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {
"id": "p1fD1CqPFSha"
},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE\n",
"## HINT\n",
"# 1. n = [1,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90] #components\n",
"# 2. decomposition.PCA(n_components=n)\n",
"# 3. linear_model.LogisticRegressionCV(Cs=logspace(-4,4,20), cv=5, n_jobs=-1)\n",
"# 4. calculate accuracy: metrics.accuracy_score"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {
"id": "I7_UlOyQFShb",
"outputId": "92d346b1-da8d-4668-8a29-79e3916ed3fa"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\haoychen3\\Anaconda3\\envs\\test\\lib\\site-packages\\sklearn\\linear_model\\_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
"STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
"\n",
"Increase the number of iterations (max_iter) or scale the data as shown in:\n",
" https://scikit-learn.org/stable/modules/preprocessing.html\n",
"Please also refer to the documentation for alternative solver options:\n",
" https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
" n_iter_i = _check_optimize_result(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 : 0.1\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"5 : 0.55\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"10 : 0.8375\n",
"15 : 0.9\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"20 : 0.95\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"25 : 0.9625\n",
"30 : 0.9625\n",
"35 : 0.9625\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"40 : 0.9625\n",
"45 : 0.9625\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"50 : 0.9625\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"55 : 0.9625\n",
"60 : 0.9625\n",
"65 : 0.9625\n",
"70 : 0.9625\n",
"75 : 0.9625\n",
"80 : 0.9625\n",
"85 : 0.9625\n",
"90 : 0.9625\n"
]
}
],
"source": [
"ns = [1,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90]\n",
"\n",
"trainacc = []\n",
"testacc = []\n",
"for n in ns:\n",
" pca = decomposition.PCA(n_components=n)\n",
" trainW = pca.fit_transform(trainX) # fit the training set\n",
" testW = pca.transform(testX) # use the pca model to transform the test set\n",
"\n",
" clf = linear_model.LogisticRegressionCV(Cs=logspace(-4,4,20), cv=5, n_jobs=-1)\n",
" clf.fit(trainW, trainY)\n",
"\n",
" # predict from the model\n",
" predYtrain = clf.predict(trainW)\n",
" predYtest = clf.predict(testW)\n",
"\n",
" # calculate accuracy\n",
" acc = metrics.accuracy_score(trainY, predYtrain)\n",
" trainacc.append(acc)\n",
"\n",
" # calculate accuracy\n",
" acc = metrics.accuracy_score(testY, predYtest)\n",
" testacc.append(acc)\n",
" print(n, \":\",acc)\n"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {
"id": "dRQuHqfjFShb",
"outputId": "8d24e6eb-6915-4d43-ab2f-1bc054fb0fa4"
},
"outputs": [
{
"data": {
"text/plain": [
"Text(0, 0.5, 'accuracy')"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.plot(ns, trainacc, 'bx-', label='train')\n",
"plt.plot(ns, testacc, 'ro-', label='test')\n",
"plt.title(\"PCA + LR\")\n",
"plt.legend(loc=0)\n",
"plt.grid(True); plt.xlabel('n components'); plt.ylabel('accuracy')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WNBCcpeLFShb"
},
"source": [
"_What is the best number of components? View the basis images to see what they look like._\n",
"- **INSERT YOUR ANSWER HERE**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uELzilH_FShb"
},
"source": [
"- **INSERT YOUR ANSWER HERE**\n",
"- About 20 to 25 components are sufficient: test accuracy reaches 0.95 with 20 components and plateaus at 0.9625 from 25 components onward."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "z54C2XanFShb"
},
"source": [
"Plot the basis vectors of PCA with 20 components."
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {
"id": "nBDn2B1LFShb"
},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {
"id": "TljsopQ3FShb",
"outputId": "53cac611-83a7-4ca2-b820-d2489b5fa499"
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"### INSERT YOUR CODE HERE\n",
"pca = decomposition.PCA(n_components=20)\n",
"trainW = pca.fit_transform(trainX) # fit the training set\n",
"plot_basis(pca, imgsize)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"id": "mzTlZmIQFShb"
},
"source": [
"## 3. Linear Dimensionality Reduction - SVD"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GwtNf6o4FShc"
},
"source": [
"Now we will repeat the experiment using SVD instead of PCA. Perform SVD with 9 components and visualize the basis images."
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {
"id": "oMYRAeryFShc"
},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE\n",
"## HINT\n",
"# 1. decomposition.TruncatedSVD(n_components=9)"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"id": "fRywhYMYFShc"
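},
"outputs": [],
"source": [
"svd = decomposition.TruncatedSVD(n_components=9)\n",
"trainW = svd.fit_transform(trainX) # fit the training set\n",
"testW = svd.transform(testX) # use the svd model to transform the test set\n",
"plot_basis(svd, imgsize) # visualize the SVD basis images"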
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lKkjSt9SFShc"
},
"source": [
"### Finding the Best Number of Components\n",
"Now find the number of components that gives the best test accuracy. Use the same type of classifier that you used in the previous experiment. Use cross-validation to select the hyperparameters of the classifier. You do not need to do cross-validation for the number of components."
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {
"id": "cU3wEjE2FShc"
},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE\n",
"## HINT\n",
"# 1. n = [1,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90] #components\n",
"# 2. decomposition.TruncatedSVD(n_components=n)\n",
"# 3. linear_model.LogisticRegressionCV(Cs=logspace(-4,4,20), cv=5, n_jobs=-1)\n",
"# 4. calculate accuracy: metrics.accuracy_score"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {
"id": "o1Jqh11gFShc",
"outputId": "5208d1d9-3fb0-4b9c-d7ed-971aaae47026"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 : 0.075\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\haoychen3\\Anaconda3\\envs\\test\\lib\\site-packages\\sklearn\\linear_model\\_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
"STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
"\n",
"Increase the number of iterations (max_iter) or scale the data as shown in:\n",
" https://scikit-learn.org/stable/modules/preprocessing.html\n",
"Please also refer to the documentation for alternative solver options:\n",
" https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
" n_iter_i = _check_optimize_result(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"5 : 0.525\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"10 : 0.8375\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"15 : 0.925\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"20 : 0.9625\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"25 : 0.9625\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"30 : 0.9625\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"35 : 0.9625\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"40 : 0.9625\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"45 : 0.9625\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"50 : 0.9625\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"55 : 0.9625\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"60 : 0.9625\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"65 : 0.9625\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"70 : 0.9625\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"75 : 0.9625\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"80 : 0.9625\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"85 : 0.9625\n",
"90 : 0.9625\n"
]
}
],
"source": [
"ns=[1,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90]\n",
"\n",
"trainacc = []\n",
"testacc = []\n",
"for n in ns:\n",
"    svd = decomposition.TruncatedSVD(n_components=n)\n",
"    trainW = svd.fit_transform(trainX)  # fit the SVD model on the training set\n",
"    testW = svd.transform(testX)  # use the SVD model to transform the test set\n",
"\n",
"    # lbfgs can stop at its default iteration limit here (ConvergenceWarning);\n",
"    # raising max_iter or scaling the features are the usual remedies\n",
"    clf = linear_model.LogisticRegressionCV(Cs=logspace(-4,4,20), cv=3, n_jobs=-1)\n",
"    clf.fit(trainW, trainY)\n",
"\n",
"    # predict from the model\n",
"    predYtrain = clf.predict(trainW)\n",
"    predYtest = clf.predict(testW)\n",
"\n",
"    # calculate training accuracy\n",
"    acc = mean(trainY==predYtrain)\n",
"    trainacc.append(acc)\n",
"\n",
"    # calculate test accuracy (this is the value printed below)\n",
"    acc = mean(testY==predYtest)\n",
"    testacc.append(acc)\n",
"    print(n,\":\",acc)"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {
"id": "XmX3S-xdFShg",
"outputId": "ff1413a1-922f-4461-bdb1-4e887852f813"
},
"outputs": [
{
"data": {
"text/plain": [
"Text(0, 0.5, 'accuracy')"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.plot(ns, trainacc, 'bx-', label='SVD train')\n",
"plt.plot(ns, testacc, 'ro-', label='SVD test')\n",
"plt.title(\"SVD + LR\")\n",
"plt.legend(loc=0)\n",
"plt.grid(True)\n",
"plt.xlabel('n components')\n",
"plt.ylabel('accuracy')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "25dkTihfFShg"
},
"source": [
"_Which number of components gives the best test result? How does the accuracy compare to the best PCA result? Which is better, SVD or PCA, and why?_\n",
"- **INSERT YOUR ANSWER HERE**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dYP69IUyFShg"
},
"source": [
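"- **INSERT YOUR ANSWER HERE**\n",
"- About 20 components already give good test results.\n",
"- PCA is better (test accuracy > 90%) and more stable than SVD. A likely reason is that PCA centers the data before the decomposition, whereas `TruncatedSVD` does not, so part of the SVD basis is spent representing the mean face."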
]
},
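{
"cell_type": "markdown",
"metadata": {},
"source": [
"One difference between the two methods that may help explain the comparison above: scikit-learn's `PCA` subtracts the per-feature mean before decomposing, while `TruncatedSVD` works on the data as given. The following is a minimal sketch on a synthetic toy matrix (an assumption for illustration, not the Olivetti faces; the variable names are hypothetical). It checks that the first uncentered SVD component points almost along the data mean, and that SVD on manually centered data recovers the PCA directions up to sign.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.decomposition import PCA, TruncatedSVD\n",
"\n",
"# Hypothetical toy data: 200 samples, 10 features, large non-zero mean\n",
"rng = np.random.default_rng(0)\n",
"X = rng.normal(size=(200, 10)) + 5.0\n",
"mean_vec = X.mean(axis=0)\n",
"\n",
"# PCA centers internally; TruncatedSVD does not\n",
"pca = PCA(n_components=3, svd_solver='full').fit(X)\n",
"svd_raw = TruncatedSVD(n_components=3, algorithm='arpack', random_state=0).fit(X)\n",
"svd_centered = TruncatedSVD(n_components=3, algorithm='arpack', random_state=0).fit(X - mean_vec)\n",
"\n",
"# |cosine| between the first uncentered SVD component and the mean direction\n",
"mean_alignment = abs(svd_raw.components_[0] @ mean_vec) / np.linalg.norm(mean_vec)\n",
"\n",
"# |cosine| between each PCA component and the matching centered-SVD component\n",
"match = np.abs(np.sum(pca.components_ * svd_centered.components_, axis=1))\n",
"\n",
"print('alignment of first raw SVD component with the mean:', mean_alignment)\n",
"print('PCA vs centered SVD agreement per component:', match)\n",
"```\n",
"\n",
"Both quantities should come out close to 1 on this toy data. On the face images, the analogous effect would be that the first SVD basis vector resembles the mean face, which can be checked against the basis plots below."
]
},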
{
"cell_type": "markdown",
"metadata": {
"id": "LxWvU0txFShg"
},
"source": [
"Plot the basis vectors of PCA with 35 components."
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {
"id": "-Ro2mtFTFShg"
},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {
"id": "X4t2i47EFShg",
"outputId": "8e52a322-8a47-48a9-cfbf-addbfc01718b"
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"### INSERT YOUR CODE HERE\n",
"pca = decomposition.PCA(n_components=35)\n",
"trainW = pca.fit_transform(trainX) # fit the training set\n",
"plot_basis(pca, imgsize)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot the basis vectors of SVD with 35 components."
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"svd = decomposition.TruncatedSVD(n_components=35)\n",
"trainW = svd.fit_transform(trainX)\n",
"plot_basis(svd, imgsize)"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 1
}