{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## HW1 - Basics of ML\n",
    "Include your code in the relevant cells below.\n",
    "Subparts labeled as questions (Q1.1, Q1.2, etc.) should have their answers filled in or plots placed prominently, as appropriate."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Important notes: \n",
    "\n",
    "1. On this and future homeworks, depending on the data size and your hardware configuration, experiments may take too long if you use the complete dataset. This may be challenging, as you may need to run multiple experiments. So, if an experiment takes too much time, start first with a smaller sample that will allow you to run your code within a reasonable time. Once you complete all tasks, before the final submission, you can allow longer run times and run your code with the complete set. However, if this is still taking too much time or causing your computer to freeze, it will be OK to submit experiments using a sample size that is feasible for your setting (indicate it clearly in your submission). Grading of the homework will not be affected from this type of variations in the design of your experiments.\n",
    "\n",
    "\n",
    "2. You can switch between 2D image data and 1D vector data using the numpy functions flatten() and resize()"
   ]
  },
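  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of note 2, assuming a 28 x 28 image stored as a NumPy array. The array `img` is synthetic placeholder data, not MNIST, and `reshape()` is used here to restore the 2D shape.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Hypothetical 28 x 28 image filled with placeholder values\n",
    "img = np.arange(28 * 28).reshape(28, 28)\n",
    "\n",
    "vec = img.flatten()             # 2D image -> 1D vector with 784 entries (a copy)\n",
    "img_back = vec.reshape(28, 28)  # 1D vector -> 2D image\n",
    "\n",
    "assert vec.shape == (784,)\n",
    "assert np.array_equal(img, img_back)\n",
    "```"
   ]
  },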
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### S1: Understanding the data\n",
    "- Load MNIST dataset (hint it's available as part of https://keras.io/api/datasets)\n",
    "\n",
    "Q1.1: What is the number of features in the training dataset:   ___\n",
    "\n",
    "Q1.2: What is the number of samples in the training dataset:   ___\n",
    "\n",
    "Q1.1: What is the number of features in the testing dataset:   ___\n",
    "\n",
    "Q1.4: What is the number of samples in the testing dataset:   ___\n",
    "\n",
    "Q1.3: What is the dimensionality of each data sample: ___"
   ]
  },
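  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the loading step, assuming Keras is installed and the dataset can be downloaded on first use. The variable names are illustrative only.\n",
    "\n",
    "```python\n",
    "from keras.datasets import mnist\n",
    "\n",
    "# Downloads the data on the first call, then reads it from the local cache\n",
    "(x_train, y_train), (x_test, y_test) = mnist.load_data()\n",
    "\n",
    "# Inspect the array shapes to answer the questions above\n",
    "print(x_train.shape, y_train.shape)\n",
    "print(x_test.shape, y_test.shape)\n",
    "```"
   ]
  },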
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### S2: Viewing the data\n",
    "- Select one random example from each category from the training set. Display the 2D image with the name of the category\n",
    "\n",
    "Q2.1: Visualize the example image:   ___"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### S3: Sub-sampling the data\n",
    "- Reduce training and testing sample sizes by **randomly selecting** %10 of the initial samples\n",
    "\n",
    "Q3.1: What is the distribution of each label in the initial train data (i.e. percentage of each label):   ___\n",
    "\n",
    "Q3.2: What is the distribution of each label in the reduced train data:   ___"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### S4: Sub-sampling the data (again)\n",
    "- Reduce training and testing sample sizes by selecting **the first** %10 of the initial samples\n",
    "\n",
    "Q4.1: What is the distribution of each label in the initial train data (i.e. percentage of each label):   ___\n",
    "\n",
    "Q4.2: What is the distribution of each label in the reduced train data:   ___\n",
    "\n",
    "Q4.3: What are your comments/interpretation on comparison of the results for S3 and S4"
   ]
  },
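  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two sub-sampling strategies in S3 and S4 can be sketched as follows. This is only an illustration on synthetic labels: the array `y` and the helper `label_percentages` are hypothetical, and the seed is fixed only for reproducibility.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "y = rng.integers(0, 10, size=1000)  # placeholder labels 0..9, not MNIST\n",
    "\n",
    "n_keep = len(y) // 10  # 10% of the samples\n",
    "idx_random = rng.choice(len(y), size=n_keep, replace=False)  # S3: random 10%\n",
    "idx_first = np.arange(n_keep)                                # S4: first 10%\n",
    "\n",
    "def label_percentages(labels, n_classes=10):\n",
    "    # Percentage of samples carrying each label\n",
    "    counts = np.bincount(labels, minlength=n_classes)\n",
    "    return 100.0 * counts / counts.sum()\n",
    "\n",
    "print(label_percentages(y))              # initial distribution\n",
    "print(label_percentages(y[idx_random]))  # random subset\n",
    "print(label_percentages(y[idx_first]))   # first-10% subset\n",
    "```\n",
    "\n",
    "The same index arrays can be applied to the image array so that images and labels stay aligned."
   ]
  },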
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### ! For the rest of the HW, please discard sub-sampled data from S3 and use subsampled data from S4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### S5: Exploring the dataset\n",
    "- Select all **train** images in category \"3\". Create and display a single pixel-wise \"average image\" for this category.\n",
    "- Create and display a single pixel-wise \"standard deviation image\" for this category?\n",
    "- Repeat the items above for category \"3\" images in the **test** set. Compare the average and standard deviation images.\n",
    "- Repeat the items above for a different category you select.\n",
    "\n",
    "Q5.1: Plot the 2D mean and std images for category 3 in training and testing sets:   ___\n",
    "\n",
    "Q5.2: Plot the 2D mean and std images for the category you selected in training and testing sets:   ___\n",
    "\n",
    "Q5.3: Comment on differences between the mean and std images from training and testing datasets? ___"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### S6: Image distances\n",
    "- In the training set, find the image in category 3 that is most dissimilar to the mean image of category 3. Show it as a 2D image\n",
    "- In the training set, find the image in category 3 that is most similar to mean image of category 3. Show it as a 2D image\n",
    "- In the training set, find the image in category 9 that is most similar to mean image of category 3. Show it as a 2D image\n",
    "\n",
    "**Hint:** You can use the \"euclidean distance\" as your similarity metric. Given that an image i is represented with a flattened feature vector v_i , and the second image j with v_m, the distance between these two images can be calculated using the vector norm of their differences ( | v_i - v_j | ) \n",
    "\n",
    "Q6.1: What is the index of most dissimilar image in category 3:   ___\n",
    "\n",
    "Q6.2: Plot the most dissimilar category 3 image in 2D:   ___\n",
    "\n",
    "Q6.3: Plot the most similar category 3 image in 2D:   ___"
   ]
  },
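  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The distance in the hint can be sketched with NumPy as below. The array `X` holds synthetic placeholder images, not MNIST, and the indices refer to positions within `X` only.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "X = rng.random((5, 28, 28))  # placeholder images of one category\n",
    "\n",
    "V = X.reshape(len(X), -1)    # flatten each image into a 784-vector\n",
    "mean_vec = V.mean(axis=0)    # pixel-wise mean image as a vector\n",
    "\n",
    "# Euclidean distance of every image to the mean image\n",
    "dists = np.linalg.norm(V - mean_vec, axis=1)\n",
    "\n",
    "most_similar = int(np.argmin(dists))     # smallest distance\n",
    "most_dissimilar = int(np.argmax(dists))  # largest distance\n",
    "```\n",
    "\n",
    "If the images were first selected with a boolean mask, the resulting positions need to be mapped back to indices in the full training set."
   ]
  },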
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### S7: Image distances, part 2\n",
    "- Repeat questions S3 and S4 after binarizing the images first\n",
    "\n",
    "Q7.1: What is the index of most dis-similar category 3 image:   ___\n",
    "\n",
    "Q7.2: What is the index of most similar category 3 image:   ___\n",
    "\n",
    "Q7.3: Did the answer change after binarization? How do you interprete this finding?:   ___"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### S8: Binary classification between category 3 and 9  (split train data)\n",
    "- Select images from these two categories in the training dataset\n",
    "- Split them into two sets (Set1, Set2) with a %60 and %40 random split\n",
    "- Replace category labels as 0 (for 3) and 1 (for 9)\n",
    "- Use Set1 to train a linear SVM classifier with default parameters and predict the class labels for Set2 \n",
    "- Use Set2 to train a linear SVM classifier with default parameters and predict the class labels for Set1 \n",
    "\n",
    "Q8.1: What is the prediction accuracy using the model trained on Set1:   ___\n",
    "\n",
    "Q8.2: What is the prediction accuracy using the model trained on Set2:   ___"
   ]
  },
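  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One possible way to set up the S8 experiment is sketched below, assuming scikit-learn is available and taking `LinearSVC` as the linear SVM (`SVC(kernel='linear')` would be another reasonable reading). The data is random placeholder data, so the resulting accuracy is not meaningful.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.svm import LinearSVC\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "X = rng.random((200, 784))        # placeholder flattened images\n",
    "y = rng.choice([3, 9], size=200)  # placeholder category labels\n",
    "\n",
    "y01 = (y == 9).astype(int)        # relabel: 3 -> 0, 9 -> 1\n",
    "\n",
    "# Random 60% / 40% split into Set1 and Set2\n",
    "X1, X2, y1, y2 = train_test_split(X, y01, train_size=0.6, random_state=0)\n",
    "\n",
    "clf = LinearSVC().fit(X1, y1)     # train on Set1 with default parameters\n",
    "acc_on_set2 = clf.score(X2, y2)   # fraction of correct predictions on Set2\n",
    "```\n",
    "\n",
    "Swapping the roles of the two sets gives the second experiment."
   ]
  },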
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### S9: Binary classification between category 3 and 9 (train + test sets)\n",
    "- Select images from these two categories in the training and testing datasets\n",
    "- Replace category labels as 0 (for 3) and 1 (for 9)\n",
    "- Use training set to train a linear SVM classifier with default parameters and predict the class labels for the testing set\n",
    "- Use testing set to train a linear SVM classifier with default parameters and predict the class labels for the training set\n",
    "\n",
    "Q9.1: What is the prediction accuracy using the model trained on the training set:   ___\n",
    "\n",
    "Q9.2: What is the prediction accuracy using the model trained on the testing set:   ___"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### S10: k-NN Error Analysis\n",
    "- In training and testing datasets select the images in categories: 1, 3, 5, 7 or 9\n",
    "- Train k-NN classifiers using 4 to 40 nearest neighbors with a step size of 4\n",
    "- Calculate and plot overall testing accuracy for each experiment\n",
    "\n",
    "Q10.1: For k=4 what is the label that was predicted with lowest accuracy:   ___\n",
    "\n",
    "Q10.2: For k=20 what is the label that was predicted with lowest accuracy:   ___\n",
    "\n",
    "Q10.3: What is the label pair that was confused most often (i.e. class A is labeled as B, and vice versa):   ___\n",
    "\n",
    "Q10.4: Visualize 5 mislabeled samples with their actual and predicted labels"
   ]
  },
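  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the k-NN sweep, assuming scikit-learn's `KNeighborsClassifier` with otherwise default settings. The arrays are random placeholders, so the numbers it produces say nothing about MNIST.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.metrics import confusion_matrix\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "classes = [1, 3, 5, 7, 9]\n",
    "X_train, y_train = rng.random((300, 784)), rng.choice(classes, size=300)\n",
    "X_test, y_test = rng.random((100, 784)), rng.choice(classes, size=100)\n",
    "\n",
    "accuracies = {}\n",
    "for k in range(4, 41, 4):  # k = 4, 8, ..., 40\n",
    "    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)\n",
    "    accuracies[k] = knn.score(X_test, y_test)  # overall test accuracy\n",
    "\n",
    "# Per-label accuracy for one k, read off the confusion matrix\n",
    "pred = KNeighborsClassifier(n_neighbors=4).fit(X_train, y_train).predict(X_test)\n",
    "cm = confusion_matrix(y_test, pred, labels=classes)  # rows: true, cols: predicted\n",
    "per_label_acc = cm.diagonal() / cm.sum(axis=1)\n",
    "```\n",
    "\n",
    "The off-diagonal entries of `cm` count how often one label was predicted as another."
   ]
  },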
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### S11: Feature extraction\n",
    "\n",
    "- We describe each image by using a reduced set of features (compared to n=784 initial features for each pixel value) as follows:\n",
    "  \n",
    "  1. Binarize the image (background=0, foreground=1)\n",
    "\n",
    "  2. For each image row i, find n_i, the sum of 1's in the row (28 features) \n",
    "  \n",
    "  3. For each image column j, find n_j, the sum of 1's in the column (28 features)\n",
    "  \n",
    "  4. Concatenate these features into a feature vector of 56 features\n",
    "  \n",
    "Repeat classification experiments in S9 using this reduced feature set.\n",
    "\n",
    "Q11.1: What is the prediction accuracy using the model trained using the train data:   ___\n",
    "\n",
    "Q11.2: What is the prediction accuracy using the model trained using the test data:   ___\n"
   ]
  },
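  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The four steps in S11 can be sketched as a small function. The name `row_col_features` and the binarization threshold of 0 are assumptions made for illustration; the assignment does not fix a threshold.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def row_col_features(img, threshold=0):\n",
    "    # Step 1: binarize (background=0, foreground=1); threshold is an assumption\n",
    "    binary = (img > threshold).astype(np.uint8)\n",
    "    row_sums = binary.sum(axis=1)  # step 2: number of 1's in each row\n",
    "    col_sums = binary.sum(axis=0)  # step 3: number of 1's in each column\n",
    "    return np.concatenate([row_sums, col_sums])  # step 4: one feature vector\n",
    "\n",
    "# Placeholder 28 x 28 image with a single vertical stroke\n",
    "img = np.zeros((28, 28), dtype=np.uint8)\n",
    "img[5:20, 14] = 255\n",
    "\n",
    "features = row_col_features(img)\n",
    "assert features.shape == (56,)\n",
    "assert features[:28].sum() == features[28:].sum() == 15\n",
    "```"
   ]
  },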
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bonus:\n",
    "\n",
    "- This time we describe each 28 x 28 image by using a different feature set (n = 28 x 4 features). This feature set encodes \"index of the first non-zero pixel in image columns or rows\" from each direction (from left, right, top, bottom)\n",
    "\n",
    "Example for a 6 x 6 image:\n",
    "\n",
    "Img:\n",
    " 0 0 0 0 0 0\n",
    " 0 0 0 1 0 0\n",
    " 0 0 0 1 0 0\n",
    " 0 0 0 1 0 0\n",
    " 0 0 0 1 0 0\n",
    " 0 0 0 0 0 0\n",
    " \n",
    "Extracted features:\n",
    " 0 3 3 3 3 0  0 2 2 2 2 0  0 0 0 1 0 0  0 0 0 1 0 0   (left, right, top, bottom)\n",
    "  \n",
    "Repeat classification experiments in S9 using this reduced feature set.\n",
    "\n",
    "Q11.1: What is the prediction accuracy using the model trained using the train data:   ___\n",
    "\n",
    "Q11.2: What is the prediction accuracy using the model trained using the test data:   ___\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}