{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Recap your python skills\n",
    "\n",
    "BMED360-2021  `00-recap-python.ipynb`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/computational-medicine/BMED360-2021/tree/main/Lab2-ML-tissue-classification/00-recap-python.ipynb\">\n",
    "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
    "</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Learning outcome:\n",
    "\n",
    "- Construct simple functions for editing of strings\n",
    "- Build a `train_test_splitter` from scratch\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### For using Colab\n",
    "**--> (some of) the following libraries must be pip installed (i.e. uncommet these among the following pip commands):**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# These might not be necessary to pip install on colab:\n",
    "#!pip install matplotlib\n",
    "#!pip install gdown\n",
    "#!pip install envoy\n",
    "#!pip install sklearn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Download compressed file with data, assets, and solutions from Google Drive using `gdown`**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gdown\n",
    "import shutil\n",
    "import os\n",
    "import sys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "./assets  exists already!\n"
     ]
    }
   ],
   "source": [
    "if os.path.isdir('./assets') == False:\n",
    "    import envoy\n",
    "    \n",
    "    ## Download 'data_assets_sol_utils.tgz' for Google Drive         \n",
    "    # https://drive.google.com/file/d/1rCGmA2K_Q1TcgYz4VIR-SvXwdxw5zqbh/view?usp=sharing\n",
    "    \n",
    "    # https://drive.google.com/file/d/1rCGmA2K_Q1TcgYz4VIR-SvXwdxw5zqbh/view?usp=sharing\n",
    "        \n",
    "        \n",
    "    file_id = '1rCGmA2K_Q1TcgYz4VIR-SvXwdxw5zqbh/view?usp=sharing'\n",
    "    url = 'https://drive.google.com/uc?id=%s' % file_id\n",
    "    output = './data_assets_sol_utils.tar.gz'\n",
    "    gdown.download(url, output, quiet=False)\n",
    "    \n",
    "    ## Untar the data_assets_sol_utils.tar.gz file into `./assets` `./data` `./solutions` and `utils.py` \n",
    "    shutil.unpack_archive(output, '.')\n",
    "    #envoy.run(\"tar xzf %s -C %s\" % (output, .))\n",
    "    !tar zxvf data_assets_sol_utils.tar.gz .\n",
    "    \n",
    "    ## Delete the 'data_assets_sol_utils.tgz' file\n",
    "    os.remove(output)\n",
    "else:\n",
    "    print(f'./assets  exists already!')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Ex0.1. Write a function `splitter` which inputs a string and breaks that string into substrings separated by `\\`. \n",
    "\n",
    "For instance\n",
    "\n",
    "`splitter(\"C:\\Users\\Peter\")` \n",
    "\n",
    "should return the list\n",
    "\n",
    "`[\"C:\", \"Users\", \"Peter\"]`\n",
    "\n",
    "\n",
    "**Note:** You will have to find out yourself what string method to apply. Use whatever means possible (the Tab trick, the internet...) to find a solution.\n",
    "\n",
    "**Hint:** you will likely get a formatting error, because `\"\\\"` has a special function in string (escape character). Use a raw string instead: `r\"\\\"`, or `\"\\\\\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load solutions/ex0_1.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['C:', 'Users', 'Peter']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# run this to test your code\n",
    "st = r\"C:\\Users\\Peter\"\n",
    "splitter(st)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Ex0.2. Modify the function to be more flexible, so it can decide what kind of separator to use.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load solutions/ex0_2.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['GivenName', 'FamilyName']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "st = r\"GivenName_FamilyName\"\n",
    "splitter2(st, '_')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Ex0.3. Modify the code to take not a single, but a list of strings and perform the action on each one of them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load solutions/ex0_3.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['C:', 'Users', 'Peter'],\n",
       " ['C:', 'Users', 'arvid', 'GitHub', 'computational-medicine', 'BMED360-2021']]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sts = [r\"C:\\Users\\Peter\",\n",
    "       r\"C:\\Users\\arvid\\GitHub\\computational-medicine\\BMED360-2021\"]\n",
    "\n",
    "splitter3(sts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data representation, numpy and matrices\n",
    "A dataset can usually be represented in a tabular format. The convenvtion is to have the rows representing an individual sample, and the columns are features. This corresponds to a matrix of shape rows x columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://cdn.kastatic.org/googleusercontent/_anqPXDhdx2MuQIN7S9F-nYDbxNVMFfrKL-bgihYpi1iqa-bi5Gggwy8k70xZgZ0j84IzMKQDg2VusdRgoUens4\" width=\"500\" height=\"500\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.display import Image\n",
    "Image(url=\"https://cdn.kastatic.org/googleusercontent/_anqPXDhdx2MuQIN7S9F-nYDbxNVMFfrKL-bgihYpi1iqa-bi5Gggwy8k70xZgZ0j84IzMKQDg2VusdRgoUens4\", width=500, height=500)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# numpy and pandas are two of the most ubiquitous python libraries, so get used to seeing this.\n",
    "import pandas as pd #this is used for working with tables\n",
    "import numpy as np #this is used for working with matrices and vectors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n"
     ]
    }
   ],
   "source": [
    "diabetes = 'data/diabetes.csv' # provide the path of the dataset\n",
    "df = pd.read_csv(diabetes) # this loads the file into memory\n",
    "\n",
    "print(type(df))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pregnancies</th>\n",
       "      <th>glucose</th>\n",
       "      <th>diastolic</th>\n",
       "      <th>triceps</th>\n",
       "      <th>insulin</th>\n",
       "      <th>bmi</th>\n",
       "      <th>dpf</th>\n",
       "      <th>age</th>\n",
       "      <th>diabetes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>6</td>\n",
       "      <td>148</td>\n",
       "      <td>72</td>\n",
       "      <td>35</td>\n",
       "      <td>0</td>\n",
       "      <td>33.6</td>\n",
       "      <td>0.627</td>\n",
       "      <td>50</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>85</td>\n",
       "      <td>66</td>\n",
       "      <td>29</td>\n",
       "      <td>0</td>\n",
       "      <td>26.6</td>\n",
       "      <td>0.351</td>\n",
       "      <td>31</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>8</td>\n",
       "      <td>183</td>\n",
       "      <td>64</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>23.3</td>\n",
       "      <td>0.672</td>\n",
       "      <td>32</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>89</td>\n",
       "      <td>66</td>\n",
       "      <td>23</td>\n",
       "      <td>94</td>\n",
       "      <td>28.1</td>\n",
       "      <td>0.167</td>\n",
       "      <td>21</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>137</td>\n",
       "      <td>40</td>\n",
       "      <td>35</td>\n",
       "      <td>168</td>\n",
       "      <td>43.1</td>\n",
       "      <td>2.288</td>\n",
       "      <td>33</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pregnancies  glucose  diastolic  triceps  insulin   bmi    dpf  age  \\\n",
       "0            6      148         72       35        0  33.6  0.627   50   \n",
       "1            1       85         66       29        0  26.6  0.351   31   \n",
       "2            8      183         64        0        0  23.3  0.672   32   \n",
       "3            1       89         66       23       94  28.1  0.167   21   \n",
       "4            0      137         40       35      168  43.1  2.288   33   \n",
       "\n",
       "   diabetes  \n",
       "0         1  \n",
       "1         0  \n",
       "2         1  \n",
       "3         0  \n",
       "4         1  "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head() # displays the 5 first rows"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the rightmost column represents the diagnosis (what we consider the target or label). This is our `y`.\n",
    "\n",
    "The `values` attribute of a dataframe (table) returns the numpy matrix of the entries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'numpy.ndarray'>\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],\n",
       "       [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],\n",
       "       [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],\n",
       "       ...,\n",
       "       [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],\n",
       "       [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],\n",
       "       [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = df.values\n",
    "print(type(data))\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(768, 9)\n",
      "Sample size = 768\n"
     ]
    }
   ],
   "source": [
    "# shape is a fundamental attribute of a matrix. The convention is (rows x cols). This is important!\n",
    "print(data.shape)\n",
    "\n",
    "# so the size (N) of the dataset is just\n",
    "print(f\"Sample size = {data.shape[0]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Indexing works as follows for matrices (or tables):"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. We are gonna build a train_test_splitter from scratch."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Ex0.4a. Make a function which inputs a number and a percentage, and reduces that number by the that percentage.\n",
    "**Note:** the output should be rounder to the closest integer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load solutions/ex0_4a.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "#### Ex0.4b. Make a function which splits a list of numbers from 0-N into two subsets of variable size.\n",
    "\n",
    "In other words, we want the output to be two lists, such that they in total contain all the numbers from 0 to N. Their relative size should be adjustable by using a parameter `p`, as a percentage.\n",
    "\n",
    "**Hint:** use `np.random.choice`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Ex0.4b. Provided an array of indeces, make a function `data_splitter` which returns the rows corresponding to those indexes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load solutions/ex0_4b.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[  6.    148.     72.     35.      0.     33.6     0.627  50.      1.   ]\n",
      " [  8.    183.     64.      0.      0.     23.3     0.672  32.      1.   ]]\n"
     ]
    }
   ],
   "source": [
    "indeces = np.array([0,2]) # select the first and third row\n",
    "print(data_splitter(data, indeces))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Ex0.4c. Provided an integer N, write a function `get_indeces` which splits the numbers (0-N) into two non-overlapping subsets (e.g. training indeces and test indeces).\n",
    "\n",
    "You should be able to call it like `get_indeces(N, p)`, where p is the proportion (between 0 and 1) determining the relative size of the training set.\n",
    "\n",
    "**Hint:** start by creating a vector of the integers 0-N using `np.arange(N)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load solutions/ex0_4c.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([0, 1, 2, 3, 4, 5, 6, 7]), array([8, 9]))"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# test your implementation\n",
    "get_indeces(10, .8)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Problem:** what if the data is ordered based on label? Make the selection random."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Ex0.4d. Now chain it all together into a single function `tts`.\n",
    "\n",
    "It should take two inputs: the data matrix and p, and return two numpy arrays, one containing the training samples, one containing the test samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load solutions/ex0_4d.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(576, 9) \n",
      " (192, 9)\n"
     ]
    }
   ],
   "source": [
    "# test your implementation\n",
    "train, test = tts(data)\n",
    "print(train.shape,'\\n',test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Ex0.4e. Final adjustment: we want to separate X from y also. Edit the function above to return the values X_train, y_train, X_test, y_test.\n",
    "\n",
    "Remember that the label (y) is the last column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load solutions/ex0_4e.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(576, 8)\n",
      "(576,)\n",
      "(192, 8)\n",
      "(192,)\n"
     ]
    }
   ],
   "source": [
    "# Run this to test. Do the results make sense?\n",
    "\n",
    "X_train, X_test, y_train, y_test = tts(data)\n",
    "for d in [X_train, y_train, X_test, y_test]:\n",
    "    print(d.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "BMED360",
   "language": "python",
   "name": "bmed360"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}