{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Recap your Python skills\n", "\n", "BMED360-2021 `00-recap-python.ipynb`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a href=\"https://colab.research.google.com/github/computational-medicine/BMED360-2021/tree/main/Lab2-ML-tissue-classification/00-recap-python.ipynb\">\n", " <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n", "</a>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning outcome:\n", "\n", "- Construct simple functions for editing strings\n", "- Build a `train_test_splitter` from scratch\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### For using Colab\n", "**--> Some of the following libraries may need to be pip-installed (i.e. uncomment the relevant pip commands below):**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# These might not be necessary to pip install on Colab:\n", "#!pip install matplotlib\n", "#!pip install gdown\n", "#!pip install envoy\n", "#!pip install scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Download the compressed file with data, assets, and solutions from Google Drive using `gdown`**" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "import gdown\n", "import shutil\n", "import os\n", "import sys" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "./assets exists already!\n" ] } ], "source": [ "if not os.path.isdir('./assets'):\n", " import envoy\n", " \n", " ## Download 'data_assets_sol_utils.tar.gz' from Google Drive\n", " # https://drive.google.com/file/d/1rCGmA2K_Q1TcgYz4VIR-SvXwdxw5zqbh/view?usp=sharing\n", " \n", " file_id = 
'1rCGmA2K_Q1TcgYz4VIR-SvXwdxw5zqbh'\n", " url = 'https://drive.google.com/uc?id=%s' % file_id\n", " output = './data_assets_sol_utils.tar.gz'\n", " gdown.download(url, output, quiet=False)\n", " \n", " ## Untar the data_assets_sol_utils.tar.gz file into `./assets`, `./data`, `./solutions` and `utils.py`\n", " shutil.unpack_archive(output, '.')\n", " # Shell alternative: !tar xzf data_assets_sol_utils.tar.gz -C .\n", " \n", " ## Delete the 'data_assets_sol_utils.tar.gz' file\n", " os.remove(output)\n", "else:\n", " print('./assets exists already!')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Ex0.1. Write a function `splitter` which takes a string as input and breaks that string into substrings separated by `\\`. \n", "\n", "For instance\n", "\n", "`splitter(\"C:\\Users\\Peter\")` \n", "\n", "should return the list\n", "\n", "`[\"C:\", \"Users\", \"Peter\"]`\n", "\n", "\n", "**Note:** You will have to find out yourself which string method to apply. Use whatever means possible (the Tab trick, the internet...) to find a solution.\n", "\n", "**Hint:** you will likely get a syntax error, because the backslash has a special function in strings (it is the escape character). Write the separator as `\"\\\\\"`. A raw string cannot end in a single backslash, so `r\"\\\"` is not valid, although a raw string such as `r\"C:\\Users\\Peter\"` works for the input." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# %load solutions/ex0_1.py" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['C:', 'Users', 'Peter']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# run this to test your code\n", "st = r\"C:\\Users\\Peter\"\n", "splitter(st)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Ex0.2. 
Modify the function to be more flexible, so that the caller can choose which separator to use. Call the new function `splitter2`.\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# %load solutions/ex0_2.py" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['GivenName', 'FamilyName']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "st = r\"GivenName_FamilyName\"\n", "splitter2(st, '_')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Ex0.3. Modify the code to take a list of strings instead of a single string, and perform the split on each of them. Call the new function `splitter3`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# %load solutions/ex0_3.py" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['C:', 'Users', 'Peter'],\n", " ['C:', 'Users', 'arvid', 'GitHub', 'computational-medicine', 'BMED360-2021']]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sts = [r\"C:\\Users\\Peter\",\n", " r\"C:\\Users\\arvid\\GitHub\\computational-medicine\\BMED360-2021\"]\n", "\n", "splitter3(sts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data representation, numpy and matrices\n", "A dataset can usually be represented in a tabular format. The convention is that each row represents an individual sample and each column represents a feature. This corresponds to a matrix of shape (rows x columns), i.e. (samples x features)."
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<img src=\"https://cdn.kastatic.org/googleusercontent/_anqPXDhdx2MuQIN7S9F-nYDbxNVMFfrKL-bgihYpi1iqa-bi5Gggwy8k70xZgZ0j84IzMKQDg2VusdRgoUens4\" width=\"500\" height=\"500\"/>" ], "text/plain": [ "<IPython.core.display.Image object>" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import Image\n", "Image(url=\"https://cdn.kastatic.org/googleusercontent/_anqPXDhdx2MuQIN7S9F-nYDbxNVMFfrKL-bgihYpi1iqa-bi5Gggwy8k70xZgZ0j84IzMKQDg2VusdRgoUens4\", width=500, height=500)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# numpy and pandas are two of the most ubiquitous python libraries, so get used to seeing this.\n", "import pandas as pd #this is used for working with tables\n", "import numpy as np #this is used for working with matrices and vectors" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n" ] } ], "source": [ "diabetes = 'data/diabetes.csv' # provide the path of the dataset\n", "df = pd.read_csv(diabetes) # this loads the file into memory\n", "\n", "print(type(df))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>pregnancies</th>\n", " <th>glucose</th>\n", " <th>diastolic</th>\n", " <th>triceps</th>\n", " <th>insulin</th>\n", " <th>bmi</th>\n", " 
<th>dpf</th>\n", " <th>age</th>\n", " <th>diabetes</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>6</td>\n", " <td>148</td>\n", " <td>72</td>\n", " <td>35</td>\n", " <td>0</td>\n", " <td>33.6</td>\n", " <td>0.627</td>\n", " <td>50</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>85</td>\n", " <td>66</td>\n", " <td>29</td>\n", " <td>0</td>\n", " <td>26.6</td>\n", " <td>0.351</td>\n", " <td>31</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>8</td>\n", " <td>183</td>\n", " <td>64</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>23.3</td>\n", " <td>0.672</td>\n", " <td>32</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1</td>\n", " <td>89</td>\n", " <td>66</td>\n", " <td>23</td>\n", " <td>94</td>\n", " <td>28.1</td>\n", " <td>0.167</td>\n", " <td>21</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0</td>\n", " <td>137</td>\n", " <td>40</td>\n", " <td>35</td>\n", " <td>168</td>\n", " <td>43.1</td>\n", " <td>2.288</td>\n", " <td>33</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " pregnancies glucose diastolic triceps insulin bmi dpf age \\\n", "0 6 148 72 35 0 33.6 0.627 50 \n", "1 1 85 66 29 0 26.6 0.351 31 \n", "2 8 183 64 0 0 23.3 0.672 32 \n", "3 1 89 66 23 94 28.1 0.167 21 \n", "4 0 137 40 35 168 43.1 2.288 33 \n", "\n", " diabetes \n", "0 1 \n", "1 0 \n", "2 1 \n", "3 0 \n", "4 1 " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head() # displays the 5 first rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the rightmost column represents the diagnosis (what we consider the target or label). This is our `y`.\n", "\n", "The `values` attribute of a dataframe (table) returns the numpy matrix of the entries." 
 ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'numpy.ndarray'>\n" ] }, { "data": { "text/plain": [ "array([[ 6. , 148. , 72. , ..., 0.627, 50. , 1. ],\n", " [ 1. , 85. , 66. , ..., 0.351, 31. , 0. ],\n", " [ 8. , 183. , 64. , ..., 0.672, 32. , 1. ],\n", " ...,\n", " [ 5. , 121. , 72. , ..., 0.245, 30. , 0. ],\n", " [ 1. , 126. , 60. , ..., 0.349, 47. , 1. ],\n", " [ 1. , 93. , 70. , ..., 0.315, 23. , 0. ]])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = df.values\n", "print(type(data))\n", "data" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(768, 9)\n", "Sample size = 768\n" ] } ], "source": [ "# shape is a fundamental attribute of a matrix. The convention is (rows x cols). This is important!\n", "print(data.shape)\n", "\n", "# so the size (N) of the dataset is just the number of rows\n", "print(f\"Sample size = {data.shape[0]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indexing works as follows for matrices (or tables):" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. We are going to build a `train_test_splitter` from scratch." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Ex0.4a. Make a function which takes a number and a percentage as input, and reduces that number by that percentage.\n", "**Note:** the output should be rounded to the nearest integer." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# %load solutions/ex0_4a.py" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "#### Ex0.4b. 
Make a function which splits a list of the numbers from 0 to N-1 into two subsets of variable size.\n", "\n", "In other words, we want the output to be two lists which together contain all the numbers from 0 to N-1. Their relative size should be adjustable through a parameter `p`, given as a percentage.\n", "\n", "**Hint:** use `np.random.choice`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Ex0.4b. Given an array of indices, make a function `data_splitter` which returns the rows corresponding to those indices." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# %load solutions/ex0_4b.py" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 6. 148. 72. 35. 0. 33.6 0.627 50. 1. ]\n", " [ 8. 183. 64. 0. 0. 23.3 0.672 32. 1. ]]\n" ] } ], "source": [ "indeces = np.array([0,2]) # select the first and third row\n", "print(data_splitter(data, indeces))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Ex0.4c. Given an integer N, write a function `get_indeces` which splits the integers 0 to N-1 into two non-overlapping subsets (e.g. training indices and test indices).\n", "\n", "You should be able to call it like `get_indeces(N, p)`, where `p` is the proportion (between 0 and 1) determining the relative size of the training set.\n", "\n", "**Hint:** start by creating a vector of the integers 0 to N-1 using `np.arange(N)`."
 ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# %load solutions/ex0_4c.py" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([0, 1, 2, 3, 4, 5, 6, 7]), array([8, 9]))" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# test your implementation\n", "get_indeces(10, .8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Problem:** what if the data is ordered by label? A sequential split like the one above could then leave the test set with samples from only one class. Make the selection random." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Ex0.4d. Now chain it all together into a single function `tts`.\n", "\n", "It should take two inputs, the data matrix and `p`, and return two numpy arrays: one containing the training samples and one containing the test samples. Give `p` a default value so that `tts(data)` also works; the output of the test cell corresponds to a 75/25 split." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# %load solutions/ex0_4d.py" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(576, 9) \n", " (192, 9)\n" ] } ], "source": [ "# test your implementation\n", "train, test = tts(data)\n", "print(train.shape,'\\n',test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Ex0.4e. Final adjustment: we also want to separate X from y. Edit the function above to return the values `X_train`, `X_test`, `y_train`, `y_test`, in that order.\n", "\n", "Remember that the label (y) is the last column." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "# %load solutions/ex0_4e.py" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(576, 8)\n", "(576,)\n", "(192, 8)\n", "(192,)\n" ] } ], "source": [ "# Run this to test. 
Do the results make sense?\n", "\n", "X_train, X_test, y_train, y_test = tts(data)\n", "for d in [X_train, y_train, X_test, y_test]:\n", " print(d.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "BMED360", "language": "python", "name": "bmed360" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 4 }