{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Recap your Python skills\n",
"\n",
"BMED360-2021 `00-recap-python.ipynb`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Learning outcomes\n",
"\n",
"- Construct simple functions for editing of strings\n",
"- Build a `train_test_splitter` from scratch\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### For using Colab\n",
"**--> Some of the following libraries may need to be pip-installed (i.e. uncomment the relevant pip commands below):**"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# These might not be necessary to pip install on Colab:\n",
"#!pip install matplotlib\n",
"#!pip install gdown\n",
"#!pip install envoy\n",
"#!pip install scikit-learn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Download compressed file with data, assets, and solutions from Google Drive using `gdown`**"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"import gdown\n",
"import shutil\n",
"import os\n",
"import sys"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"./assets exists already!\n"
]
}
],
"source": [
"if not os.path.isdir('./assets'):\n",
"    ## Download 'data_assets_sol_utils.tar.gz' from Google Drive\n",
"    # https://drive.google.com/file/d/1rCGmA2K_Q1TcgYz4VIR-SvXwdxw5zqbh/view?usp=sharing\n",
"    file_id = '1rCGmA2K_Q1TcgYz4VIR-SvXwdxw5zqbh'  # only the ID, without the '/view?usp=sharing' suffix\n",
"    url = 'https://drive.google.com/uc?id=%s' % file_id\n",
"    output = './data_assets_sol_utils.tar.gz'\n",
"    gdown.download(url, output, quiet=False)\n",
"\n",
"    ## Unpack the archive into `./assets`, `./data`, `./solutions` and `utils.py`\n",
"    shutil.unpack_archive(output, '.')\n",
"\n",
"    ## Delete the downloaded archive\n",
"    os.remove(output)\n",
"else:\n",
"    print('./assets exists already!')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Ex0.1. Write a function `splitter` that takes a string and splits it into the substrings separated by `\\`.\n",
"\n",
"For instance\n",
"\n",
"`splitter(\"C:\\Users\\Peter\")`\n",
"\n",
"should return the list\n",
"\n",
"`[\"C:\", \"Users\", \"Peter\"]`\n",
"\n",
"\n",
"**Note:** You will have to find out yourself which string method to apply. Use whatever means possible (the Tab trick, the internet...) to find a solution.\n",
"\n",
"**Hint:** you will likely get an error, because the backslash `\\` is the escape character in Python strings. Write the path as a raw string, e.g. `r\"C:\\Users\\Peter\"`, and the separator as `\"\\\\\"` (a raw string cannot end in a single backslash, so `r\"\\\"` is a syntax error)."
]
},
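{
"cell_type": "markdown",
"metadata": {},
"source": [
"The escaping rule from the hint can be checked directly. This is a minimal sketch that reuses the path from the example above; it does not solve the exercise."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# \"\\\\\" is a single backslash character: the first backslash escapes the second\n",
"print(len(\"\\\\\"))  # prints 1\n",
"\n",
"# A raw string keeps backslashes literally, so both spellings give the same string\n",
"print(r\"C:\\Users\\Peter\" == \"C:\\\\Users\\\\Peter\")  # prints True"
]
},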
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# %load solutions/ex0_1.py"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['C:', 'Users', 'Peter']"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# run this to test your code\n",
"st = r\"C:\\Users\\Peter\"\n",
"splitter(st)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Ex0.2. Make the function more flexible by letting the caller choose the separator. Call the new version `splitter2`.\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# %load solutions/ex0_2.py"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['GivenName', 'FamilyName']"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"st = r\"GivenName_FamilyName\"\n",
"splitter2(st, '_')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Ex0.3. Modify the code to take a list of strings instead of a single string, and split each of them. Call the new version `splitter3`."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# %load solutions/ex0_3.py"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[['C:', 'Users', 'Peter'],\n",
" ['C:', 'Users', 'arvid', 'GitHub', 'computational-medicine', 'BMED360-2021']]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sts = [r\"C:\\Users\\Peter\",\n",
" r\"C:\\Users\\arvid\\GitHub\\computational-medicine\\BMED360-2021\"]\n",
"\n",
"splitter3(sts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data representation, numpy and matrices\n",
"A dataset can usually be represented in a tabular format. By convention, each row represents an individual sample and each column a feature. This corresponds to a matrix of shape rows x columns."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
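{
"data": {
"text/html": [
"<img src=\"https://cdn.kastatic.org/googleusercontent/_anqPXDhdx2MuQIN7S9F-nYDbxNVMFfrKL-bgihYpi1iqa-bi5Gggwy8k70xZgZ0j84IzMKQDg2VusdRgoUens4\" width=\"500\" height=\"500\"/>"
],
"text/plain": [
"<IPython.core.display.Image object>"
]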
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.display import Image\n",
"Image(url=\"https://cdn.kastatic.org/googleusercontent/_anqPXDhdx2MuQIN7S9F-nYDbxNVMFfrKL-bgihYpi1iqa-bi5Gggwy8k70xZgZ0j84IzMKQDg2VusdRgoUens4\", width=500, height=500)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# numpy and pandas are two of the most ubiquitous python libraries, so get used to seeing this.\n",
"import pandas as pd #this is used for working with tables\n",
"import numpy as np #this is used for working with matrices and vectors"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n"
]
}
],
"source": [
"diabetes = 'data/diabetes.csv' # provide the path of the dataset\n",
"df = pd.read_csv(diabetes) # this loads the file into memory\n",
"\n",
"print(type(df))"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>pregnancies</th><th>glucose</th><th>diastolic</th><th>triceps</th><th>insulin</th><th>bmi</th><th>dpf</th><th>age</th><th>diabetes</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>6</td><td>148</td><td>72</td><td>35</td><td>0</td><td>33.6</td><td>0.627</td><td>50</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>85</td><td>66</td><td>29</td><td>0</td><td>26.6</td><td>0.351</td><td>31</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>8</td><td>183</td><td>64</td><td>0</td><td>0</td><td>23.3</td><td>0.672</td><td>32</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>89</td><td>66</td><td>23</td><td>94</td><td>28.1</td><td>0.167</td><td>21</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>137</td><td>40</td><td>35</td><td>168</td><td>43.1</td><td>2.288</td><td>33</td><td>1</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" pregnancies glucose diastolic triceps insulin bmi dpf age \\\n",
"0 6 148 72 35 0 33.6 0.627 50 \n",
"1 1 85 66 29 0 26.6 0.351 31 \n",
"2 8 183 64 0 0 23.3 0.672 32 \n",
"3 1 89 66 23 94 28.1 0.167 21 \n",
"4 0 137 40 35 168 43.1 2.288 33 \n",
"\n",
" diabetes \n",
"0 1 \n",
"1 0 \n",
"2 1 \n",
"3 0 \n",
"4 1 "
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head() # displays the first 5 rows"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the rightmost column represents the diagnosis (what we consider the target or label). This is our `y`.\n",
"\n",
"The `values` attribute of a dataframe (table) returns its entries as a numpy array (a matrix)."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'numpy.ndarray'>\n"
]
},
{
"data": {
"text/plain": [
"array([[ 6. , 148. , 72. , ..., 0.627, 50. , 1. ],\n",
" [ 1. , 85. , 66. , ..., 0.351, 31. , 0. ],\n",
" [ 8. , 183. , 64. , ..., 0.672, 32. , 1. ],\n",
" ...,\n",
" [ 5. , 121. , 72. , ..., 0.245, 30. , 0. ],\n",
" [ 1. , 126. , 60. , ..., 0.349, 47. , 1. ],\n",
" [ 1. , 93. , 70. , ..., 0.315, 23. , 0. ]])"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = df.values\n",
"print(type(data))\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(768, 9)\n",
"Sample size = 768\n"
]
}
],
"source": [
"# shape is a fundamental attribute of a matrix. The convention is (rows x cols). This is important!\n",
"print(data.shape)\n",
"\n",
"# so the size (N) of the dataset is just\n",
"print(f\"Sample size = {data.shape[0]}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Indexing works as follows for matrices (or tables):"
]
},
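{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the indexing rules, using a small made-up array `toy` (an assumption for illustration, not part of the course data). The same expressions should work on `data`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# A toy matrix with 3 rows (samples) and 4 columns; the values are made up\n",
"toy = np.array([[1, 2, 3, 0],\n",
"                [4, 5, 6, 1],\n",
"                [7, 8, 9, 0]])\n",
"\n",
"print(toy[0])       # first row\n",
"print(toy[:, 0])    # first column\n",
"print(toy[1, 2])    # single entry in row 1, column 2 (counting from 0)\n",
"print(toy[:2])      # first two rows\n",
"print(toy[:, -1])   # last column\n",
"print(toy[[0, 2]])  # rows picked by a list of row numbers"
]
},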
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build a `train_test_splitter` from scratch"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Ex0.4a. Make a function which takes a number and a percentage, and reduces that number by that percentage.\n",
"**Note:** the output should be rounded to the closest integer."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# %load solutions/ex0_4a.py"
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true
},
"source": [
"#### Ex0.4b. Make a function which splits a list of numbers from 0-N into two subsets of variable size.\n",
"\n",
"In other words, we want the output to be two lists, such that they in total contain all the numbers from 0 to N. Their relative size should be adjustable by using a parameter `p`, as a percentage.\n",
"\n",
"**Hint:** use `np.random.choice`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Ex0.4b. Given an array of indices, write a function `data_splitter` that returns the rows of a data matrix corresponding to those indices."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"# %load solutions/ex0_4b.py"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 6. 148. 72. 35. 0. 33.6 0.627 50. 1. ]\n",
" [ 8. 183. 64. 0. 0. 23.3 0.672 32. 1. ]]\n"
]
}
],
"source": [
"indeces = np.array([0,2]) # select the first and third row\n",
"print(data_splitter(data, indeces))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Ex0.4c. Given an integer N, write a function `get_indeces` which splits the integers 0 to N-1 into two non-overlapping subsets (e.g. training indices and test indices).\n",
"\n",
"You should be able to call it like `get_indeces(N, p)`, where p is the proportion (between 0 and 1) determining the relative size of the training set.\n",
"\n",
"**Hint:** start by creating a vector of the integers 0 to N-1 using `np.arange(N)`."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# %load solutions/ex0_4c.py"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(array([0, 1, 2, 3, 4, 5, 6, 7]), array([8, 9]))"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# test your implementation\n",
"get_indeces(10, .8)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Problem:** what if the data is sorted by label? A contiguous split could then leave some classes out of the training set or the test set. Make the selection random."
]
},
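{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to see the problem, sketched on made-up labels. The array `y` and the seed are assumptions for illustration, and shuffling with a permutation is only one possible fix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Made-up labels sorted by class: eight 0s followed by two 1s\n",
"y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])\n",
"\n",
"# Contiguous 80/20 split: the training part never sees class 1\n",
"print(y[:8], y[8:])\n",
"\n",
"# Shuffle the indices first, then split (seed fixed only for reproducibility)\n",
"rng = np.random.default_rng(0)\n",
"shuffled = rng.permutation(len(y))\n",
"print(y[shuffled[:8]], y[shuffled[8:]])"
]
},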
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Ex0.4d. Now chain it all together into a single function `tts`.\n",
"\n",
"It should take two inputs, the data matrix and `p`, and return two numpy arrays: one containing the training samples and one containing the test samples. Give `p` a default value, since the test below calls `tts(data)` without it (the output shown corresponds to `p = 0.75`)."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# %load solutions/ex0_4d.py"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(576, 9) \n",
" (192, 9)\n"
]
}
],
"source": [
"# test your implementation\n",
"train, test = tts(data)\n",
"print(train.shape,'\\n',test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Ex0.4e. Final adjustment: we also want to separate X from y. Edit the function above to return `X_train`, `X_test`, `y_train`, `y_test`, in that order (the order in which the test below unpacks them).\n",
"\n",
"Remember that the label (y) is the last column."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"# %load solutions/ex0_4e.py"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(576, 8)\n",
"(576,)\n",
"(192, 8)\n",
"(192,)\n"
]
}
],
"source": [
"# Run this to test. Do the results make sense?\n",
"\n",
"X_train, X_test, y_train, y_test = tts(data)\n",
"for d in [X_train, y_train, X_test, y_test]:\n",
" print(d.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "BMED360",
"language": "python",
"name": "bmed360"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}