{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course 2 week 1 lecture notebook Exercise 03\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Combine features\n",
"\n",
"In this exercise, you will practice how to combine features in a pandas dataframe. This will help you in the graded assignment at the end of the week. \n",
"\n",
"In addition, you will explore why it makes more sense to multiply two features rather than add them in order to create interaction terms.\n",
"\n",
"First, you will generate some data to work with."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Import pandas\n",
"import pandas as pd\n",
"\n",
"# Import a pre-defined function that generates data\n",
"from utils import load_data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Generate features and labels\n",
"X, y = load_data(100)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Age
\n",
"
Systolic_BP
\n",
"
Diastolic_BP
\n",
"
Cholesterol
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
77.196340
\n",
"
78.784208
\n",
"
87.026569
\n",
"
82.760275
\n",
"
\n",
"
\n",
"
1
\n",
"
63.529850
\n",
"
105.171676
\n",
"
83.396113
\n",
"
80.923284
\n",
"
\n",
"
\n",
"
2
\n",
"
69.003986
\n",
"
117.582259
\n",
"
91.161966
\n",
"
92.915422
\n",
"
\n",
"
\n",
"
3
\n",
"
82.638210
\n",
"
94.131208
\n",
"
69.470423
\n",
"
95.766098
\n",
"
\n",
"
\n",
"
4
\n",
"
78.346286
\n",
"
105.385186
\n",
"
87.250583
\n",
"
120.868124
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Age Systolic_BP Diastolic_BP Cholesterol\n",
"0 77.196340 78.784208 87.026569 82.760275\n",
"1 63.529850 105.171676 83.396113 80.923284\n",
"2 69.003986 117.582259 91.161966 92.915422\n",
"3 82.638210 94.131208 69.470423 95.766098\n",
"4 78.346286 105.385186 87.250583 120.868124"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Age', 'Systolic_BP', 'Diastolic_BP', 'Cholesterol'], dtype='object')"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_names = X.columns\n",
"feature_names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Combine strings\n",
"Even though you can visually see feature names and type the name of the combined feature, you can programmatically create interaction features so that you can apply this to any dataframe.\n",
"\n",
"Use f-strings to combine two strings. There are other ways to do this, but Python's f-strings are quite useful."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"name1: Age\n",
"name2: Systolic_BP\n"
]
}
],
"source": [
"name1 = feature_names[0]\n",
"name2 = feature_names[1]\n",
"\n",
"print(f\"name1: {name1}\")\n",
"print(f\"name2: {name2}\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Age_&_Systolic_BP'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Combine the names of two features into a single string, separated by '_&_' for clarity\n",
"combined_names = f\"{name1}_&_{name2}\"\n",
"combined_names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Add two columns\n",
"- Add the values from two columns and put them into a new column.\n",
"- You'll do something similar in this week's assignment."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Age
\n",
"
Systolic_BP
\n",
"
Diastolic_BP
\n",
"
Cholesterol
\n",
"
Age_&_Systolic_BP
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
77.19634
\n",
"
78.784208
\n",
"
87.026569
\n",
"
82.760275
\n",
"
155.980548
\n",
"
\n",
"
\n",
"
1
\n",
"
63.52985
\n",
"
105.171676
\n",
"
83.396113
\n",
"
80.923284
\n",
"
168.701526
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Age Systolic_BP Diastolic_BP Cholesterol Age_&_Systolic_BP\n",
"0 77.19634 78.784208 87.026569 82.760275 155.980548\n",
"1 63.52985 105.171676 83.396113 80.923284 168.701526"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X[combined_names] = X['Age'] + X['Systolic_BP']\n",
"X.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Why we multiply two features instead of adding\n",
"\n",
"Why do you think it makes more sense to multiply two features together rather than adding them together?\n",
"\n",
"Please take a look at two features, and compare what you get when you add them, versus when you multiply them together."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
v1
\n",
"
v2
\n",
"
v1 + v2
\n",
"
v1 x v2
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
1
\n",
"
100
\n",
"
101
\n",
"
100
\n",
"
\n",
"
\n",
"
1
\n",
"
1
\n",
"
200
\n",
"
201
\n",
"
200
\n",
"
\n",
"
\n",
"
2
\n",
"
1
\n",
"
300
\n",
"
301
\n",
"
300
\n",
"
\n",
"
\n",
"
3
\n",
"
2
\n",
"
100
\n",
"
102
\n",
"
200
\n",
"
\n",
"
\n",
"
4
\n",
"
2
\n",
"
200
\n",
"
202
\n",
"
400
\n",
"
\n",
"
\n",
"
5
\n",
"
2
\n",
"
300
\n",
"
302
\n",
"
600
\n",
"
\n",
"
\n",
"
6
\n",
"
3
\n",
"
100
\n",
"
103
\n",
"
300
\n",
"
\n",
"
\n",
"
7
\n",
"
3
\n",
"
200
\n",
"
203
\n",
"
600
\n",
"
\n",
"
\n",
"
8
\n",
"
3
\n",
"
300
\n",
"
303
\n",
"
900
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" v1 v2 v1 + v2 v1 x v2\n",
"0 1 100 101 100\n",
"1 1 200 201 200\n",
"2 1 300 301 300\n",
"3 2 100 102 200\n",
"4 2 200 202 400\n",
"5 2 300 302 600\n",
"6 3 100 103 300\n",
"7 3 200 203 600\n",
"8 3 300 303 900"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Generate a small dataset with two features\n",
"df = pd.DataFrame({'v1': [1,1,1,2,2,2,3,3,3],\n",
" 'v2': [100,200,300,100,200,300,100,200,300]\n",
" })\n",
"\n",
"# add the two features together\n",
"df['v1 + v2'] = df['v1'] + df['v2']\n",
"\n",
"# multiply the two features together\n",
"df['v1 x v2'] = df['v1'] * df['v2']\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It may not be immediately apparent how adding or multiplying makes a difference; either way you get unique values for each of these operations.\n",
"\n",
"To view the data in a more helpful way, rearrange the data (pivot it) so that:\n",
"- feature 1 is the row index \n",
"- feature 2 is the column name. \n",
"- Then set the sum of the two features as the value. \n",
"\n",
"Display the resulting data in a heatmap."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Import seaborn in order to use a heatmap plot\n",
"import seaborn as sns"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"v1 + v2\n",
"\n"
]
},
{
"data": {
"text/html": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Pivot the data so that v1 + v2 is the value\n",
"\n",
"df_add = df.pivot(index='v1',\n",
" columns='v2',\n",
" values='v1 + v2'\n",
" )\n",
"print(\"v1 + v2\\n\")\n",
"display(df_add)\n",
"print()\n",
"sns.heatmap(df_add);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that it doesn't seem like you can easily distinguish clearly when you vary feature 1 (which ranges from 1 to 3), since feature 2 is so much larger in magnitude (100 to 300). This is because you added the two features together."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### View the 'multiply' interaction\n",
"\n",
"Now pivot the data so that:\n",
"- feature 1 is the row index \n",
"- feature 2 is the column name. \n",
"- The values are 'v1 x v2' \n",
"\n",
"Use a heatmap to visualize the table."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"v1 x v2\n"
]
},
{
"data": {
"text/html": [
"