{
"cells": [
{
"cell_type": "markdown",
"id": "99c606f8-037f-4258-81e7-a9f4ac511242",
"metadata": {},
"source": [
"# Introduction to working with DataFrames\n",
"In basic Python, we often collect our measurements in dictionaries whose values are lists (vectors). These basic structures are handy for collecting data, but they are suboptimal for further data processing. For that we introduce [pandas DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), which are more convenient in the next steps. In Python, scientists often call tables \"DataFrames\"."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "0cfceb6c-1acc-4632-b084-8b0871a7c50a",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "8b77888b-c9a8-4a67-a4eb-f7df46eda970",
"metadata": {},
"source": [
"## Creating DataFrames from a dictionary of lists\n",
"Assume we did some image processing and have the results available in a dictionary that contains lists of numbers:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "ff80484f-657b-4231-8d8f-cdc26577542b",
"metadata": {},
"outputs": [],
"source": [
"measurements = {\n",
" \"labels\": [1, 2, 3],\n",
" \"area\": [45, 23, 68],\n",
" \"minor_axis\": [2, 4, 4],\n",
" \"major_axis\": [3, 4, 5],\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "b2afa6a9-e15c-4147-bdd4-ec4d4f87fb36",
"metadata": {},
"source": [
"This data structure can be nicely visualized using a DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "8bf4e4b5-ef72-4f63-84d2-48cc3a77c297",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" labels area minor_axis major_axis\n",
"0 1 45 2 3\n",
"1 2 23 4 4\n",
"2 3 68 4 5"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(measurements)\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "930c082b-8f16-4711-b3e0-e56a7ec6d272",
"metadata": {},
"source": [
"Using these DataFrames, data modification is straightforward. For example, one can append a new column and compute its values from existing columns:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a34866ff-a2cb-4a7c-a4e8-4544559b634c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" labels area minor_axis major_axis aspect_ratio\n",
"0 1 45 2 3 1.50\n",
"1 2 23 4 4 1.00\n",
"2 3 68 4 5 1.25"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"aspect_ratio\"] = df[\"major_axis\"] / df[\"minor_axis\"]\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "201a2142-22c7-4607-bc2d-f1dfce4c7e26",
"metadata": {},
"source": [
"## Saving data frames\n",
"We can also save this table to a CSV file so that we can continue working with it later."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "fb01d2d9-4d8b-4b6a-b158-9516a581e000",
"metadata": {},
"outputs": [],
"source": [
"df.to_csv(\"../../data/short_table.csv\")"
]
},
{
"cell_type": "markdown",
"id": "0240857d-292f-4ac3-ba87-8878aa941cde",
"metadata": {},
"source": [
"## Creating DataFrames from lists of lists\n",
"Sometimes, we are confronted with data in the form of lists of lists. To make pandas understand that form of data correctly, we also need to provide the headers in the same order as the lists."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "c72a82b1-4da6-468d-afa6-149cb00f7d37",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 0 1 2\n",
"labels 1 2 3\n",
"area 45 23 68\n",
"minor_axis 2 4 4\n",
"major_axis 3 4 5"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"header = ['labels', 'area', 'minor_axis', 'major_axis']\n",
"\n",
"data = [\n",
"    [1, 2, 3],\n",
"    [45, 23, 68],\n",
"    [2, 4, 4],\n",
"    [3, 4, 5],\n",
"]\n",
"\n",
"# convert the data and header lists into a pandas data frame;\n",
"# each header labels one row, because each inner list holds one kind of measurement\n",
"data_frame = pd.DataFrame(data, index=header)\n",
"\n",
"# show it\n",
"data_frame"
]
},
{
"cell_type": "markdown",
"id": "a8b1b6b0-027c-4536-8710-e3f87aca1896",
"metadata": {},
"source": [
"As you can see, this table is _rotated_. We can bring it into the usual form like this:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "40669e82-4264-4883-9c4e-8a366b061610",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" labels area minor_axis major_axis\n",
"0 1 45 2 3\n",
"1 2 23 4 4\n",
"2 3 68 4 5"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# transpose it: swap rows and columns\n",
"data_frame = data_frame.transpose()\n",
"\n",
"# show it\n",
"data_frame"
]
},
{
"cell_type": "markdown",
"id": "ccf08662-fccf-4dc1-91c2-3365fa85a96b",
"metadata": {},
"source": [
"## Loading data frames\n",
"Tables can also be read from CSV files."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "aa7c74db-68ab-4004-aa5e-01ba1ad88c79",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Unnamed: 0 area mean_intensity minor_axis_length major_axis_length \\\n",
"0 0 422 192.379147 16.488550 34.566789 \n",
"1 1 182 180.131868 11.736074 20.802697 \n",
"2 2 661 205.216339 28.409502 30.208433 \n",
"3 3 437 216.585812 23.143996 24.606130 \n",
"4 4 476 212.302521 19.852882 31.075106 \n",
".. ... ... ... ... ... \n",
"56 56 211 185.061611 14.522762 18.489138 \n",
"57 57 78 185.230769 6.028638 17.579799 \n",
"58 58 86 183.720930 5.426871 21.261427 \n",
"59 59 51 190.431373 5.032414 13.742079 \n",
"60 60 46 175.304348 3.803982 15.948714 \n",
"\n",
" eccentricity extent feret_diameter_max equivalent_diameter_area \\\n",
"0 0.878900 0.586111 35.227830 23.179885 \n",
"1 0.825665 0.787879 21.377558 15.222667 \n",
"2 0.339934 0.874339 32.756679 29.010538 \n",
"3 0.339576 0.826087 26.925824 23.588253 \n",
"4 0.769317 0.863884 31.384710 24.618327 \n",
".. ... ... ... ... \n",
"56 0.618893 0.781481 18.973666 16.390654 \n",
"57 0.939361 0.722222 18.027756 9.965575 \n",
"58 0.966876 0.781818 22.000000 10.464158 \n",
"59 0.930534 0.728571 14.035669 8.058239 \n",
"60 0.971139 0.766667 15.033296 7.653040 \n",
"\n",
" bbox-0 bbox-1 bbox-2 bbox-3 \n",
"0 0 11 30 35 \n",
"1 0 53 11 74 \n",
"2 0 95 28 122 \n",
"3 0 144 23 167 \n",
"4 0 237 29 256 \n",
".. ... ... ... ... \n",
"56 232 39 250 54 \n",
"57 248 170 254 188 \n",
"58 249 117 254 139 \n",
"59 249 228 254 242 \n",
"60 250 67 254 82 \n",
"\n",
"[61 rows x 13 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_csv = pd.read_csv('../../data/blobs_statistics.csv')\n",
"df_csv"
]
},
{
"cell_type": "markdown",
"id": "01732b57-35d9-4b25-9c1b-d322487d2757",
"metadata": {},
"source": [
"Typically, we don't need all the information in these tables, so it makes sense to reduce the table. To do that, we print out the column names first."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "cc7d6cbe-6487-49a6-84b2-e837f7070f25",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Unnamed: 0', 'area', 'mean_intensity', 'minor_axis_length',\n",
" 'major_axis_length', 'eccentricity', 'extent', 'feret_diameter_max',\n",
" 'equivalent_diameter_area', 'bbox-0', 'bbox-1', 'bbox-2', 'bbox-3'],\n",
" dtype='object')"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_csv.keys()"
]
},
{
"cell_type": "markdown",
"id": "ff187a52-9fc0-4f6f-b143-f872dfe620c2",
"metadata": {},
"source": [
"We can then copy and paste the column names we're interested in and create a new data frame."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "b1f03533-e9d0-4880-af3f-c9766df56f29",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" area mean_intensity\n",
"0 422 192.379147\n",
"1 182 180.131868\n",
"2 661 205.216339\n",
"3 437 216.585812\n",
"4 476 212.302521\n",
".. ... ...\n",
"56 211 185.061611\n",
"57 78 185.230769\n",
"58 86 183.720930\n",
"59 51 190.431373\n",
"60 46 175.304348\n",
"\n",
"[61 rows x 2 columns]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
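],
"source": [
"# .copy() creates an independent table, so adding columns later does not raise a SettingWithCopyWarning\n",
"df_analysis = df_csv[['area', 'mean_intensity']].copy()\n",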
"df_analysis"
]
},
{
"cell_type": "markdown",
"id": "64eb1086-ebc8-4905-afc2-ed0dc01620b9",
"metadata": {},
"source": [
"You can then access existing columns and compute new columns from them."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "402892eb-b1ea-4f11-b272-9c44207f7991",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" area mean_intensity total_intensity\n",
"0 422 192.379147 81184.0\n",
"1 182 180.131868 32784.0\n",
"2 661 205.216339 135648.0\n",
"3 437 216.585812 94648.0\n",
"4 476 212.302521 101056.0\n",
".. ... ... ...\n",
"56 211 185.061611 39048.0\n",
"57 78 185.230769 14448.0\n",
"58 86 183.720930 15800.0\n",
"59 51 190.431373 9712.0\n",
"60 46 175.304348 8064.0\n",
"\n",
"[61 rows x 3 columns]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']\n",
"df_analysis"
]
},
{
"cell_type": "markdown",
"id": "9db24255-2290-4e83-ac74-93d780378175",
"metadata": {},
"source": [
"## Exercise\n",
"For the loaded CSV file, create a table that only contains these columns:\n",
"* `minor_axis_length`\n",
"* `major_axis_length`\n",
"* `aspect_ratio`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "87f226cd-721b-43e3-a31a-faed5e8a6733",
"metadata": {},
"outputs": [],
"source": [
"df_shape = pd.read_csv('../../data/blobs_statistics.csv')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}