{ "cells": [ { "cell_type": "markdown", "id": "99c606f8-037f-4258-81e7-a9f4ac511242", "metadata": {}, "source": [ "# Introduction to working with DataFrames\n", "In basic python, we often use dictionaries containing our measurements as vectors. While these basic structures are handy for collecting data, they are suboptimal for further data processing. For that we introduce [panda DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) which are more handy in the next steps. In Python, scientists often call tables \"DataFrames\". " ] }, { "cell_type": "code", "execution_count": 1, "id": "0cfceb6c-1acc-4632-b084-8b0871a7c50a", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "id": "8b77888b-c9a8-4a67-a4eb-f7df46eda970", "metadata": {}, "source": [ "## Creating DataFrames from a dictionary of lists\n", "Assume we did some image processing and have some results in available in a dictionary that contains lists of numbers:" ] }, { "cell_type": "code", "execution_count": 2, "id": "ff80484f-657b-4231-8d8f-cdc26577542b", "metadata": {}, "outputs": [], "source": [ "measurements = {\n", " \"labels\": [1, 2, 3],\n", " \"area\": [45, 23, 68],\n", " \"minor_axis\": [2, 4, 4],\n", " \"major_axis\": [3, 4, 5],\n", "}" ] }, { "cell_type": "markdown", "id": "b2afa6a9-e15c-4147-bdd4-ec4d4f87fb36", "metadata": {}, "source": [ "This data structure can be nicely visualized using a DataFrame:" ] }, { "cell_type": "code", "execution_count": 3, "id": "8bf4e4b5-ef72-4f63-84d2-48cc3a77c297", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelsareaminor_axismajor_axis
014523
122344
236845
\n", "
" ], "text/plain": [ " labels area minor_axis major_axis\n", "0 1 45 2 3\n", "1 2 23 4 4\n", "2 3 68 4 5" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(measurements)\n", "df" ] }, { "cell_type": "markdown", "id": "930c082b-8f16-4711-b3e0-e56a7ec6d272", "metadata": {}, "source": [ "Using these DataFrames, data modification is straighforward. For example one can append a new column and compute its values from existing columns:" ] }, { "cell_type": "code", "execution_count": 4, "id": "a34866ff-a2cb-4a7c-a4e8-4544559b634c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelsareaminor_axismajor_axisaspect_ratio
0145231.50
1223441.00
2368451.25
\n", "
" ], "text/plain": [ " labels area minor_axis major_axis aspect_ratio\n", "0 1 45 2 3 1.50\n", "1 2 23 4 4 1.00\n", "2 3 68 4 5 1.25" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"aspect_ratio\"] = df[\"major_axis\"] / df[\"minor_axis\"]\n", "df" ] }, { "cell_type": "markdown", "id": "201a2142-22c7-4607-bc2d-f1dfce4c7e26", "metadata": {}, "source": [ "## Saving data frames\n", "We can also save this table for continuing to work with it." ] }, { "cell_type": "code", "execution_count": 5, "id": "fb01d2d9-4d8b-4b6a-b158-9516a581e000", "metadata": {}, "outputs": [], "source": [ "df.to_csv(\"../../data/short_table.csv\")" ] }, { "cell_type": "markdown", "id": "0240857d-292f-4ac3-ba87-8878aa941cde", "metadata": {}, "source": [ "## Creating DataFrames from lists of lists\n", "Sometimes, we are confronted to data in form of lists of lists. To make pandas understand that form of data correctly, we also need to provide the headers in the same order as the lists" ] }, { "cell_type": "code", "execution_count": 6, "id": "c72a82b1-4da6-468d-afa6-149cb00f7d37", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
labels123
area452368
minor_axis244
major_axis345
\n", "
" ], "text/plain": [ " 0 1 2\n", "labels 1 2 3\n", "area 45 23 68\n", "minor_axis 2 4 4\n", "major_axis 3 4 5" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "header = ['labels', 'area', 'minor_axis', 'major_axis']\n", "\n", "data = [\n", " [1, 2, 3],\n", " [45, 23, 68],\n", " [2, 4, 4],\n", " [3, 4, 5],\n", "]\n", " \n", "# convert the data and header arrays in a pandas data frame\n", "data_frame = pd.DataFrame(data, header)\n", "\n", "# show it\n", "data_frame" ] }, { "cell_type": "markdown", "id": "a8b1b6b0-027c-4536-8710-e3f87aca1896", "metadata": {}, "source": [ "As you can see, this tabls is _rotated_. We can bring it in the usual form like this:" ] }, { "cell_type": "code", "execution_count": 7, "id": "40669e82-4264-4883-9c4e-8a366b061610", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelsareaminor_axismajor_axis
014523
122344
236845
\n", "
" ], "text/plain": [ " labels area minor_axis major_axis\n", "0 1 45 2 3\n", "1 2 23 4 4\n", "2 3 68 4 5" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rotate/flip it\n", "data_frame = data_frame.transpose()\n", "\n", "# show it\n", "data_frame" ] }, { "cell_type": "markdown", "id": "ccf08662-fccf-4dc1-91c2-3365fa85a96b", "metadata": {}, "source": [ "## Loading data frames\n", "Tables can also be read from CSV files." ] }, { "cell_type": "code", "execution_count": 8, "id": "aa7c74db-68ab-4004-aa5e-01ba1ad88c79", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0areamean_intensityminor_axis_lengthmajor_axis_lengtheccentricityextentferet_diameter_maxequivalent_diameter_areabbox-0bbox-1bbox-2bbox-3
00422192.37914716.48855034.5667890.8789000.58611135.22783023.1798850113035
11182180.13186811.73607420.8026970.8256650.78787921.37755815.2226670531174
22661205.21633928.40950230.2084330.3399340.87433932.75667929.01053809528122
33437216.58581223.14399624.6061300.3395760.82608726.92582423.588253014423167
44476212.30252119.85288231.0751060.7693170.86388431.38471024.618327023729256
..........................................
5656211185.06161114.52276218.4891380.6188930.78148118.97366616.3906542323925054
575778185.2307696.02863817.5797990.9393610.72222218.0277569.965575248170254188
585886183.7209305.42687121.2614270.9668760.78181822.00000010.464158249117254139
595951190.4313735.03241413.7420790.9305340.72857114.0356698.058239249228254242
606046175.3043483.80398215.9487140.9711390.76666715.0332967.6530402506725482
\n", "

61 rows × 13 columns

\n", "
" ], "text/plain": [ " Unnamed: 0 area mean_intensity minor_axis_length major_axis_length \\\n", "0 0 422 192.379147 16.488550 34.566789 \n", "1 1 182 180.131868 11.736074 20.802697 \n", "2 2 661 205.216339 28.409502 30.208433 \n", "3 3 437 216.585812 23.143996 24.606130 \n", "4 4 476 212.302521 19.852882 31.075106 \n", ".. ... ... ... ... ... \n", "56 56 211 185.061611 14.522762 18.489138 \n", "57 57 78 185.230769 6.028638 17.579799 \n", "58 58 86 183.720930 5.426871 21.261427 \n", "59 59 51 190.431373 5.032414 13.742079 \n", "60 60 46 175.304348 3.803982 15.948714 \n", "\n", " eccentricity extent feret_diameter_max equivalent_diameter_area \\\n", "0 0.878900 0.586111 35.227830 23.179885 \n", "1 0.825665 0.787879 21.377558 15.222667 \n", "2 0.339934 0.874339 32.756679 29.010538 \n", "3 0.339576 0.826087 26.925824 23.588253 \n", "4 0.769317 0.863884 31.384710 24.618327 \n", ".. ... ... ... ... \n", "56 0.618893 0.781481 18.973666 16.390654 \n", "57 0.939361 0.722222 18.027756 9.965575 \n", "58 0.966876 0.781818 22.000000 10.464158 \n", "59 0.930534 0.728571 14.035669 8.058239 \n", "60 0.971139 0.766667 15.033296 7.653040 \n", "\n", " bbox-0 bbox-1 bbox-2 bbox-3 \n", "0 0 11 30 35 \n", "1 0 53 11 74 \n", "2 0 95 28 122 \n", "3 0 144 23 167 \n", "4 0 237 29 256 \n", ".. ... ... ... ... \n", "56 232 39 250 54 \n", "57 248 170 254 188 \n", "58 249 117 254 139 \n", "59 249 228 254 242 \n", "60 250 67 254 82 \n", "\n", "[61 rows x 13 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_csv = pd.read_csv('../../data/blobs_statistics.csv')\n", "df_csv" ] }, { "cell_type": "markdown", "id": "01732b57-35d9-4b25-9c1b-d322487d2757", "metadata": {}, "source": [ "Typically, we don't need all the information in these tables and thus, it makes sense to reduce the table. For that, we print out the column names first." ] }, { "cell_type": "code", "execution_count": 9, "id": "cc7d6cbe-6487-49a6-84b2-e837f7070f25", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Unnamed: 0', 'area', 'mean_intensity', 'minor_axis_length',\n", " 'major_axis_length', 'eccentricity', 'extent', 'feret_diameter_max',\n", " 'equivalent_diameter_area', 'bbox-0', 'bbox-1', 'bbox-2', 'bbox-3'],\n", " dtype='object')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_csv.keys()" ] }, { "cell_type": "markdown", "id": "ff187a52-9fc0-4f6f-b143-f872dfe620c2", "metadata": {}, "source": [ "We can then copy&paste the colum names we're interested in and create a new data frame." ] }, { "cell_type": "code", "execution_count": 10, "id": "b1f03533-e9d0-4880-af3f-c9766df56f29", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areamean_intensity
0422192.379147
1182180.131868
2661205.216339
3437216.585812
4476212.302521
.........
56211185.061611
5778185.230769
5886183.720930
5951190.431373
6046175.304348
\n", "

61 rows × 2 columns

\n", "
" ], "text/plain": [ " area mean_intensity\n", "0 422 192.379147\n", "1 182 180.131868\n", "2 661 205.216339\n", "3 437 216.585812\n", "4 476 212.302521\n", ".. ... ...\n", "56 211 185.061611\n", "57 78 185.230769\n", "58 86 183.720930\n", "59 51 190.431373\n", "60 46 175.304348\n", "\n", "[61 rows x 2 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_analysis = df_csv[['area', 'mean_intensity']]\n", "df_analysis" ] }, { "cell_type": "markdown", "id": "64eb1086-ebc8-4905-afc2-ed0dc01620b9", "metadata": {}, "source": [ "You can then access columns and add new columns." ] }, { "cell_type": "code", "execution_count": 18, "id": "402892eb-b1ea-4f11-b272-9c44207f7991", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\rober\\AppData\\Local\\Temp/ipykernel_20588/206920941.py:1: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areamean_intensitytotal_intensity
0422192.37914781184.0
1182180.13186832784.0
2661205.216339135648.0
3437216.58581294648.0
4476212.302521101056.0
............
56211185.06161139048.0
5778185.23076914448.0
5886183.72093015800.0
5951190.4313739712.0
6046175.3043488064.0
\n", "

61 rows × 3 columns

\n", "
" ], "text/plain": [ " area mean_intensity total_intensity\n", "0 422 192.379147 81184.0\n", "1 182 180.131868 32784.0\n", "2 661 205.216339 135648.0\n", "3 437 216.585812 94648.0\n", "4 476 212.302521 101056.0\n", ".. ... ... ...\n", "56 211 185.061611 39048.0\n", "57 78 185.230769 14448.0\n", "58 86 183.720930 15800.0\n", "59 51 190.431373 9712.0\n", "60 46 175.304348 8064.0\n", "\n", "[61 rows x 3 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']\n", "df_analysis" ] }, { "cell_type": "markdown", "id": "9db24255-2290-4e83-ac74-93d780378175", "metadata": {}, "source": [ "## Exercise\n", "For the loaded CSV file, create a table that only contains these columns:\n", "* `minor_axis_length`\n", "* `major_axis_length`\n", "* `aspect_ratio`" ] }, { "cell_type": "code", "execution_count": null, "id": "87f226cd-721b-43e3-a31a-faed5e8a6733", "metadata": {}, "outputs": [], "source": [ "df_shape = pd.read_csv('../../data/blobs_statistics.csv')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }