{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\"dsi\n", "\n", "# Pandas 101 - Series & DataFrames\n", "\n", "Get to know the basic operations on series and dataframes.\n", "\n", "For more details check the [User Guide](https://pandas.pydata.org/docs/user_guide/index.html) or the [API reference](https://pandas.pydata.org/docs/reference/index.html) from https://pandas.pydata.org\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "# Series\n", "\n", "\n", "### Creation\n", "\n", "A `Series` object can be created by `pd.Series()` with a list of values as parameter. You can get the values by the `values` attribute of the `Series`, and the index values by the `index.values` attribute. An index can be a number o a string and it can be passed by the named parameter `index` of the `pd.Series()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "series1 = pd.Series([1,2,3,4])\n", "print(\"series1 values: \" + str(series1.values))\n", "print(\"series1 indexes: \" + str(series1.index.values))\n", "\n", "series2 = pd.Series([1,2,3,4], index = ['a', 'b', 'c', 'd'])\n", "print(\"series2 values: \" + str(series2.values))\n", "print(\"series2 indexes: \" + str(series2.index.values))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "
\n", "Exercise: Create a Series \n", "\n", "Create a series of values with the five largest countries on earth with the name of the country as index\n", "\n", "
\n", "\n", "\n", "| Country | Size ($km^2$) |\n", "| --------------- | ------------ |\n", "| Russia | 17,098,246 |\n", "| Canada | 9,984,670 |\n", "| China | 9,596,961 |\n", "| United States | 9,525,067 |\n", "| Brazil | 8,515,767 |\n", "\n", " \n", "***\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting and Filtering Entries \n", "\n", "Single entries of a `Series` can be selected by the index or the index number.\n", "\n", "A filter can be used to select elements of the `Series` that match a given criteria. The result of the filtered `Series` are `Series` again." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "byIndex = series2['c']\n", "print(\"selecting by index: \"+ str(byIndex))\n", "\n", "byIndexNumber = series2[3]\n", "print(\"selecting by index number: \"+ str(byIndexNumber))\n", "\n", "byFilter = series2[series2 > 2]\n", "print(\"Filtering values:\\n\" + str(byFilter))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "
\n", "Exercise: Select & Filter a Series \n", " \n", "1. Select the value for Brazil\n", "\n", "2. Create a filter that select all countries that have more than $9,600,000~km^2$\n", "
\n", "\n", "\n", "***\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*** \n", "\n", "# DataFrame\n", "\n", "### Creation\n", "\n", "A `DataFrame` can be created in several ways. One of the simplest ways to create one from scratch is the pass a `dict` object to `pd.DataFrame()`. In Jupypter the `display()` function can be used to render a `DataFrame` as a pretty table. To print information about the metadata of your `DataFrame` you can use the `.info()` on your dataset. The get only the first or last rows you can use `.head()`and `tail()`, which is especially useful on large datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame({\n", " \"Column1\": [1, 4, 8, 7, 9],\n", " \"Column2\": ['a', 'b', 'c', 'd', 'f'],\n", " \"Column3\": [1.1, 23.5, 45.6, 32.1234, 89.453],\n", " \"Column4\": [True, False, True, False, True]\n", "})\n", "\n", "print(\"Essential Information on DataFrame:\\n\")\n", "df.info()\n", "\n", "print(\"\\n\\nFull DataFrame:\")\n", "display(df)\n", "\n", "print(\"\\n\\nHead of DataFrame:\")\n", "display(df.head(2))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "
\n", "Exercise: Create a DataFrame \n", "\n", "Create a DataFrame for the land and water size of the largest countries. What are the datatypes of your columns\n", " \n", "
\n", "\n", "| Country | Land | Water |\n", "| -------------- | ----------- | -------- |\n", "| Russia | 16,377,742 | 720,500 |\n", "| Canada | 9,093,507 | 891,163 |\n", "| China | 9,326,410 | 270,550 |\n", "| United States | 9,147,593 | 377,424 |\n", "| Brazil | 8,460,415 | 55,352 |\n", "\n", "***\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting data by column\n", "\n", "A single column an be selected by `df['column_name']` (the result is a `Series` object), multiple columns by `df[['column_name1', 'column_name2']]` (the result is a `DataFrame` object)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(df['Column1'])\n", "\n", "display(df[['Column2', 'Column3']])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "
\n", "Exercise: Select columns of DataFrame \n", "\n", "Use your country dataframe.\n", "1. Select all country names as a series \n", "2. Create a new DataFrame that contains only two columns, the country name and the size of the land. \n", "
\n", "\n", "***\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting data by row\n", "\n", "Every row has an index. This index can be used to select one or multiple rows with the `.loc` function. \n", "\n", "A filter can be used to select elements of the `DataFrame` that match a given criteria. The result of the filtered `DataFrame` are `DataFrame` again.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"A single row as Series object:\")\n", "print(df.loc[2])\n", "\n", "print(\"\\n\\nMultiple rows as DataFrame object:\")\n", "display(df.loc[[2, 4]])\n", "\n", "print(\"\\n\\nMultiple rows (as range) as DataFrame object: \\n\")\n", "display(df.loc[2:5])\n", "\n", "print(\"\\n\\nRows filtered (Column3 > 42):\")\n", "\n", "filtered = df[df['Column3'] < 42]\n", "display(filtered)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "
\n", "Exercise: Filter rows of DataFrame \n", "\n", "1. Filter your country dataframe to only display rows that have a water size of $500000~km^2$ \n", " \n", "2. Filter your country dataframe to only display rows that have a land size that more than 20 times larger than the water size \n", "
\n", "\n", "***\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding / Removing a Series to a DataFrame\n", "\n", "A `Series` can easily be added to a `DataFrame` as a column with the assignment operation. For each index in the `DataFrame` the value at the index in the `Series` is assigned. If an index is not found the cell will be empty.\n", "\n", "With the `.drop` function a column (axis=1) or a row (axis=0) can be removed from a `DataFrame`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['myserie'] = series1\n", "display(df)\n", "\n", "df.drop('myserie', axis=1, inplace=True)\n", "display(df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.set_index('Column3')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "
\n", "Exercise: Add a Series to a DataFrame \n", "\n", "1. Add your Series with the total size of the country as a column named 'Total' to your DataFrame.\n", " \n", " Note: Your DataFrame must have the same index as your Series. Use df.set_index('columnname') to transform a regular column to an index. (This can reverted by the .reset_index() function)\n", " \n", "2. Create a new column named 'LandAndWater' that contains the sum of Land and Water. Is it equal to the 'Total' column?\n", "
\n", "\n", "***\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 4 }