{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python 101\n", "## Part XI.\n", "---\n", "\n", "## More pandas are coming!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "BASE_URI = \"./data/\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 1. Read demographic data\n", "\n", "- Read data from a csv file" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "df = pd.read_csv(BASE_URI + 'population.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Select the columns we need" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "columns = ['Country Name'] + [str(year) for year in range(1960, 2013)]\n", "pop = df[columns].dropna()\n", "pop.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Rename `'Country Name'` to `'country'` and set it as index." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "pop = pop.rename(columns={'Country Name': 'country'}).set_index('country')\n", "pop.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Select a few countries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "countries = ['United Kingdom', 'Hungary', 'France', 'Germany']\n", "subpop = pop.loc[pop.index.isin(countries)]\n", "subpop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Transpose the dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "subpop = subpop.transpose()\n", "subpop.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Plot!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "subpop.plot.line();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 2. Read data about alcohol\n", "\n", "- Read the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "df = pd.read_csv(BASE_URI + 'alcohol.csv')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Select the columns, and rename them" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "columns = {\n", " 'Country': 'country',\n", " 'Year': 'year',\n", " 'Beverage Types': 'type',\n", " 'Display Value': 'alcohol'\n", "}\n", "alc = df[list(columns.keys())].rename(columns=columns).dropna()\n", "alc.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Select the same country subset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "subalc = alc.loc[alc['country'].isin(countries)]\n", "subalc.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We only care about the combined consumption:\n", "- filter the dataframe, select rows where `type` is `'All'`\n", "- remove the now defunct column `type`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "subalc = subalc.loc[subalc['type'] == 'All'].drop(columns=['type'])\n", "subalc.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Pivot the dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "subalc = subalc.pivot(index='year', columns='country', values='alcohol')\n", "subalc.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Plot!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "subalc.plot();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 3. Merge data\n", "\n", "- Check index types" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "print('subpop index type:', subpop.index.dtype)\n", "print('subalc index type:', subalc.index.dtype)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "subpop's index type is unicode, change it to integer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "subpop.index = subpop.index.astype('int')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Check index lengths" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "set(subpop.index.values).symmetric_difference(set(subalc.index.values))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove missing index values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "subpop = subpop.loc[(1989 < subpop.index) & (subpop.index < 2011)]\n", "subalc = subalc.loc[(1989 < subalc.index) & (subalc.index < 2011)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Join the two dataframe!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "merged = subpop.join(subalc, rsuffix='_alc')\n", "merged.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Plot data into separate coordinate systems" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "merged[['Hungary', 'Hungary_alc']].plot.line(subplots=True);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Compute the total actual alcohol consumption" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "for country in countries:\n", " merged[country + '_consumption'] = merged[country] * merged[country + '_alc']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- and plot it!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "merged[[c + '_consumption' for c in countries]].plot();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Let's do some...\n", "\n", "\n", "
\n", "\n", "### Act IV: Cool library of the week: ydata-profiling\n", "#### Generate detailed reports from pandas dataframes\n", "- import it\n", "- generate report\n", " \n", "Install with:\n", "```bash\n", "pip install ydata-profiling\n", "```\n", "\n", "in case of import error, try:\n", "```bash\n", "pip install numba -U\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import ydata_profiling" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "profile = ydata_profiling.ProfileReport(alc)\n", "profile" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Export it to a file:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "profile.to_file('./alc_profile.html')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Final Act: your play time!\n", "## It's your turn - write the missing code snippets!\n", "\n", "#### 1. Plot the top 5 alcohol consuming country in 1990 and their consumption" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. Compare the average alcohol consumption in France, Germany, UK and Hungary by plotting" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3. Compute the most spirit consuming country for each year " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4. How does the different type of alcohols changed over time in Hungary? Show it by plotting them!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5. Which countries combined consuption changed the most from 1990 to 2010? (top5)\n", "Hint: `np.abs()` computes the absolute value of a pandas series " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "szisz_python_2025_automn", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" } }, "nbformat": 4, "nbformat_minor": 1 }