{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "df = pd.read_csv('data/Consumo_cerveja.csv', \n", " decimal=',', \n", " thousands='.', \n", " header=0, \n", " names=['date','median_temp','min_temp','max_temp','precip','weekend','consumption'], \n", " parse_dates=['date'], \n", " nrows=365)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's finish up by talking about a number of useful functions for analysis before we put it all together!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Mapping\n", "Often we have some data that we want to replace with a better representation. For example,\n", "in our beer drinking data, I would assume that the seasons are important, but since winter actually overlaps 2 years (winter is December, January, February) there's no easy numeric way to divide the year up.\n", "\n", "Enter mapping - map is a great convenience function that lets you apply a function or dictionary on every row of data. To create our season mapping, let's set up a dictionary mapping label to number of month:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "season = {\n", " \"winter\": [12, 1, 2],\n", " \"spring\": [3, 4, 5],\n", " \"summer\": [6, 7, 8],\n", " \"autumn\": [9, 10,11]\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then invert the dictionary, so we have a nice representation mapping month number to season" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "season_map = {i: k\n", " for k, v in season.items()\n", " for i in v\n", " }\n", "season_map" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Something we haven't mentioned yet, but converting date to a datetime type gives us access to the special .dt namespace which has datetime specific functionality. In this example, we use the .month property to get the month of a given date" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.date.dt.month" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are ready to map our season dictionary onto our months" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['season'] = df.date.dt.month.map(season_map)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get some random samples to check that it worked as expected\n", "\n", "![Fun Fact](images/fun_fact.resized.jpeg) Use `.sample` instead of `.head` - you'll catch data errors from the middle of your dataset!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.sample(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use the `.value_counts` method to check that everything is as expected" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.season.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Binning\n", "\n", "Another very common operation is to want to assign data to a bin. \n", "For example, we might want to turn a regression problem of predicting consumption into a classification problem of low vs high consumption.\n", "\n", "Let's arbitrarily choose 25,000 as our cutoff for high consumption.\n", "Since pandas is based on Numpy, we can often use numpy functions when it suits us.\n", "\n", "There is nothing in pandas that does quite what np.where does, so I use it all the time for this type of operation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['high_consumption'] = np.where(df.consumption < 25000, 0, 1)\n", "df.sample(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.high_consumption.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often our usecase is a bit more complicated than a simple higher or lower. `pd.cut` gives us a lot more flexibility in setting our cutoff points. Let's let pandas figure out the best place to split the data, by specifying that we want 2 bins -this will find the point that divides our data into equal width bins" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.cut(df.consumption, bins=2).value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While descriptive, that's not very pretty nor easy to select - `pd.cut` also supports passing a list of labels, so let's do that." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.cut(df.consumption, bins=2, labels=['low', 'high']).value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Same result, but with nice labels for ease of indexing\n", "\n", "Of course we can also pass our own bins if we have irregular intervals" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.cut(df.consumption, bins=[0, 25000, 99999], labels=['low', 'high'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes you want your bins to be based on quantiles instead of arbitrary intervals - `pd.qcut` makes that easy. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['consumption_group'] = pd.qcut(df.consumption, q=3, labels=['low', 'medium', 'high'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can of course specify our quantiles explicitly" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "quartiles = pd.qcut(df.consumption, q=[0, 0.25, 0.75, 1], labels=['low', 'medium', 'high'])\n", "quartiles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Categoricals\n", "\n", "If you looked closely, you might have noticed that the dtype of the Series we got back from `.cut` is a Categorical. A Categorical is a new datatype introduced by Pandas, and so there is no Numpy equivalent.\n", "\n", "Categoricals are a neat way of expressing data that represents a category with multiple levels. In this example, we have 3:\n", "- low\n", "- medium\n", "- high\n", "\n", "They are also in order from low to high.\n", "\n", "The primary benefit from categoricals is saving space - Strings are expensive to store, as we have to store a python object inside each array - very inefficient. Categoricals store codes instead - integers that map to a string label. Just like with dates, where we can access special methods using the `.dt` namespace, we can use `.cat` to use categorical specific methods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "quartiles.cat.categories" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "quartiles.cat.codes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are storing 365 int8s and 3 strings instead of 365 strings - much nicer!\n", "\n", "We can also use categories to indicate data that should be there, but is not - this is great for machine learning production, where you have to encode a row of data into the correct number of categories" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "example = quartiles.cat.add_categories('example')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "example.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Get Dummies\n", "\n", "One common task in datascience, is to one-hot encode categorical columns. As this is also known as \"creating dummy variables\" pandas has a built in solution for that - `.get_dummies`. It takes your dataframe and one-hot-encodes all categorical columns it finds (usually string columns, but also categoricals). In combination with the categorical dtype from before, it's easy to encode all possible categories, even though they are not present in data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# There are no values labelled 'example' in the data\n", "pd.get_dummies(example)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also pass your full DataFrame to `.get_dummies` - it will generally do the right thing!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.get_dummies(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also specify which columns to encode, as well as passing the `drop_first` parameter if you're trying to avoid multicollinearity. \n", "\n", "Note that the `high_consumption` column is an int dtype - all columns passed to `columns` will be one-hot encoded, regardless of datatype." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.get_dummies(df, drop_first=True, columns=['season', 'high_consumption'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`.get_dummies` is a great convenience function that is very handy for modelling work!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Shifting and Diffing\n", "\n", "Sometimes, you want to compare running differences - what's the change between days or months? Pandas provides utility methods to do that in various forms" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Subtract the previous value\n", "df.consumption.diff()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Subtract the value from 30 days before\n", "df.consumption.diff(periods=30).dropna()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the percentage change compared to 7 days ago\n", "df.consumption.pct_change(periods=7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are convenience methods built around `.shift` - `.shift` lets you easily compare a row with another row" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Shift all columns one step\n", "df.shift(1).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how 25,461 was at index 0 previously and is now at index 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.consumption.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can be a great way to create a lagged feature set for time series modelling. If you want to predict next days consumption, I can simply create a dataframe of shifted periods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.concat([df.consumption.shift(i).rename(f't_{-i}') for i in range(5)], axis=1).head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A nice special case comes when using a Datetime Index like we did before. Then we can use `.shift`'s frequency argument and get some nice benefits" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.set_index('date')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We don't get any missing values, as we are simply incrementing the index by one period" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.shift(1).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can set different frequencies to shift by - for example, using 'M' \"rounds up\" to the nearest month end" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.shift(1, freq='M').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Stacking & Unstacking & MultiIndex\n", "Stacking and unstacking are nice tools to manipulate our data into the form that we want. We often use it when we end up with a `MultiIndex` - an index with multiple `levels`. The most frequent way to end up with a `MultiIndex` is to do a groupby on multiple columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Is there a difference in weekend consumption between seasons?\n", "weekend_season = df.groupby(['season', 'consumption_group']).max_temp.mean()\n", "weekend_season" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This Series has two `levels` in its index - `season` and `consumption_group`. How can I access these values?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This works\n", "weekend_season.loc['autumn']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This doesn't\n", "weekend_season.loc['low']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How would I get only the low consumption group?\n", "\n", "I can use tuples to dig into a level and even do slices" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get low consumption in autumn\n", "weekend_season.loc[('autumn', 'low')]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get all values between autumn, low to spring, medium\n", "weekend_season.loc[('autum', 'low'):('spring', 'medium')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if I want all seasons, but only the low consumption group?\n", "\n", "I could swap the levels in the index - this is a neat trick and can be useful sometimes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "weekend_season.swaplevel()#.loc['low']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The best way is to use the IndexSlice convenience slicer - then we can just use our familiar `:` syntax to indicate \"everything from this level\" and it works on as many levels as you might have!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The recommended method!\n", "weekend_season.loc[pd.IndexSlice[:, 'low']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.groupby(['season', 'consumption_group', 'weekend']).max_temp.mean()#.loc[pd.IndexSlice[:, 'low', :]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way to manipulate `MultiIndex` are using `.stack` and `.unstack`. Using these, you can \"pivot\" levels from an index into columns and viceversa - it's a common idiom to do a `.unstack` after a groupby!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# consumption_group is now our columns\n", "weekend_season.unstack()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The outermost level is now stacked\n", "weekend_season.unstack(level=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Stacking is not as common, as ending up with a `MultiIndex` on columns is not as common, but it does what one would expect - pivot from columns to index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "agg_groupby = df.groupby('season').agg({\"consumption\": ['mean', 'sum', 'std'],\n", " \"max_temp\": ['mean', 'sum', 'std']\n", " })\n", "agg_groupby" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The aggregations are now in the index\n", "agg_groupby.stack()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Pass the level parameter to stack the other level\n", "agg_groupby.stack(level=0)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 4 }