{ "cells": [ { "cell_type": "raw", "metadata": {}, "source": [ "---\n", "title: \"What is the Difference Between axis=0 and axis=1 When Working with Pandas Dataframes?\"\n", "author: \"Andrew Bancroft\"\n", "date: 2019-05-22\n", "description: \"Provides an example scenario to help decipher the difference between axis=0 and axis=1 when working with Pandas dataframes\"\n", "type: technical_note\n", "draft: false\n", "comments: true\n", "wip: false\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes, functions ask you to specify an `axis`. The documentation can often feel vague and/or technical.\n", "\n", "For instance, here's a quote from the `apply` function's documentation:\n", " \n", "> axis : {0 or ‘index’, 1 or ‘columns’}, default 0\n", "Axis along which the function is applied:\n", "0 or ‘index’: apply function to each column.\n", "1 or ‘columns’: apply function to each row.\n", "\n", "Uuuum... right.\n", "\n", "So what's the difference? Here's an example..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "
\n", " Resources\n", "
\n", "
\n", " Right-click -> Save as...\n", "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load a CSV file to play with " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prerequisites (if you want to practice)\n", "\n", "* Install the [Pandas](https://pandas.pydata.org/) library for your Python environment\n", "* Cells in this notebook expect the Car Sales.csv file to be in the same directory as your notebook\n", "* [Resources](#resources) to help you practice" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### First Things First" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DealershipNameRedCarsSilverCarsBlackCarsBlueCarsMonthSoldYearSold
0Clyde's Clunkers902.0650.0754.0792.01.02018.0
1Clyde's Clunkers710.0476.0518.0492.02.02018.0
2Clyde's Clunkers248.0912.0606.0350.03.02018.0
3Clyde's Clunkers782.0912.0858.0446.04.02018.0
4Clyde's Clunkers278.0354.0482.0752.05.02018.0
\n", "
" ], "text/plain": [ " DealershipName RedCars SilverCars BlackCars BlueCars MonthSold \\\n", "0 Clyde's Clunkers 902.0 650.0 754.0 792.0 1.0 \n", "1 Clyde's Clunkers 710.0 476.0 518.0 492.0 2.0 \n", "2 Clyde's Clunkers 248.0 912.0 606.0 350.0 3.0 \n", "3 Clyde's Clunkers 782.0 912.0 858.0 446.0 4.0 \n", "4 Clyde's Clunkers 278.0 354.0 482.0 752.0 5.0 \n", "\n", " YearSold \n", "0 2018.0 \n", "1 2018.0 \n", "2 2018.0 \n", "3 2018.0 \n", "4 2018.0 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Read the CSV file\n", "# This assumes \"Car Sales.csv\" is in the same directory as your notebook\n", "car_sales_data = pd.read_csv(\"Car Sales.csv\")\n", "\n", "# Show the first 5 rows\n", "first_five = car_sales_data.head(5)\n", "display(first_five)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The car sales data looks like it contains one row that summarizes the **total sales of each color** of car for a **given dealership**, for **each month** of the **year**. \n", "\n", "To state the \"grain\" of the data frame another way, the data frame contains **one row per dealership, month, year** combo and reports the total number of cars sold by color." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Choose Your Scenario\n", "\n", "Suppose that two people come to you and ask separate questions about **average car sales**.\n", "\n", "* Lucy asks, \"Can you calculate the average number of cars sold for each color?\"\n", "\n", "* Zack asks, \"Can you calculate the average number of cars sold (regardless of color) for each dealership in each month & year?\" (so basically the average of red, silver, black, and blue cars for each row)\n", "\n", "#### Start with Lucy. \n", "Think about what you'd do to answer Lucy's question by hand, manually, if you didn't have Pandas to do the work for you. Here's what I'd do:\n", "\n", "1. Start with the RedCars column. \n", "2. Add up 902, 710, 248, 782, 278, and so on.\n", "3. Divide that sum by the total number of values. Boom. RedCars average.\n", "4. Rinse and reapeat steps 1-3 for SilverCars, BlackCars, and BlueCars.\n", "\n", "**This is an axis=0 scenario** in Pandas." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RedCars 584.0\n", "SilverCars 660.8\n", "BlackCars 643.6\n", "BlueCars 566.4\n", "dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_five[['RedCars', 'SilverCars', 'BlackCars', 'BlueCars']].mean(axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### What about Zack? \n", "What would you do to answer his question by hand, without Pandas? How's this...\n", "\n", "1. Start with the first row of data (Row 0), since his question matches the \"grain\" of the data frame... one row per dealership per month & year.\n", "2. Add up the RedCars, SilverCars, BlackCars, and BlueCars values for Row 0 and divide by 4. So (902 + 650 + 754 + 792)/4\n", "3. Rinse and repeat steps 1 & 2 for every row in the data frame. Boom. Average cars sold by dealer/month/year.\n", "\n", "**This is an axis=1 scenario**." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0 774.5\n", "1 549.0\n", "2 529.0\n", "3 749.5\n", "4 466.5\n", "dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_five[['RedCars', 'SilverCars', 'BlackCars', 'BlueCars']].mean(axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summarizing the Findings\n", "Specifying an `axis` to a function in Pandas is helping answer one of the following questions:\n", "\n", "* Should I (Pandas) **start with a column** and make this function do its job *downward* on all the \"cells\" for that column, and then continue doing the same thing **for all the rest of the columns** in the data frame? (`axis=0`)\n", "\n", "*or*\n", "\n", "* Should I (Pandas) **start with the first row** of data in the data frame and make this function do its job *horizontally* on all of the \"cells\" for that row, and then continue doing the same thing **for all the rest of the rows** in the data frame? (`axis=1`)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }