{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data School: 21 more pandas tricks ([video](https://www.youtube.com/watch?v=tWFQqaRtSQA&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=36))\n", "\n", "**See also:** \"My top 25 pandas tricks\" [video](https://www.youtube.com/watch?v=RlIiVeig3hc&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=35) and [notebook](https://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/top_25_pandas_tricks.ipynb)\n", "\n", "- Watch the [complete pandas video series](https://www.dataschool.io/easier-data-analysis-with-pandas/)\n", "- Connect on [Twitter](https://twitter.com/justmarkham), [Facebook](https://www.facebook.com/DataScienceSchool/), and [LinkedIn](https://www.linkedin.com/in/justmarkham/)\n", "- Subscribe on [YouTube](https://www.youtube.com/dataschool?sub_confirmation=1)\n", "- Join the [email newsletter](https://www.dataschool.io/subscribe/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of contents\n", "\n", "1. Check for equality\n", "2. Check for equality (alternative)\n", "3. Use NumPy without importing NumPy\n", "4. Calculate memory usage\n", "5. Count the number of words in a column\n", "6. Convert one set of values to another\n", "7. Convert continuous data into categorical data (alternative)\n", "8. Create a cross-tabulation\n", "9. Create a datetime column from multiple columns\n", "10. Resample a datetime column\n", "11. Read and write from compressed files\n", "12. Fill missing values using interpolation\n", "13. Check for duplicate merge keys\n", "14. Transpose a wide DataFrame\n", "15. Create an example DataFrame (alternative)\n", "16. Identify rows that are missing from a DataFrame\n", "17. Use query to avoid intermediate variables\n", "18. Reshape a DataFrame from wide format to long format\n", "19. Reverse row order (alternative)\n", "20. Reverse column order (alternative)\n", "21. Split a string into multiple columns (alternative)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "drinks = pd.read_csv('http://bit.ly/drinksbycountry')\n", "stocks = pd.read_csv('http://bit.ly/smallstocks', parse_dates=['Date'])\n", "titanic = pd.read_csv('http://bit.ly/kaggletrain')\n", "ufo = pd.read_csv('http://bit.ly/uforeports', parse_dates=['Time'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Check for equality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create an example DataFrame:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b\n", "0 1.0 1.0\n", "1 2.0 2.0\n", "2 NaN NaN" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'a':[1, 2, np.nan], 'b':[1, 2, np.nan]})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do you ever have two DataFrame columns that look similar, and you want to know if they are actually identical?\n", "\n", "This is not a reliable method for checking:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 True\n", "2 False\n", "dtype: bool" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.a == df.b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You would think that would return 3 `True` values, but it actually returns `False` any time there is a missing value:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan == np.nan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead, you can check for equality using the `equals()` method:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.a.equals(df.b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, this is how you would check if two DataFrames are identical:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_new = df.copy()\n", "df_new.equals(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We made a copy of \"df\" and then used the DataFrame `equals()` method." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Check for equality (alternative)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create another example DataFrame:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " c d e\n", "0 1 1.0 1.000000\n", "1 2 2.0 2.000000\n", "2 3 3.0 3.000005" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'c':[1, 2, 3], 'd':[1.0, 2.0, 3.0], 'e':[1.0, 2.0, 3.000005]})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's important to note that the `equals()` method (shown in the first trick) requires identical data types in order to return `True`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.c.equals(df.d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This returned `False` because \"c\" is integer and \"d\" is float.\n", "\n", "For more flexibility in how the equality checking is done, use the `assert_series_equal()` function:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "pd.testing.assert_series_equal(df.c, df.d, check_names=False, check_dtype=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The assertion passed (thus no error was raised) because we specified that data type can be ignored.\n", "\n", "As well, you can check whether values are approximately equal, rather than identical:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "pd.testing.assert_series_equal(df.d, df.e, check_names=False, check_exact=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The assertion passed even though \"d\" and \"e\" have slightly different values.\n", "\n", "For checking DataFrames, there's a similar function called `assert_frame_equal()`:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "df_new = df.copy()\n", "pd.testing.assert_frame_equal(df, df_new)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Use NumPy without importing NumPy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although pandas is mostly a superset of NumPy's functionality, there are occasions on which you still have to import NumPy. One example is if you want to create a DataFrame of random values:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2 3\n", "0 0.548814 0.715189 0.602763 0.544883\n", "1 0.423655 0.645894 0.437587 0.891773" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(0)\n", "pd.DataFrame(np.random.rand(2, 4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, it turns out that you can actually access all of NumPy's functionality from within pandas, simply by typing `pd.np.` before the NumPy function name:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/zn/8p5wr_855bjbd9s6fvn6wd7w0000gn/T/ipykernel_39811/2633834213.py:2: FutureWarning: The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead.\n", " pd.np.random.seed(0)\n", "/var/folders/zn/8p5wr_855bjbd9s6fvn6wd7w0000gn/T/ipykernel_39811/2633834213.py:3: FutureWarning: The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead.\n", " pd.DataFrame(pd.np.random.rand(2, 4))\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2 3\n", "0 0.548814 0.715189 0.602763 0.544883\n", "1 0.423655 0.645894 0.437587 0.891773" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NOTE: pd.np is deprecated in pandas 1.0\n", "pd.np.random.seed(0)\n", "pd.DataFrame(pd.np.random.rand(2, 4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To be clear, this would have worked even if we had not explicitly imported NumPy at the start of the notebook.\n", "\n", "This could also be used to set a value as missing:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/zn/8p5wr_855bjbd9s6fvn6wd7w0000gn/T/ipykernel_39811/4135636971.py:1: FutureWarning: The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead.\n", " df.loc[0, 'e'] = pd.np.nan\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " c d e\n", "0 1 1.0 NaN\n", "1 2 2.0 2.000000\n", "2 3 3.0 3.000005" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NOTE: pd.np is deprecated in pandas 1.0\n", "df.loc[0, 'e'] = pd.np.nan\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That being said, I would still recommend following the convention of `import numpy as np` rather than using `pd.np` since that convention is so widespread." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Calculate memory usage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a DataFrame of UFO sightings:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City Colors Reported Shape Reported State \\\n", "0 Ithaca NaN TRIANGLE NY \n", "1 Willingboro NaN OTHER NJ \n", "2 Holyoke NaN OVAL CO \n", "3 Abilene NaN DISK KS \n", "4 New York Worlds Fair NaN LIGHT NY \n", "\n", " Time \n", "0 1930-06-01 22:00:00 \n", "1 1930-06-30 20:00:00 \n", "2 1931-02-15 14:00:00 \n", "3 1931-06-01 13:00:00 \n", "4 1933-04-18 19:00:00 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ufo.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can calculate the memory used by the entire DataFrame:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 18241 entries, 0 to 18240\n", "Data columns (total 5 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 City 18216 non-null object \n", " 1 Colors Reported 2882 non-null object \n", " 2 Shape Reported 15597 non-null object \n", " 3 State 18241 non-null object \n", " 4 Time 18241 non-null datetime64[ns]\n", "dtypes: datetime64[ns](1), object(4)\n", "memory usage: 4.0 MB\n" ] } ], "source": [ "ufo.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also calculate memory used by each column (in bytes):" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index 128\n", "City 1205618\n", "Colors Reported 671313\n", "Shape Reported 1065230\n", "State 1076219\n", "Time 145928\n", "dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ufo.memory_usage(deep=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This information might help you to decide how to optimize your DataFrame storage." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Count the number of words in a column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's count the values in this column from the \"ufo\" DataFrame:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RED 780\n", "GREEN 531\n", "ORANGE 528\n", "BLUE 450\n", "YELLOW 169\n", "RED GREEN 89\n", "RED BLUE 78\n", "RED ORANGE 44\n", "GREEN BLUE 34\n", "RED GREEN BLUE 33\n", "ORANGE YELLOW 26\n", "RED YELLOW 25\n", "ORANGE GREEN 23\n", "YELLOW GREEN 17\n", "ORANGE BLUE 10\n", "RED YELLOW GREEN 9\n", "YELLOW BLUE 6\n", "YELLOW GREEN BLUE 5\n", "ORANGE GREEN BLUE 5\n", "RED YELLOW GREEN BLUE 4\n", "RED ORANGE YELLOW 4\n", "RED YELLOW BLUE 3\n", "RED ORANGE GREEN 3\n", "RED ORANGE BLUE 3\n", "RED ORANGE YELLOW BLUE 1\n", "ORANGE YELLOW GREEN 1\n", "ORANGE YELLOW BLUE 1\n", "Name: Colors Reported, dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ufo['Colors Reported'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that many of the entries have mulitple colors. 
What if all we cared about was the number of colors, and not the colors themselves?\n", "\n", "We can count the colors by using a string method to count the number of spaces, and then add 1:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0 2458\n", "2.0 352\n", "3.0 67\n", "4.0 5\n", "Name: Colors Reported, dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(ufo['Colors Reported'].str.count(' ') + 1).value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Convert one set of values to another" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the Titanic dataset as an example, I'm going to highlight three different ways that you can convert one set of values to another.\n", "\n", "Let's start with the \"Sex\" column:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 male\n", "1 female\n", "2 female\n", "3 female\n", "4 male\n", "Name: Sex, dtype: object" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic.Sex.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two different values in this column. If you need to convert these values to 0 and 1, you can use the `map()` method and pass it a dictionary specifying how you want to map the values:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 1\n", "3 1\n", "4 0\n", "Name: Sex_num, dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic['Sex_num'] = titanic.Sex.map({'male':0, 'female':1})\n", "titanic.Sex_num.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we specified, \"male\" has become 0 and \"female\" has become 1.\n", "\n", "Next, let's look at the \"Embarked\" column:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 S\n", "1 C\n", "2 S\n", "3 S\n", "4 S\n", "5 Q\n", "6 S\n", "7 S\n", "8 S\n", "9 C\n", "Name: Embarked, dtype: object" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic.Embarked.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are three different values in this column: S, C, and Q. If you need to convert them to 0, 1, and 2, you could use the `map()` method, but the `factorize()` method is even easier:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 0\n", "3 0\n", "4 0\n", "5 2\n", "6 0\n", "7 0\n", "8 0\n", "9 1\n", "Name: Embarked_num, dtype: int64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic['Embarked_num'] = titanic.Embarked.factorize()[0]\n", "titanic.Embarked_num.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`factorize()` returns a tuple in which the first element contains the new values, which is why I had to use `[0]` to extract the values.\n", "\n", "You can see that \"S\" has become 0, \"C\" has become 1, and \"Q\" has become 2. 
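" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to sanity-check the encoding that `factorize()` chose, a quick cross-tabulation of the original and encoded columns works. This is a minimal sketch that relies on the \"Embarked_num\" column created above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# each original value should line up with exactly one encoded value\n", "pd.crosstab(titanic.Embarked, titanic.Embarked_num)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "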
It chose that mapping based on the order in which the values appear in the Series, and if you need to reference the mapping, it's stored in the second value in the tuple:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['S', 'C', 'Q'], dtype='object')" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic.Embarked.factorize()[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's look at the \"SibSp\" column:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 1\n", "2 0\n", "3 1\n", "4 0\n", "5 0\n", "6 0\n", "7 3\n", "8 0\n", "9 1\n", "Name: SibSp, dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic.SibSp.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say that you needed to keep the zeros as-is and convert all other values to one. You can express this as a condition, `SibSp > 0`, which will return a boolean Series that you can convert to integers using the `astype()` method:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 1\n", "2 0\n", "3 1\n", "4 0\n", "5 0\n", "6 0\n", "7 1\n", "8 0\n", "9 1\n", "Name: SibSp_binary, dtype: int64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic['SibSp_binary'] = (titanic.SibSp > 0).astype('int')\n", "titanic.SibSp_binary.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the only value greater than 1 has been converted to a 1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Convert continuous data into categorical data (alternative)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the main tricks video, I used the `cut()` function to convert the \"Age\" column from continuous to categorical data:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 young adult\n", "1 adult\n", "2 adult\n", "3 adult\n", "4 adult\n", "5 NaN\n", "6 adult\n", "7 child\n", "8 adult\n", "9 child\n", "Name: Age, dtype: category\n", "Categories (3, object): ['child' < 'young adult' < 'adult']" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.cut(titanic.Age, bins=[0, 18, 25, 99], labels=['child', 'young adult', 'adult']).head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When using `cut()`, we had to choose the edges of each bin. 
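" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an aside, `cut()` can also hand back the edges it used if you pass `retbins=True`. Here's a minimal sketch (the variable names are just illustrative):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# retbins=True makes cut() return a (categories, bin_edges) tuple\n", "categories, bin_edges = pd.cut(titanic.Age, bins=[0, 18, 25, 99],\n", "                               labels=['child', 'young adult', 'adult'], retbins=True)\n", "bin_edges" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "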
But if you want pandas to choose the bin edges for you, you can use the `qcut()` function instead:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 (0.419, 23.0]\n", "1 (34.0, 80.0]\n", "2 (23.0, 34.0]\n", "3 (34.0, 80.0]\n", "4 (34.0, 80.0]\n", "5 NaN\n", "6 (34.0, 80.0]\n", "7 (0.419, 23.0]\n", "8 (23.0, 34.0]\n", "9 (0.419, 23.0]\n", "Name: Age, dtype: category\n", "Categories (3, interval[float64, right]): [(0.419, 23.0] < (23.0, 34.0] < (34.0, 80.0]]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.qcut(titanic.Age, q=3).head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We told `qcut()` to create 3 bins, and it chose bin edges that would result in bins of approximately equal size:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.419, 23.0] 246\n", "(34.0, 80.0] 236\n", "(23.0, 34.0] 232\n", "Name: Age, dtype: int64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.qcut(titanic.Age, q=3).value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the three bins are ages 0 to 23, 23 to 34, and 34 to 80, and they all contain roughly the same number of observations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Create a cross-tabulation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes you just want to count the number of observations in each category. If you're interested in a single column, you would use the `value_counts()` method:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "male 577\n", "female 314\n", "Name: Sex, dtype: int64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic.Sex.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But if you want to count the number of observations that appear in each combination of categories, you would use the `crosstab()` function:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Pclass 1 2 3\n", "Sex \n", "female 94 76 144\n", "male 122 108 347" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab(titanic.Sex, titanic.Pclass)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just like a pivot table, you can include row and column totals by setting `margins=True`:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Pclass 1 2 3 All\n", "Sex \n", "female 94 76 144 314\n", "male 122 108 347 577\n", "All 216 184 491 891" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab(titanic.Sex, titanic.Pclass, margins=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In fact, you can actually create this same table using the `pivot_table()` method with `'count'` as the aggregation function:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Pclass 1 2 3 All\n", "Sex \n", "female 94 76 144 314\n", "male 122 108 347 577\n", "All 216 184 491 891" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic.pivot_table(index='Sex', columns='Pclass', values='Survived',\n", " aggfunc='count', margins=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9. Create a datetime column from multiple columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create an example DataFrame:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " month day year holiday\n", "0 12 25 2019 christmas\n", "1 11 28 2019 thanksgiving" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame([[12, 25, 2019, 'christmas'], [11, 28, 2019, 'thanksgiving']],\n", " columns=['month', 'day', 'year', 'holiday'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can create a new datetime column simply by passing the relevant columns to `pd.to_datetime()`:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " month day year holiday date\n", "0 12 25 2019 christmas 2019-12-25\n", "1 11 28 2019 thanksgiving 2019-11-28" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['date'] = pd.to_datetime(df[['month', 'day', 'year']])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The new date column has a datetime data type:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "month int64\n", "day int64\n", "year int64\n", "holiday object\n", "date datetime64[ns]\n", "dtype: object" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keep in mind that you must include month, day, and year columns at a minimum, but you can also include hour, minute, and second." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10. Resample a datetime column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the stocks dataset:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Date Close Volume Symbol\n", "0 2016-10-03 31.50 14070500 CSCO\n", "1 2016-10-03 112.52 21701800 AAPL\n", "2 2016-10-03 57.42 19189500 MSFT\n", "3 2016-10-04 113.00 29736800 AAPL\n", "4 2016-10-04 57.24 20085900 MSFT\n", "5 2016-10-04 31.35 18460400 CSCO\n", "6 2016-10-05 57.64 16726400 MSFT\n", "7 2016-10-05 31.59 11808600 CSCO\n", "8 2016-10-05 113.05 21453100 AAPL" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stocks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if you wanted to calculate the mean closing price by day across all stocks? Use the `resample()` method:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Date\n", "2016-10-03 67.146667\n", "2016-10-04 67.196667\n", "2016-10-05 67.426667\n", "Freq: D, Name: Close, dtype: float64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stocks.resample('D', on='Date').Close.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can think of resampling as a `groupby()` for datetime data, and in fact the structure of the command looks very similar to a `groupby()`. \"D\" specifies that the resampling frequency should be daily, and the \"on\" parameter specifics the column on which we're resampling.\n", "\n", "If the datetime column is the index, you can skip the `on` parameter. For example, let's give the ufo DataFrame a DatetimeIndex:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City Colors Reported Shape Reported State\n", "Time \n", "1930-06-01 22:00:00 Ithaca NaN TRIANGLE NY\n", "1930-06-30 20:00:00 Willingboro NaN OTHER NJ\n", "1931-02-15 14:00:00 Holyoke NaN OVAL CO\n", "1931-06-01 13:00:00 Abilene NaN DISK KS\n", "1933-04-18 19:00:00 New York Worlds Fair NaN LIGHT NY" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ufo = ufo.set_index('Time')\n", "ufo.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use `resample()` and it will automatically resample based on the index:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Time\n", "1996-12-31 851\n", "1997-12-31 1237\n", "1998-12-31 1743\n", "1999-12-31 2774\n", "2000-12-31 2635\n", "Freq: A-DEC, Name: State, dtype: int64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ufo.resample('Y').State.count().tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's the count of the number of UFO sightings by year.\n", "\n", "We can calculate the count by month by changing the resampling frequency from \"Y\" to \"M\":" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Time\n", "2000-08-31 250\n", "2000-09-30 257\n", "2000-10-31 278\n", "2000-11-30 200\n", "2000-12-31 192\n", "Freq: M, Name: State, dtype: int64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ufo.resample('M').State.count().tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The string that you pass to `resample()` is known as the offset alias, and pandas supports [many offset aliases](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases) other than just \"D\", \"M\", and \"Y\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 11. 
Read and write from compressed files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you want to save a DataFrame to a CSV file, you use the `to_csv()` method:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "ufo.to_csv('ufo.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, you can actually compress the CSV file as well:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "ufo.to_csv('ufo.csv.zip')\n", "ufo.to_csv('ufo.csv.gz')\n", "ufo.to_csv('ufo.csv.bz2')\n", "ufo.to_csv('ufo.csv.xz')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By using one of these file extensions, pandas infers the type of compression you want it to use.\n", "\n", "You can use a shell command to see all of the files we've created:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-r--r-- 1 kevin staff 748033 Apr 27 14:27 ufo.csv\r\n", "-rw-r--r-- 1 kevin staff 129189 Apr 27 14:27 ufo.csv.bz2\r\n", "-rw-r--r-- 1 kevin staff 198030 Apr 27 14:27 ufo.csv.gz\r\n", "-rw-r--r-- 1 kevin staff 149320 Apr 27 14:27 ufo.csv.xz\r\n", "-rw-r--r-- 1 kevin staff 200317 Apr 27 14:27 ufo.csv.zip\r\n" ] } ], "source": [ "!ls -l ufo.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that all of the compressed files are significantly smaller than the uncompressed CSV file.\n", "\n", "Finally, you can actually read directly from a compressed file using `read_csv()`:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " City Colors Reported Shape Reported State\n", "Time \n", "1930-06-01 22:00:00 Ithaca NaN TRIANGLE NY\n", "1930-06-30 20:00:00 Willingboro NaN OTHER NJ\n", "1931-02-15 14:00:00 Holyoke NaN OVAL CO\n", "1931-06-01 13:00:00 Abilene NaN DISK KS\n", "1933-04-18 19:00:00 New York Worlds Fair NaN LIGHT NY" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ufo_new = pd.read_csv('ufo.csv.gz', index_col='Time', parse_dates=['Time'])\n", "ufo_new.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can confirm that the new ufo DataFrame is equivalent to the original ufo DataFrame:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ufo_new.equals(ufo)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 12. Fill missing values using interpolation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create an example time series DataFrame with some missing values:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b\n", "2019-01-01 100.0 9.0\n", "2019-02-01 120.0 9.0\n", "2019-03-01 130.0 NaN\n", "2019-04-01 NaN 7.5\n", "2019-05-01 140.0 6.5" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'a':[100, 120, 130, np.nan, 140], 'b':[9, 9, np.nan, 7.5, 6.5]})\n", "df.index = pd.to_datetime(['2019-01', '2019-02', '2019-03', '2019-04', '2019-05'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If appropriate, you can fill in the missing values using interpolation:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b\n", "2019-01-01 100.0 9.00\n", "2019-02-01 120.0 9.00\n", "2019-03-01 130.0 8.25\n", "2019-04-01 135.0 7.50\n", "2019-05-01 140.0 6.50" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.interpolate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This uses linear interpolation by default, though other methods are supported." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 13. Check for duplicate merge keys" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create two example DataFrames:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " color num\n", "0 green 1\n", "1 yellow 2\n", "2 red 3" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left = pd.DataFrame({'color': ['green', 'yellow', 'red'], 'num':[1, 2, 3]})\n", "left" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " color size\n", "0 green S\n", "1 yellow M\n", "2 pink L\n", "3 green XL" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "right = pd.DataFrame({'color': ['green', 'yellow', 'pink', 'green'], 'size':['S', 'M', 'L', 'XL']})\n", "right" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to merge these DataFrames.\n", "\n", "What if we wanted to confirm that the merge keys (\"color\" in this case) are unique in the left dataset? We would use \"one-to-many\" validation:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " color num size\n", "0 green 1 S\n", "1 green 1 XL\n", "2 yellow 2 M" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(left, right, how='inner', validate='one_to_many')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It did the merge, and validated that the values of \"color\" in the left dataset are unique.\n", "\n", "What if we wanted to confirm that the merge keys are unique in the right dataset? We would use \"many-to-one\" validation:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "# pd.merge(left, right, how='inner', validate='many_to_one')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This resulted in an error, because the values of \"color\" in the right dataset are not unique." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 14. Transpose a wide DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create an example DataFrame with 200 rows and 25 columns:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(np.random.rand(200, 25), columns=list('ABCDEFGHIJKLMNOPQRSTUVWXY'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you wanted to get a sense of the data by examining the head, you wouldn't see all of the columns due to the default display options:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " A B C D E F G \\\n", "0 0.963663 0.383442 0.791725 0.528895 0.568045 0.925597 0.071036 \n", "1 0.568434 0.018790 0.617635 0.612096 0.616934 0.943748 0.681820 \n", "2 0.466311 0.244426 0.158970 0.110375 0.656330 0.138183 0.196582 \n", "3 0.692472 0.566601 0.265389 0.523248 0.093941 0.575946 0.929296 \n", "4 0.223082 0.952749 0.447125 0.846409 0.699479 0.297437 0.813798 \n", "\n", " H I J ... P Q R S \\\n", "0 0.087129 0.020218 0.832620 ... 0.780529 0.118274 0.639921 0.143353 \n", "1 0.359508 0.437032 0.697631 ... 0.315428 0.363711 0.570197 0.438602 \n", "2 0.368725 0.820993 0.097101 ... 0.604846 0.739264 0.039188 0.282807 \n", "3 0.318569 0.667410 0.131798 ... 0.828940 0.004695 0.677817 0.270008 \n", "4 0.396506 0.881103 0.581273 ... 0.643990 0.423855 0.606393 0.019193 \n", "\n", " T U V W X Y \n", "0 0.944669 0.521848 0.414662 0.264556 0.774234 0.456150 \n", "1 0.988374 0.102045 0.208877 0.161310 0.653108 0.253292 \n", "2 0.120197 0.296140 0.118728 0.317983 0.414263 0.064147 \n", "3 0.735194 0.962189 0.248753 0.576157 0.592042 0.572252 \n", "4 0.301575 0.660174 0.290078 0.618015 0.428769 0.135474 \n", "\n", "[5 rows x 25 columns]" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The easiest solution is just to transpose the head:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2 3 4\n", "A 0.963663 0.568434 0.466311 0.692472 0.223082\n", "B 0.383442 0.018790 0.244426 0.566601 0.952749\n", "C 0.791725 0.617635 0.158970 0.265389 0.447125\n", "D 0.528895 0.612096 0.110375 0.523248 0.846409\n", "E 0.568045 0.616934 0.656330 0.093941 0.699479\n", "F 0.925597 0.943748 0.138183 0.575946 0.297437\n", "G 0.071036 0.681820 0.196582 0.929296 0.813798\n", "H 0.087129 0.359508 0.368725 0.318569 0.396506\n", "I 0.020218 0.437032 0.820993 0.667410 0.881103\n", "J 0.832620 0.697631 0.097101 0.131798 0.581273\n", "K 0.778157 0.060225 0.837945 0.716327 0.881735\n", "L 0.870012 0.666767 0.096098 0.289406 0.692532\n", "M 0.978618 0.670638 0.976459 0.183191 0.725254\n", "N 0.799159 0.210383 0.468651 0.586513 0.501324\n", "O 0.461479 0.128926 0.976761 0.020108 0.956084\n", "P 0.780529 0.315428 0.604846 0.828940 0.643990\n", "Q 0.118274 0.363711 0.739264 0.004695 0.423855\n", "R 0.639921 0.570197 0.039188 0.677817 0.606393\n", "S 0.143353 0.438602 0.282807 0.270008 0.019193\n", "T 0.944669 0.988374 0.120197 0.735194 0.301575\n", "U 0.521848 0.102045 0.296140 0.962189 0.660174\n", "V 0.414662 0.208877 0.118728 0.248753 0.290078\n", "W 0.264556 0.161310 0.317983 0.576157 0.618015\n", "X 0.774234 0.653108 0.414263 0.592042 0.428769\n", "Y 0.456150 0.253292 0.064147 0.572252 0.135474" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the columns have become the rows (and vice versa), we can now easily browse through the DataFrame's head.\n", "\n", "Transposing is also helpful when using the `describe()` method on a wide DataFrame:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " count mean std min 25% 50% 75% max\n", "A 200.0 0.492993 0.297821 0.000367 0.232334 0.473208 0.752013 0.995733\n", "B 200.0 0.511242 0.296146 0.004655 0.244792 0.502961 0.778799 0.995713\n", "C 200.0 0.493275 0.265398 0.003866 0.271378 0.507585 0.699023 0.993405\n", "D 200.0 0.510597 0.279209 0.005206 0.273734 0.539189 0.736733 0.999949\n", "E 200.0 0.488265 0.286736 0.003710 0.264644 0.469948 0.729416 0.992820\n", "F 200.0 0.486180 0.300410 0.011355 0.248337 0.455500 0.751475 0.999931\n", "G 200.0 0.487767 0.285290 0.018173 0.219824 0.496847 0.743175 0.980979\n", "H 200.0 0.469514 0.288366 0.002703 0.217834 0.439801 0.712790 0.985155\n", "I 200.0 0.475042 0.284267 0.001962 0.229787 0.455727 0.710170 0.996100\n", "J 200.0 0.486357 0.292939 0.003860 0.242252 0.451637 0.734957 0.983426\n", "K 200.0 0.526301 0.297170 0.005495 0.255519 0.559120 0.798100 0.994496\n", "L 200.0 0.508508 0.284417 0.005939 0.275634 0.503145 0.741509 0.999278\n", "M 200.0 0.531511 0.283377 0.000664 0.312775 0.548991 0.765784 0.997962\n", "N 200.0 0.492445 0.283717 0.000546 0.253779 0.477410 0.725116 0.990345\n", "O 200.0 0.496102 0.289496 0.009060 0.259039 0.468176 0.725585 0.995830\n", "P 200.0 0.515763 0.302013 0.004475 0.247396 0.504664 0.806546 0.997354\n", "Q 200.0 0.499350 0.296964 0.000903 0.249951 0.524557 0.747713 0.999964\n", "R 200.0 0.506541 0.283706 0.008187 0.258631 0.480919 0.786813 0.997994\n", "S 200.0 0.509491 0.292244 0.006238 0.276656 0.500628 0.745958 0.998355\n", "T 200.0 0.489958 0.276607 0.005052 0.273893 0.469294 0.733976 0.998023\n", "U 200.0 0.502008 0.295426 0.005510 0.252521 0.513378 0.749146 0.997858\n", "V 200.0 0.533049 0.298239 0.002084 0.274829 0.553971 0.806714 0.997046\n", "W 200.0 0.461300 0.294328 0.000074 0.187038 0.429369 0.709391 0.975735\n", "X 200.0 0.470272 0.291905 0.000072 0.219438 0.453254 0.717279 0.995813\n", "Y 200.0 0.468069 0.284750 0.001383 0.244620 0.436273 0.706354 0.998199" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 15. Create an example DataFrame (alternative)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are the methods that I taught in the main tricks video for creating example DataFrames:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col onecol two
0100300
1200400
\n", "
" ], "text/plain": [ " col one col two\n", "0 100 300\n", "1 200 400" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame({'col one':[100, 200], 'col two':[300, 400]})" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcdefgh
00.2785100.2880270.8463050.7912840.5786360.2885890.3188780.592218
10.7398670.3840980.5095620.8880330.6497910.5355500.0712220.176015
20.2009920.6231480.1081130.0289950.3603510.7188590.6932490.792670
30.6962480.6132860.4861620.2084980.5685480.6366250.1237430.565147
\n", "
" ], "text/plain": [ " a b c d e f g \\\n", "0 0.278510 0.288027 0.846305 0.791284 0.578636 0.288589 0.318878 \n", "1 0.739867 0.384098 0.509562 0.888033 0.649791 0.535550 0.071222 \n", "2 0.200992 0.623148 0.108113 0.028995 0.360351 0.718859 0.693249 \n", "3 0.696248 0.613286 0.486162 0.208498 0.568548 0.636625 0.123743 \n", "\n", " h \n", "0 0.592218 \n", "1 0.176015 \n", "2 0.792670 \n", "3 0.565147 " ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want an even simpler method, you can use `makeDataFrame()` to create a 30x4 DataFrame filled with random values:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/kevin/miniconda3/envs/pd14/lib/python3.9/site-packages/pandas/util/__init__.py:15: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.\n", " import pandas.util.testing\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
EmWXi3UyK3-0.250656-0.040365-0.8292030.296782
jgi4rj56bK0.919812-1.527617-0.670679-0.270356
a6Po2BAIfh2.445794-0.057607-0.3153850.736752
Ys4pDMq9Yd1.205698-2.4900900.4428201.157280
XfLAIHy1QI0.920549-0.026303-1.052564-0.658804
\n", "
" ], "text/plain": [ " A B C D\n", "EmWXi3UyK3 -0.250656 -0.040365 -0.829203 0.296782\n", "jgi4rj56bK 0.919812 -1.527617 -0.670679 -0.270356\n", "a6Po2BAIfh 2.445794 -0.057607 -0.315385 0.736752\n", "Ys4pDMq9Yd 1.205698 -2.490090 0.442820 1.157280\n", "XfLAIHy1QI 0.920549 -0.026303 -1.052564 -0.658804" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NOTE: pd.util.testing is deprecated in pandas 1.0\n", "pd.util.testing.makeDataFrame().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`makeMissingDataframe()` is similar, except that some of the values are missing:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
8JAZ6EeRof0.0003410.5370350.656464-0.941815
GaUe9v7uYV-0.062205-0.2868350.065851-0.702835
No3fmW5dDH-0.3173491.634406-0.873631-1.378837
W5LsjKWdWfNaN0.571227-1.0577542.925250
8ijrQMVxS0-0.211620-0.789954-0.8972370.571752
\n", "
" ], "text/plain": [ " A B C D\n", "8JAZ6EeRof 0.000341 0.537035 0.656464 -0.941815\n", "GaUe9v7uYV -0.062205 -0.286835 0.065851 -0.702835\n", "No3fmW5dDH -0.317349 1.634406 -0.873631 -1.378837\n", "W5LsjKWdWf NaN 0.571227 -1.057754 2.925250\n", "8ijrQMVxS0 -0.211620 -0.789954 -0.897237 0.571752" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NOTE: pd.util.testing is deprecated in pandas 1.0\n", "pd.util.testing.makeMissingDataframe().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`makeTimeDataFrame()` is similar, except it creates a DatetimeIndex:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
2000-01-030.470549-0.9811021.3583820.064058
2000-01-04-0.2358040.1651830.060856-1.063780
2000-01-050.678789-0.0166462.005891-0.079188
2000-01-061.1536180.1409330.081658-0.527496
2000-01-07-0.554733-0.725172-0.0792090.460215
\n", "
" ], "text/plain": [ " A B C D\n", "2000-01-03 0.470549 -0.981102 1.358382 0.064058\n", "2000-01-04 -0.235804 0.165183 0.060856 -1.063780\n", "2000-01-05 0.678789 -0.016646 2.005891 -0.079188\n", "2000-01-06 1.153618 0.140933 0.081658 -0.527496\n", "2000-01-07 -0.554733 -0.725172 -0.079209 0.460215" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NOTE: pd.util.testing is deprecated in pandas 1.0\n", "pd.util.testing.makeTimeDataFrame().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, `makeMixedDataFrame()` creates this exact 5x4 DataFrame:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
00.00.0foo12009-01-01
11.01.0foo22009-01-02
22.00.0foo32009-01-05
33.01.0foo42009-01-06
44.00.0foo52009-01-07
\n", "
" ], "text/plain": [ " A B C D\n", "0 0.0 0.0 foo1 2009-01-01\n", "1 1.0 1.0 foo2 2009-01-02\n", "2 2.0 0.0 foo3 2009-01-05\n", "3 3.0 1.0 foo4 2009-01-06\n", "4 4.0 0.0 foo5 2009-01-07" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NOTE: pd.util.testing is deprecated in pandas 1.0\n", "pd.util.testing.makeMixedDataFrame()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It has 2 float columns, 1 object column, and 1 datetime column.\n", "\n", "There are many other similar functions that you can use:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['makeBoolIndex',\n", " 'makeCategoricalIndex',\n", " 'makeCustomDataframe',\n", " 'makeCustomIndex',\n", " 'makeDataFrame',\n", " 'makeDateIndex',\n", " 'makeFloatIndex',\n", " 'makeFloatSeries',\n", " 'makeIntIndex',\n", " 'makeIntervalIndex',\n", " 'makeMissingDataframe',\n", " 'makeMixedDataFrame',\n", " 'makeMultiIndex',\n", " 'makeNumericIndex',\n", " 'makeObjectSeries',\n", " 'makePeriodFrame',\n", " 'makePeriodIndex',\n", " 'makePeriodSeries',\n", " 'makeRangeIndex',\n", " 'makeStringIndex',\n", " 'makeStringSeries',\n", " 'makeTimeDataFrame',\n", " 'makeTimeSeries',\n", " 'makeTimedeltaIndex',\n", " 'makeUIntIndex',\n", " 'makeUnicodeIndex']" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NOTE: pd.util.testing is deprecated in pandas 1.0\n", "[x for x in dir(pd.util.testing) if x.startswith('make')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, keep in mind that most of these have no arguments and no docstring, and none of them are listed in the pandas documentation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 16. Identify rows that are missing from a DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a small example DataFrame:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
00.00.0foo12009-01-01
11.01.0foo22009-01-02
22.00.0foo32009-01-05
33.01.0foo42009-01-06
44.00.0foo52009-01-07
\n", "
" ], "text/plain": [ " A B C D\n", "0 0.0 0.0 foo1 2009-01-01\n", "1 1.0 1.0 foo2 2009-01-02\n", "2 2.0 0.0 foo3 2009-01-05\n", "3 3.0 1.0 foo4 2009-01-06\n", "4 4.0 0.0 foo5 2009-01-07" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NOTE: pd.util.testing is deprecated in pandas 1.0\n", "df1 = pd.util.testing.makeMixedDataFrame()\n", "df1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then let's create a copy of that DataFrame in which rows 2 and 3 are missing:" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
00.00.0foo12009-01-01
11.01.0foo22009-01-02
44.00.0foo52009-01-07
\n", "
" ], "text/plain": [ " A B C D\n", "0 0.0 0.0 foo1 2009-01-01\n", "1 1.0 1.0 foo2 2009-01-02\n", "4 4.0 0.0 foo5 2009-01-07" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = df1.drop([2, 3], axis='rows')\n", "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we needed to identify which rows are missing from the second DataFrame? The easiest way to do this would be to merge the two DataFrames using a left join and set `indicator=True`:" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD_merge
00.00.0foo12009-01-01both
11.01.0foo22009-01-02both
22.00.0foo32009-01-05left_only
33.01.0foo42009-01-06left_only
44.00.0foo52009-01-07both
\n", "
" ], "text/plain": [ " A B C D _merge\n", "0 0.0 0.0 foo1 2009-01-01 both\n", "1 1.0 1.0 foo2 2009-01-02 both\n", "2 2.0 0.0 foo3 2009-01-05 left_only\n", "3 3.0 1.0 foo4 2009-01-06 left_only\n", "4 4.0 0.0 foo5 2009-01-07 both" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3 = pd.merge(df1, df2, how='left', indicator=True)\n", "df3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This adds a column to the DataFrame which shows the source of each row.\n", "\n", "In order to locate the rows that were missing from \"df2\", we simply filter \"df3\" to show the rows that were only present in the left DataFrame:" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD_merge
22.00.0foo32009-01-05left_only
33.01.0foo42009-01-06left_only
\n", "
" ], "text/plain": [ " A B C D _merge\n", "2 2.0 0.0 foo3 2009-01-05 left_only\n", "3 3.0 1.0 foo4 2009-01-06 left_only" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3[df3._merge == 'left_only']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can see that rows 2 and 3 were the missing rows." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 17. Use query to avoid intermediate variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take another look at the stocks DataFrame:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateCloseVolumeSymbol
02016-10-0331.5014070500CSCO
12016-10-03112.5221701800AAPL
22016-10-0357.4219189500MSFT
32016-10-04113.0029736800AAPL
42016-10-0457.2420085900MSFT
52016-10-0431.3518460400CSCO
62016-10-0557.6416726400MSFT
72016-10-0531.5911808600CSCO
82016-10-05113.0521453100AAPL
\n", "
" ], "text/plain": [ " Date Close Volume Symbol\n", "0 2016-10-03 31.50 14070500 CSCO\n", "1 2016-10-03 112.52 21701800 AAPL\n", "2 2016-10-03 57.42 19189500 MSFT\n", "3 2016-10-04 113.00 29736800 AAPL\n", "4 2016-10-04 57.24 20085900 MSFT\n", "5 2016-10-04 31.35 18460400 CSCO\n", "6 2016-10-05 57.64 16726400 MSFT\n", "7 2016-10-05 31.59 11808600 CSCO\n", "8 2016-10-05 113.05 21453100 AAPL" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stocks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you wanted to filter the DataFrame to only show rows in which the Symbol is \"AAPL\", this is the usual approach:" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateCloseVolumeSymbol
12016-10-03112.5221701800AAPL
32016-10-04113.0029736800AAPL
82016-10-05113.0521453100AAPL
\n", "
" ], "text/plain": [ " Date Close Volume Symbol\n", "1 2016-10-03 112.52 21701800 AAPL\n", "3 2016-10-04 113.00 29736800 AAPL\n", "8 2016-10-05 113.05 21453100 AAPL" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stocks[stocks.Symbol == 'AAPL']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, this can also be done using the `query()` method:" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateCloseVolumeSymbol
12016-10-03112.5221701800AAPL
32016-10-04113.0029736800AAPL
82016-10-05113.0521453100AAPL
\n", "
" ], "text/plain": [ " Date Close Volume Symbol\n", "1 2016-10-03 112.52 21701800 AAPL\n", "3 2016-10-04 113.00 29736800 AAPL\n", "8 2016-10-05 113.05 21453100 AAPL" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stocks.query(\"Symbol == 'AAPL'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are three things worth noting about the `query()` method:\n", "\n", "1. You don't have to repeat the name of the DataFrame within the query string.\n", "2. The entire condition is expressed as a string, thus you lose any syntax highlighting.\n", "3. Since there is a string within the condition, you have to use single quotes with the inner string and double quotes with the outer string.\n", "\n", "Let's look at another example that shows the real usefulness of `query()`. First let's group by \"Symbol\" and then take the mean of all numeric columns:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CloseVolume
Symbol
AAPL112.8566672.429723e+07
CSCO31.4800001.477983e+07
MSFT57.4333331.866727e+07
\n", "
" ], "text/plain": [ " Close Volume\n", "Symbol \n", "AAPL 112.856667 2.429723e+07\n", "CSCO 31.480000 1.477983e+07\n", "MSFT 57.433333 1.866727e+07" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stocks.groupby('Symbol').mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if I wanted to filter this DataFrame to only show rows in which \"Close\" is less than 100? The usual approach would be to create a temporary DataFrame and then filter that:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CloseVolume
Symbol
CSCO31.4800001.477983e+07
MSFT57.4333331.866727e+07
\n", "
" ], "text/plain": [ " Close Volume\n", "Symbol \n", "CSCO 31.480000 1.477983e+07\n", "MSFT 57.433333 1.866727e+07" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "temp = stocks.groupby('Symbol').mean()\n", "temp[temp.Close < 100]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But `query()` works even better in this situation, since you can avoid creating an intermediate object:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CloseVolume
Symbol
CSCO31.4800001.477983e+07
MSFT57.4333331.866727e+07
\n", "
" ], "text/plain": [ " Close Volume\n", "Symbol \n", "CSCO 31.480000 1.477983e+07\n", "MSFT 57.433333 1.866727e+07" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stocks.groupby('Symbol').mean().query('Close < 100')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In fact, `query()` is a great solution to our previous trick, because it would have allowed us to filter the merged DataFrame without creating the \"df3\" object:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD_merge
22.00.0foo32009-01-05left_only
33.01.0foo42009-01-06left_only
\n", "
" ], "text/plain": [ " A B C D _merge\n", "2 2.0 0.0 foo3 2009-01-05 left_only\n", "3 3.0 1.0 foo4 2009-01-06 left_only" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, how='left', indicator=True).query(\"_merge == 'left_only'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 18. Reshape a DataFrame from wide format to long format" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create another example DataFrame:" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
zipfactorywarehouseretail
012345100200300
134567400500600
267890700800900
\n", "
" ], "text/plain": [ " zip factory warehouse retail\n", "0 12345 100 200 300\n", "1 34567 400 500 600\n", "2 67890 700 800 900" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "distances = pd.DataFrame([['12345', 100, 200, 300], ['34567', 400, 500, 600], ['67890', 700, 800, 900]],\n", " columns=['zip', 'factory', 'warehouse', 'retail'])\n", "distances" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's pretend that a manufacturing company has three locations: a factory, a warehouse, and a retail store. They've created the DataFrame above, which shows the distance between every US zip code and that particular location.\n", "\n", "Let's create one more DataFrame:" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
user_idziplocation_type
0112345factory
1234567warehouse
\n", "
" ], "text/plain": [ " user_id zip location_type\n", "0 1 12345 factory\n", "1 2 34567 warehouse" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "users = pd.DataFrame([[1, '12345', 'factory'], [2, '34567', 'warehouse']],\n", " columns=['user_id', 'zip', 'location_type'])\n", "users" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a DataFrame of users. It shows the user's zip code and the location they would like to visit. We want to add a fourth column to \"users\", which shows the distance between that user and the location they want to visit. This information is available in the \"distances\" DataFrame, but how do we get it into the \"users\" DataFrame?\n", "\n", "We actually need to merge the DataFrames, but the problem is that the \"distances\" DataFrame doesn't have the right columns for merging. The solution is to reshape it using the `melt()` method:" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ziplocation_typedistance
012345factory100
134567factory400
267890factory700
312345warehouse200
434567warehouse500
567890warehouse800
612345retail300
734567retail600
867890retail900
\n", "
" ], "text/plain": [ " zip location_type distance\n", "0 12345 factory 100\n", "1 34567 factory 400\n", "2 67890 factory 700\n", "3 12345 warehouse 200\n", "4 34567 warehouse 500\n", "5 67890 warehouse 800\n", "6 12345 retail 300\n", "7 34567 retail 600\n", "8 67890 retail 900" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "distances_long = distances.melt(id_vars='zip', var_name='location_type', value_name='distance')\n", "distances_long" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've reshaped the \"distances\" DataFrame from \"wide format\", meaning lots of columns, to \"long format\", meaning lots of rows. It contains the same data as before, but it's now structured such that it can easily be merged with the \"users\" DataFrame:" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
user_idziplocation_typedistance
0112345factory100
1234567warehouse500
\n", "
" ], "text/plain": [ " user_id zip location_type distance\n", "0 1 12345 factory 100\n", "1 2 34567 warehouse 500" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(users, distances_long)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you're ever confused about \"wide\" versus \"long\" data, the easiest way to recognize a \"wide format\" DataFrame is that it doesn't tell you what you're looking at. For example, it doesn't tell me what these numbers represent, and it doesn't tell me what these column names represent. In contrast, the \"long format\" DataFrame tells you that the numbers represent distance and these names represent location types." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 19. Reverse row order (alternative)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might remember the drinks DataFrame from the main video:" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcoholcontinent
0Afghanistan0000.0Asia
1Albania89132544.9Europe
2Algeria250140.7Africa
3Andorra24513831212.4Europe
4Angola21757455.9Africa
\n", "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "0 Afghanistan 0 0 0 \n", "1 Albania 89 132 54 \n", "2 Algeria 25 0 14 \n", "3 Andorra 245 138 312 \n", "4 Angola 217 57 45 \n", "\n", " total_litres_of_pure_alcohol continent \n", "0 0.0 Asia \n", "1 4.9 Europe \n", "2 0.7 Africa \n", "3 12.4 Europe \n", "4 5.9 Africa " ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "drinks.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the method that I taught in the main video for reversing row order, because it will always work:" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcoholcontinent
192Zimbabwe641844.7Africa
191Zambia321942.5Africa
190Yemen6000.1Asia
189Vietnam111212.0Asia
188Venezuela33310037.7South America
\n", "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "192 Zimbabwe 64 18 4 \n", "191 Zambia 32 19 4 \n", "190 Yemen 6 0 0 \n", "189 Vietnam 111 2 1 \n", "188 Venezuela 333 100 3 \n", "\n", " total_litres_of_pure_alcohol continent \n", "192 4.7 Africa \n", "191 2.5 Africa \n", "190 0.1 Asia \n", "189 2.0 Asia \n", "188 7.7 South America " ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "drinks.loc[::-1].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, you can use Python's built-in `reversed()` function to reverse the index, and then use that to reindex the DataFrame:" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcoholcontinent
192Zimbabwe641844.7Africa
191Zambia321942.5Africa
190Yemen6000.1Asia
189Vietnam111212.0Asia
188Venezuela33310037.7South America
\n", "
" ], "text/plain": [ " country beer_servings spirit_servings wine_servings \\\n", "192 Zimbabwe 64 18 4 \n", "191 Zambia 32 19 4 \n", "190 Yemen 6 0 0 \n", "189 Vietnam 111 2 1 \n", "188 Venezuela 333 100 3 \n", "\n", " total_litres_of_pure_alcohol continent \n", "192 4.7 Africa \n", "191 2.5 Africa \n", "190 0.1 Asia \n", "189 2.0 Asia \n", "188 7.7 South America " ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "drinks.reindex(reversed(drinks.index)).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you decide to use this alternative method, be aware that it will fail if the DataFrame has duplicate values in the index. To demonstrate this, let's give the stocks DataFrame a non-unique index:" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CloseVolumeSymbol
Date
2016-10-0331.5014070500CSCO
2016-10-03112.5221701800AAPL
2016-10-0357.4219189500MSFT
2016-10-04113.0029736800AAPL
2016-10-0457.2420085900MSFT
2016-10-0431.3518460400CSCO
2016-10-0557.6416726400MSFT
2016-10-0531.5911808600CSCO
2016-10-05113.0521453100AAPL
\n", "
" ], "text/plain": [ " Close Volume Symbol\n", "Date \n", "2016-10-03 31.50 14070500 CSCO\n", "2016-10-03 112.52 21701800 AAPL\n", "2016-10-03 57.42 19189500 MSFT\n", "2016-10-04 113.00 29736800 AAPL\n", "2016-10-04 57.24 20085900 MSFT\n", "2016-10-04 31.35 18460400 CSCO\n", "2016-10-05 57.64 16726400 MSFT\n", "2016-10-05 31.59 11808600 CSCO\n", "2016-10-05 113.05 21453100 AAPL" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stocks = stocks.set_index('Date')\n", "stocks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the index above is not unique, this will result in an error:" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "# stocks.reindex(reversed(stocks.index))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 20. Reverse column order (alternative)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the method that I taught in the main video for reversing column order, because it will always work:" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
continenttotal_litres_of_pure_alcoholwine_servingsspirit_servingsbeer_servingscountry
0Asia0.0000Afghanistan
1Europe4.95413289Albania
2Africa0.714025Algeria
3Europe12.4312138245Andorra
4Africa5.94557217Angola
\n", "
" ], "text/plain": [ " continent total_litres_of_pure_alcohol wine_servings spirit_servings \\\n", "0 Asia 0.0 0 0 \n", "1 Europe 4.9 54 132 \n", "2 Africa 0.7 14 0 \n", "3 Europe 12.4 312 138 \n", "4 Africa 5.9 45 57 \n", "\n", " beer_servings country \n", "0 0 Afghanistan \n", "1 89 Albania \n", "2 25 Algeria \n", "3 245 Andorra \n", "4 217 Angola " ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "drinks.loc[:, ::-1].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, you can use Python's built-in `reversed()` function to reverse the columns attribute, and then pass that as a filter to the DataFrame:" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
continenttotal_litres_of_pure_alcoholwine_servingsspirit_servingsbeer_servingscountry
0Asia0.0000Afghanistan
1Europe4.95413289Albania
2Africa0.714025Algeria
3Europe12.4312138245Andorra
4Africa5.94557217Angola
\n", "
" ], "text/plain": [ " continent total_litres_of_pure_alcohol wine_servings spirit_servings \\\n", "0 Asia 0.0 0 0 \n", "1 Europe 4.9 54 132 \n", "2 Africa 0.7 14 0 \n", "3 Europe 12.4 312 138 \n", "4 Africa 5.9 45 57 \n", "\n", " beer_servings country \n", "0 0 Afghanistan \n", "1 89 Albania \n", "2 25 Algeria \n", "3 245 Andorra \n", "4 217 Angola " ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "drinks[reversed(drinks.columns)].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you decide to use this alternative method, be aware that it will fail if the DataFrame has duplicate column names. To demonstrate this, let's rename two of the columns in the stocks DataFrame:" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CloseXYZXYZ
Date
2016-10-0331.5014070500CSCO
2016-10-03112.5221701800AAPL
2016-10-0357.4219189500MSFT
2016-10-04113.0029736800AAPL
2016-10-0457.2420085900MSFT
2016-10-0431.3518460400CSCO
2016-10-0557.6416726400MSFT
2016-10-0531.5911808600CSCO
2016-10-05113.0521453100AAPL
\n", "
" ], "text/plain": [ " Close XYZ XYZ\n", "Date \n", "2016-10-03 31.50 14070500 CSCO\n", "2016-10-03 112.52 21701800 AAPL\n", "2016-10-03 57.42 19189500 MSFT\n", "2016-10-04 113.00 29736800 AAPL\n", "2016-10-04 57.24 20085900 MSFT\n", "2016-10-04 31.35 18460400 CSCO\n", "2016-10-05 57.64 16726400 MSFT\n", "2016-10-05 31.59 11808600 CSCO\n", "2016-10-05 113.05 21453100 AAPL" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stocks = stocks.rename({'Symbol':'XYZ', 'Volume':'XYZ'}, axis='columns')\n", "stocks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the column names are not unique, you will get multiple copies of those columns:" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
XYZXYZXYZXYZClose
Date
2016-10-0314070500CSCO14070500CSCO31.50
2016-10-0321701800AAPL21701800AAPL112.52
2016-10-0319189500MSFT19189500MSFT57.42
2016-10-0429736800AAPL29736800AAPL113.00
2016-10-0420085900MSFT20085900MSFT57.24
2016-10-0418460400CSCO18460400CSCO31.35
2016-10-0516726400MSFT16726400MSFT57.64
2016-10-0511808600CSCO11808600CSCO31.59
2016-10-0521453100AAPL21453100AAPL113.05
\n", "
" ], "text/plain": [ " XYZ XYZ XYZ XYZ Close\n", "Date \n", "2016-10-03 14070500 CSCO 14070500 CSCO 31.50\n", "2016-10-03 21701800 AAPL 21701800 AAPL 112.52\n", "2016-10-03 19189500 MSFT 19189500 MSFT 57.42\n", "2016-10-04 29736800 AAPL 29736800 AAPL 113.00\n", "2016-10-04 20085900 MSFT 20085900 MSFT 57.24\n", "2016-10-04 18460400 CSCO 18460400 CSCO 31.35\n", "2016-10-05 16726400 MSFT 16726400 MSFT 57.64\n", "2016-10-05 11808600 CSCO 11808600 CSCO 31.59\n", "2016-10-05 21453100 AAPL 21453100 AAPL 113.05" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stocks[reversed(stocks.columns)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 21. Split a string into multiple columns (alternative)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's an example DataFrame:" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namelocation
0John Arthur DoeLos Angeles, CA
1Jane Ann SmithWashington, DC
\n", "
" ], "text/plain": [ " name location\n", "0 John Arthur Doe Los Angeles, CA\n", "1 Jane Ann Smith Washington, DC" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'name':['John Arthur Doe', 'Jane Ann Smith'], 'location':['Los Angeles, CA', 'Washington, DC']})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the method that I taught in the main video for splitting the \"name\" string into multiple columns:" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namelocationfirstmiddlelast
0John Arthur DoeLos Angeles, CAJohnArthurDoe
1Jane Ann SmithWashington, DCJaneAnnSmith
\n", "
" ], "text/plain": [ " name location first middle last\n", "0 John Arthur Doe Los Angeles, CA John Arthur Doe\n", "1 Jane Ann Smith Washington, DC Jane Ann Smith" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['first', 'middle', 'last']] = df.name.str.split(' ', expand=True)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is an alternative method that also works:" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namelocationfirstmiddlelast
0John Arthur DoeLos Angeles, CAJohnArthurDoe
1Jane Ann SmithWashington, DCJaneAnnSmith
\n", "
" ], "text/plain": [ " name location first middle last\n", "0 John Arthur Doe Los Angeles, CA John Arthur Doe\n", "1 Jane Ann Smith Washington, DC Jane Ann Smith" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['first'], df['middle'], df['last'] = zip(*df.name.str.split(' '))\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's how the alternative method works. First, `str.split()` splits on a space character and returns a Series of two lists:" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 [John, Arthur, Doe]\n", "1 [Jane, Ann, Smith]\n", "Name: name, dtype: object" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.name.str.split(' ')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, you unpack the Series using the asterisk, and zip the lists back together using the `zip()` function:" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('John', 'Jane'), ('Arthur', 'Ann'), ('Doe', 'Smith')]" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(zip(*df.name.str.split(' ')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first, middle, and last names are now paired together as tuples. These tuples become three new DataFrame columns through multiple assignment:" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namelocationfirstmiddlelast
0John Arthur DoeLos Angeles, CAJohnArthurDoe
1Jane Ann SmithWashington, DCJaneAnnSmith
\n", "
" ], "text/plain": [ " name location first middle last\n", "0 John Arthur Doe Los Angeles, CA John Arthur Doe\n", "1 Jane Ann Smith Washington, DC Jane Ann Smith" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['first'], df['middle'], df['last'] = zip(*df.name.str.split(' '))\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Want more tricks? Watch [My top 25 pandas tricks](https://www.youtube.com/watch?v=RlIiVeig3hc&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=35) or [Read the notebook](https://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/top_25_pandas_tricks.ipynb)\n", "\n", "© 2022 [Data School](https://www.dataschool.io). All rights reserved." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 2 }