{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

<a id='top'></a>
# Python Pandas Cheat Sheet

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a data analyst, below are common tasks I perform using Pandas. However, best practice rule of thumb is to use as much SQL as possible for transformations or summarizations before pulling data into pandas as pandas is limited to using your local machine's resources. Another reason why SQL should be used as much as possible, there will likely be more people that are familiar with SQL in your organization than Python/pandas. So to reduce \"technical debt\", Python or pandas use should be limited if data manipulations can be done in SQL instead." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Pandas display output options](#options)\n", "- [Quick summary statistics, # of rows, columns, etc](#quick_summary)\n", "- [Sorting More Than One Column](#sorting)\n", "- [Removing duplicates](#removing_duplicates)\n", "- [Finding duplicates](#find_duplicates)\n", "- [Creating a column based on values from another column](#map)\n", "- [Removing a column](#remove_column)\n", "- [Replacing values in a series](#replacing_values)\n", "- [Dropping or Filling NaN/NA values](#drop_fill_na)\n", "- [Using ```converters``` parameter to convert data during data ingestion](#converters)\n", "- [Reading CSV or Excel files](http://pandas.pydata.org/pandas-docs/stable/io.html)\n", "- [Renaming axis](#renaming_axis)\n", "- [Renaming indexes or columns](#renaming_indexes_columns)\n", "- [Binning data into intervals](#binning)\n", "- [Sub-selecting or Slicing a Data Frame](#slicing)\n", "- [Boolean indexing](#boolean)\n", "- [Accessors](#accessors)\n", "- [Obtaining columns with partial column labels](#partial_column_labels)\n", "- [Getting value counts](#value_counts)\n", "- [Getting cumulative sum](#cum_sum)\n", "- [Grouping data](#group)\n", "- [Pivot table](#pivot_table)\n", "- [Named aggregations (version 0.25)](#named_aggregation)\n", "- [Percent of Rows](#perc_rows)\n", "- [Percent of Columns](#perc_columns)\n", "- [Transpose a data frame](#transpose)\n", "- [Converting index to a column](#index2column)\n", "- [Converting column to index](#column2index)\n", "- [How to add or fill in missing dates](#missing_dates)\n", "- [How to connect to an ODBC data source (Windows)](#database)\n", "- [How to convert data in wide format to long format using melt()](#melt)\n", "- [How to convert data in long format data to wide format using pivot()](#pivot)\n", "- [Using category data type to control sort order](#category)\n", "- [Merging 2 data frames using merge()](#merge)\n", "- [Finding rows containing data with missing values](#missing)\n", "- [Converting a data type of a column in a data frame](#convert_type)\n", "- [Plotting data frames using MATPLOTLIB ver 1.5+](#matplotlib)\n", "- [Method chaining](#chaining)\n", "- [pipe() example](#pipe)\n", "- [Debugging groupby objects with this introspection method](https://realpython.com/python-pandas-tricks/#6-introspect-groupby-objects-via-iteration)\n", "- [Mapping trick for memebership binning](https://realpython.com/python-pandas-tricks/#7-use-this-mapping-trick-for-membership-binning)\n", "- [BONUS #1: A HUGE list of python and pandas snippets by Chris Albon](http://chrisalbon.com/)\n", "- [BONUS #2: More goodies from a major pandas contributor, Tom Augspurger](http://tomaugspurger.github.io/)\n", "- [BONUS #3: Data School YouTube Series on Pandas](https://www.youtube.com/watch?v=yzIMircGU5I&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y) and it's accompanying 
[notebooks](https://github.com/justmarkham/pandas-videos) on Github!\n", "- [BONUS #4: Article on Dask: pandas on steroids](https://towardsdatascience.com/why-every-data-scientist-should-use-dask-81b2b850e15b) [Basic Video Overview](https://www.youtube.com/watch?v=ods97a5Pzw0) [More Detailed Video Overview](https://www.youtube.com/watch?v=mjQ7tCQxYFQ)\n", "- [BONUS #5: 25 Pandas Tips and Tricks by Kevin Markham](https://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/top_25_pandas_tricks.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id='options'></a>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting display output [options](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.set_option.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, a large pandas dataframe (many columns or many rows) won't be rendered completely in your browser. You can override the defaults, but don't be surprised if your browser crashes when it tries to render several hundred thousand rows. Here's the [link](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html) to the pandas display options documentation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "pd.set_option(\"display.max_rows\", 1000)   # or pd.options.display.max_rows=1000\n", "pd.set_option(\"display.max_columns\", 20)  # or pd.options.display.max_columns=20\n", "pd.set_option(\"display.precision\", 7)     # the bare 'precision' alias was removed in pandas 2.0\n", "pd.set_option(\"display.large_repr\", \"truncate\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id='quick_summary'></a>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quick summary statistics using df.describe() and data types using df.info(). Also check out [pandas-summary](https://github.com/mouradmourafiq/pandas-summary), an extension of the data frame's describe() method, and [pandas profiling](https://github.com/JosPolfliet/pandas-profiling)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupounces
0a4.0
1a3.0
2a12.0
3b6.0
4b7.5
5b8.0
6c3.0
7c5.0
8c6.0
\n", "
" ], "text/plain": [ " group ounces\n", "0 a 4.0\n", "1 a 3.0\n", "2 a 12.0\n", "3 b 6.0\n", "4 b 7.5\n", "5 b 8.0\n", "6 c 3.0\n", "7 c 5.0\n", "8 c 6.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "data = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],\n", " 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})\n", "data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ounces
count9.0000000
mean6.0555556
std2.8553069
min3.0000000
25%4.0000000
50%6.0000000
75%7.5000000
max12.0000000
\n", "
" ], "text/plain": [ " ounces\n", "count 9.0000000\n", "mean 6.0555556\n", "std 2.8553069\n", "min 3.0000000\n", "25% 4.0000000\n", "50% 6.0000000\n", "75% 7.5000000\n", "max 12.0000000" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### df.info() shows data types, number of rows and columns, and memory usage of your data frame" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 9 entries, 0 to 8\n", "Data columns (total 2 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 group 9 non-null object \n", " 1 ounces 9 non-null float64\n", "dtypes: float64(1), object(1)\n", "memory usage: 272.0+ bytes\n" ] } ], "source": [ "data.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sorting More Than One Column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top]](#top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Just pass a list of columns you want to sort and pass corresponding list of True/False to the ascending parameter**" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupounces
6c3.0
7c5.0
8c6.0
3b6.0
4b7.5
5b8.0
1a3.0
0a4.0
2a12.0
\n", "
" ], "text/plain": [ " group ounces\n", "6 c 3.0\n", "7 c 5.0\n", "8 c 6.0\n", "3 b 6.0\n", "4 b 7.5\n", "5 b 8.0\n", "1 a 3.0\n", "0 a 4.0\n", "2 a 12.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.sort_values(by=['group','ounces'], ascending=[False, True], inplace=True)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Removing duplicates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [3, 2, 1, 3, 3, 4, 4]})" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2
0one3
1one2
2one1
3two3
4two3
5two4
6two4
\n", "
" ], "text/plain": [ " k1 k2\n", "0 one 3\n", "1 one 2\n", "2 one 1\n", "3 two 3\n", "4 two 3\n", "5 two 4\n", "6 two 4" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.sort_values(by='k2')\n", "data" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2
0one3
1one2
2one1
3two3
5two4
\n", "
" ], "text/plain": [ " k1 k2\n", "0 one 3\n", "1 one 2\n", "2 one 1\n", "3 two 3\n", "5 two 4" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.drop_duplicates() # by default, a \"duplicate\" is defined by all columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define duplicates by column name(s):" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2
0one3
3two3
\n", "
" ], "text/plain": [ " k1 k2\n", "0 one 3\n", "3 two 3" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.drop_duplicates(subset='k1') # duplicate in column k1 only" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finding duplicate rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2
0one3
1one2
2one1
3two3
4two3
5two4
6two4
\n", "
" ], "text/plain": [ " k1 k2\n", "0 one 3\n", "1 one 2\n", "2 one 1\n", "3 two 3\n", "4 two 3\n", "5 two 4\n", "6 two 4" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on method duplicated in module pandas.core.frame:\n", "\n", "duplicated(subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first') -> 'Series' method of pandas.core.frame.DataFrame instance\n", " Return boolean Series denoting duplicate rows.\n", " \n", " Considering certain columns is optional.\n", " \n", " Parameters\n", " ----------\n", " subset : column label or sequence of labels, optional\n", " Only consider certain columns for identifying duplicates, by\n", " default use all of the columns.\n", " keep : {'first', 'last', False}, default 'first'\n", " Determines which duplicates (if any) to mark.\n", " \n", " - ``first`` : Mark duplicates as ``True`` except for the first occurrence.\n", " - ``last`` : Mark duplicates as ``True`` except for the last occurrence.\n", " - False : Mark all duplicates as ``True``.\n", " \n", " Returns\n", " -------\n", " Series\n", " Boolean series for each duplicated rows.\n", " \n", " See Also\n", " --------\n", " Index.duplicated : Equivalent method on index.\n", " Series.duplicated : Equivalent method on Series.\n", " Series.drop_duplicates : Remove duplicate values from Series.\n", " DataFrame.drop_duplicates : Remove duplicate values from DataFrame.\n", " \n", " Examples\n", " --------\n", " Consider dataset containing ramen rating.\n", " \n", " >>> df = pd.DataFrame({\n", " ... 'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],\n", " ... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'],\n", " ... 'rating': [4, 4, 3.5, 15, 5]\n", " ... })\n", " >>> df\n", " brand style rating\n", " 0 Yum Yum cup 4.0\n", " 1 Yum Yum cup 4.0\n", " 2 Indomie cup 3.5\n", " 3 Indomie pack 15.0\n", " 4 Indomie pack 5.0\n", " \n", " By default, for each set of duplicated values, the first occurrence\n", " is set on False and all others on True.\n", " \n", " >>> df.duplicated()\n", " 0 False\n", " 1 True\n", " 2 False\n", " 3 False\n", " 4 False\n", " dtype: bool\n", " \n", " By using 'last', the last occurrence of each set of duplicated values\n", " is set on False and all others on True.\n", " \n", " >>> df.duplicated(keep='last')\n", " 0 True\n", " 1 False\n", " 2 False\n", " 3 False\n", " 4 False\n", " dtype: bool\n", " \n", " By setting ``keep`` on False, all duplicates are True.\n", " \n", " >>> df.duplicated(keep=False)\n", " 0 True\n", " 1 True\n", " 2 False\n", " 3 False\n", " 4 False\n", " dtype: bool\n", " \n", " To find duplicates on specific column(s), use ``subset``.\n", " \n", " >>> df.duplicated(subset=['brand'])\n", " 0 False\n", " 1 True\n", " 2 False\n", " 3 True\n", " 4 True\n", " dtype: bool\n", "\n" ] } ], "source": [ "help(data.duplicated)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2
4two3
6two4
\n", "
" ], "text/plain": [ " k1 k2\n", "4 two 3\n", "6 two 4" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "duplicate_rows = data[data.duplicated()]\n", "duplicate_rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a new column based on values from one or more columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are familiar with Excel's IF/ELSE function or SQL's CASE WHEN statements, then this section is for you. For smallish data sets and not too complex IF/ELSE logic, `map()` and `apply()` are fine. But for larger data sets, you will notice a significant slowdown. For large data sets, you have a few options:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- use pandas `.loc`\n", "- use `np.where`\n", "- use `np.select`\n", "- use [swifter](https://github.com/jmcarpenter2/swifter) library\n", "- Dask's dataframe [apply](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.apply) which swifter does for you under the hood" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This [article](https://towardsdatascience.com/efficient-implementation-of-conditional-logic-on-pandas-dataframes-4afa61eb7fce) summarizes the first 3 options above and explains the reasons for slowdown. You may have to open the link using your browser's private/incognito mode." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the data was originally sourced from a database table, then apply CASE WHEN logic prior to bringing the data into pandas. Best practice rule of thumb is to use as much SQL as possible for transformations or summarizations before pulling data into pandas as pandas is limited to using your local machine's resources. Another reason why SQL should be used as much as possible, there will likely be more people that are familiar with SQL in your organization than Python/pandas. So to reduce \"technical debt\", Python or pandas use should be limited if data manipulations can be done in SQL instead." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
foodounces
0bacon4.0
1pulled pork3.0
2bacon12.0
3pastrami6.0
4corned beef7.5
5bacon8.0
6pastrami3.0
7honey ham5.0
8nova lox6.0
\n", "
" ], "text/plain": [ " food ounces\n", "0 bacon 4.0\n", "1 pulled pork 3.0\n", "2 bacon 12.0\n", "3 pastrami 6.0\n", "4 corned beef 7.5\n", "5 bacon 8.0\n", "6 pastrami 3.0\n", "7 honey ham 5.0\n", "8 nova lox 6.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'pastrami','corned beef', 'bacon', 'pastrami', 'honey ham','nova lox'],\n", " 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Suppose you wanted to add a column indicating the type of animal that each food came from. Let’s write down a mapping of each distinct meat type to the kind of animal using a dictionary and also a function:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "meat_to_animal = {\n", "'bacon': 'pig',\n", "'pulled pork': 'pig',\n", "'pastrami': 'cow',\n", "'corned beef': 'cow',\n", "'honey ham': 'pig',\n", "'nova lox': 'salmon'\n", "}\n", "\n", "def meat2animal(column):\n", " if column == 'bacon':\n", " return 'pig'\n", " elif column == 'pulled pork':\n", " return 'pig'\n", " elif column == 'pastrami':\n", " return 'cow'\n", " elif column == 'corned beef':\n", " return 'cow'\n", " elif column == 'honey ham':\n", " return 'pig'\n", " else:\n", " return 'salmon'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the Python dictionary from above:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
foodouncesanimal
0bacon4.0pig
1pulled pork3.0pig
2bacon12.0pig
3pastrami6.0cow
4corned beef7.5cow
5bacon8.0pig
6pastrami3.0cow
7honey ham5.0pig
8nova lox6.0salmon
\n", "
" ], "text/plain": [ " food ounces animal\n", "0 bacon 4.0 pig\n", "1 pulled pork 3.0 pig\n", "2 bacon 12.0 pig\n", "3 pastrami 6.0 cow\n", "4 corned beef 7.5 cow\n", "5 bacon 8.0 pig\n", "6 pastrami 3.0 cow\n", "7 honey ham 5.0 pig\n", "8 nova lox 6.0 salmon" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['animal'] = data['food'].map(meat_to_animal)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or using the function ```meat2animal()```:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
foodouncesanimal
0bacon4.0pig
1pulled pork3.0pig
2bacon12.0pig
3pastrami6.0cow
4corned beef7.5cow
5bacon8.0pig
6pastrami3.0cow
7honey ham5.0pig
8nova lox6.0salmon
\n", "
" ], "text/plain": [ " food ounces animal\n", "0 bacon 4.0 pig\n", "1 pulled pork 3.0 pig\n", "2 bacon 12.0 pig\n", "3 pastrami 6.0 cow\n", "4 corned beef 7.5 cow\n", "5 bacon 8.0 pig\n", "6 pastrami 3.0 cow\n", "7 honey ham 5.0 pig\n", "8 nova lox 6.0 salmon" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['animal'] = data['food'].map(meat2animal)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### We could also modify the function slightly to accept a dataframe row and perform logic on more than one column" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def use2columns(row):\n", " if row['animal'] == 'pig' and row['ounces'] > 4:\n", " return 'Big Pig'\n", " elif row['animal'] == 'pig' and row['ounces'] <= 4:\n", " return 'Little Pig'\n", " else:\n", " return 'Other Animal'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE:** With large dataframes, applying a function across an entire row will impose a huge performance penalty." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
foodouncesanimalanimal2
0bacon4.0pigLittle Pig
1pulled pork3.0pigLittle Pig
2bacon12.0pigBig Pig
3pastrami6.0cowOther Animal
4corned beef7.5cowOther Animal
5bacon8.0pigBig Pig
6pastrami3.0cowOther Animal
7honey ham5.0pigBig Pig
8nova lox6.0salmonOther Animal
\n", "
" ], "text/plain": [ " food ounces animal animal2\n", "0 bacon 4.0 pig Little Pig\n", "1 pulled pork 3.0 pig Little Pig\n", "2 bacon 12.0 pig Big Pig\n", "3 pastrami 6.0 cow Other Animal\n", "4 corned beef 7.5 cow Other Animal\n", "5 bacon 8.0 pig Big Pig\n", "6 pastrami 3.0 cow Other Animal\n", "7 honey ham 5.0 pig Big Pig\n", "8 nova lox 6.0 salmon Other Animal" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['animal2'] = data.apply(use2columns, axis='columns')\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use apply() along with lambda function to create new columns:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
valuedate
0-0.98932202013-01-01
10.15263212013-01-02
21.48331272013-01-03
31.43103362013-01-04
4-2.31510072013-01-05
50.49999762013-01-06
60.81936812013-01-07
7-1.03780932013-01-08
8-0.66734182013-01-09
90.31813292013-01-10
\n", "
" ], "text/plain": [ " value date\n", "0 -0.9893220 2013-01-01\n", "1 0.1526321 2013-01-02\n", "2 1.4833127 2013-01-03\n", "3 1.4310336 2013-01-04\n", "4 -2.3151007 2013-01-05\n", "5 0.4999976 2013-01-06\n", "6 0.8193681 2013-01-07\n", "7 -1.0378093 2013-01-08\n", "8 -0.6673418 2013-01-09\n", "9 0.3181329 2013-01-10" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "dates = pd.date_range('20130101',periods=10)\n", "data = {'value': [value[0] for value in np.random.randn(10,1).tolist()], 'date': dates }\n", "df = pd.DataFrame.from_dict(data)\n", "df" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "value float64\n", "date datetime64[ns]\n", "dtype: object" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### OBJECTIVE: Let's say you want to create a month and year column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Since ```date``` column is an actual ```datetime``` object, we can leverage its strftime() function:**" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
valuedatemonthyear
0-0.98932202013-01-012013-012013
10.15263212013-01-022013-012013
21.48331272013-01-032013-012013
31.43103362013-01-042013-012013
4-2.31510072013-01-052013-012013
50.49999762013-01-062013-012013
60.81936812013-01-072013-012013
7-1.03780932013-01-082013-012013
8-0.66734182013-01-092013-012013
90.31813292013-01-102013-012013
\n", "
" ], "text/plain": [ " value date month year\n", "0 -0.9893220 2013-01-01 2013-01 2013\n", "1 0.1526321 2013-01-02 2013-01 2013\n", "2 1.4833127 2013-01-03 2013-01 2013\n", "3 1.4310336 2013-01-04 2013-01 2013\n", "4 -2.3151007 2013-01-05 2013-01 2013\n", "5 0.4999976 2013-01-06 2013-01 2013\n", "6 0.8193681 2013-01-07 2013-01 2013\n", "7 -1.0378093 2013-01-08 2013-01 2013\n", "8 -0.6673418 2013-01-09 2013-01 2013\n", "9 0.3181329 2013-01-10 2013-01 2013" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['month'] = df['date'].apply(lambda x: x.strftime('%Y-%m'))\n", "df['year'] = df['date'].apply(lambda x: x.strftime('%Y'))\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a new column based on values in a separate dataframe using a combination of map, query and lambda function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A better alternative is to use the [merge()](#merge) function." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupounces
0a4.0
1a3.0
2a12.0
3b6.0
4b7.5
5b8.0
6c3.0
7c5.0
8c6.0
\n", "
" ], "text/plain": [ " group ounces\n", "0 a 4.0\n", "1 a 3.0\n", "2 a 12.0\n", "3 b 6.0\n", "4 b 7.5\n", "5 b 8.0\n", "6 c 3.0\n", "7 c 5.0\n", "8 c 6.0" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "left = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],\n", " 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})\n", "left" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelvalue
0aalpha
1bbeta
2ccharlie
\n", "
" ], "text/plain": [ " label value\n", "0 a alpha\n", "1 b beta\n", "2 c charlie" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "right = pd.DataFrame({'label': ['a','b','c'],\n", " 'value': ['alpha','beta','charlie']})\n", "right" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupouncesLabel
0a4.0alpha
1a3.0alpha
2a12.0alpha
3b6.0beta
4b7.5beta
5b8.0beta
6c3.0charlie
7c5.0charlie
8c6.0charlie
\n", "
" ], "text/plain": [ " group ounces Label\n", "0 a 4.0 alpha\n", "1 a 3.0 alpha\n", "2 a 12.0 alpha\n", "3 b 6.0 beta\n", "4 b 7.5 beta\n", "5 b 8.0 beta\n", "6 c 3.0 charlie\n", "7 c 5.0 charlie\n", "8 c 6.0 charlie" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left['Label'] = left['group'].map(\n", " lambda group: right.query(\"label == @group\")['value'].iloc[0]\n", ")\n", "left" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### An equivalent to SQL's ROW_NUMBER / PARTITION BY" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### OBJECTIVE:\n", " - To identify duplicate records with row numbering\n", " - In the example below, a duplicate record is based on CLAIM_NUM and PART_NUM (both sorted by ascending order)\n", " - With records having a row number other than 1, we want to set cost amounts to zero" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This example mimics SQL's [row_number()](https://docs.microsoft.com/en-us/sql/t-sql/functions/row-number-transact-sql) function." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "df = pd.read_clipboard()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CLAIM_NUMPART_NUMPART_COST_USDLABOR_COST_USDHANDLING_COST_USDTOTAL_COST_USD
01062315LH645.3360.3446.30751.97
11062345LH323.5567.2520.56751.97
21062015LH303.1380.4535.34751.97
32062315LH613.4560.3446.30720.09
42062015LH300.2580.4535.34720.09
53062345LH333.1067.2520.56420.91
64062345LH300.2580.4546.30427.00
\n", "
" ], "text/plain": [ " CLAIM_NUM PART_NUM PART_COST_USD LABOR_COST_USD HANDLING_COST_USD \\\n", "0 1 062315LH 645.33 60.34 46.30 \n", "1 1 062345LH 323.55 67.25 20.56 \n", "2 1 062015LH 303.13 80.45 35.34 \n", "3 2 062315LH 613.45 60.34 46.30 \n", "4 2 062015LH 300.25 80.45 35.34 \n", "5 3 062345LH 333.10 67.25 20.56 \n", "6 4 062345LH 300.25 80.45 46.30 \n", "\n", " TOTAL_COST_USD \n", "0 751.97 \n", "1 751.97 \n", "2 751.97 \n", "3 720.09 \n", "4 720.09 \n", "5 420.91 \n", "6 427.00 " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use combination of sort_values(), groupby(), and cumcount() function to create ```ROW_NUM``` column:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CLAIM_NUMPART_NUMPART_COST_USDLABOR_COST_USDHANDLING_COST_USDTOTAL_COST_USDROW_NUM
21062015LH303.1380.4535.34751.971
01062315LH645.3360.3446.30751.972
11062345LH323.5567.2520.56751.973
42062015LH300.2580.4535.34720.091
32062315LH613.4560.3446.30720.092
53062345LH333.1067.2520.56420.911
64062345LH300.2580.4546.30427.001
\n", "
" ], "text/plain": [ " CLAIM_NUM PART_NUM PART_COST_USD LABOR_COST_USD HANDLING_COST_USD \\\n", "2 1 062015LH 303.13 80.45 35.34 \n", "0 1 062315LH 645.33 60.34 46.30 \n", "1 1 062345LH 323.55 67.25 20.56 \n", "4 2 062015LH 300.25 80.45 35.34 \n", "3 2 062315LH 613.45 60.34 46.30 \n", "5 3 062345LH 333.10 67.25 20.56 \n", "6 4 062345LH 300.25 80.45 46.30 \n", "\n", " TOTAL_COST_USD ROW_NUM \n", "2 751.97 1 \n", "0 751.97 2 \n", "1 751.97 3 \n", "4 720.09 1 \n", "3 720.09 2 \n", "5 420.91 1 \n", "6 427.00 1 " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['ROW_NUM'] = df.sort_values(by=['PART_NUM']).groupby(['CLAIM_NUM']).cumcount() + 1\n", "df.sort_values(by=['CLAIM_NUM', 'ROW_NUM'], inplace=True)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then use np.where() to \"zero out\" the cost columns whose row numbers are not 1:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "df['PART_COST_USD'] = np.where(df['ROW_NUM'] == 1, df['PART_COST_USD'], 0)\n", "df['LABOR_COST_USD'] = np.where(df['ROW_NUM'] == 1, df['LABOR_COST_USD'], 0)\n", "df['HANDLING_COST_USD'] = np.where(df['ROW_NUM'] == 1, df['HANDLING_COST_USD'], 0)\n", "df['TOTAL_COST_USD'] = np.where(df['ROW_NUM'] == 1, df['TOTAL_COST_USD'], 0)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CLAIM_NUMPART_NUMPART_COST_USDLABOR_COST_USDHANDLING_COST_USDTOTAL_COST_USDROW_NUM
21062015LH303.1380.4535.34751.971
01062315LH0.000.000.000.002
11062345LH0.000.000.000.003
42062015LH300.2580.4535.34720.091
32062315LH0.000.000.000.002
53062345LH333.1067.2520.56420.911
64062345LH300.2580.4546.30427.001
\n", "
" ], "text/plain": [ " CLAIM_NUM PART_NUM PART_COST_USD LABOR_COST_USD HANDLING_COST_USD \\\n", "2 1 062015LH 303.13 80.45 35.34 \n", "0 1 062315LH 0.00 0.00 0.00 \n", "1 1 062345LH 0.00 0.00 0.00 \n", "4 2 062015LH 300.25 80.45 35.34 \n", "3 2 062315LH 0.00 0.00 0.00 \n", "5 3 062345LH 333.10 67.25 20.56 \n", "6 4 062345LH 300.25 80.45 46.30 \n", "\n", " TOTAL_COST_USD ROW_NUM \n", "2 751.97 1 \n", "0 0.00 2 \n", "1 0.00 3 \n", "4 720.09 1 \n", "3 0.00 2 \n", "5 420.91 1 \n", "6 427.00 1 " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### We can also use [dataframe.assign()](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro-chained-assignment) function to create a new column which was added in pandas version 0.16" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1data2ratio
0-0.1957977-1.30061770.1505420
11.50436931.85486450.8110400
2-2.1327051-0.92183252.3135495
30.4580521-1.6185442-0.2830025
4-1.42603600.2567198-5.5548338
\n", "
" ], "text/plain": [ " data1 data2 ratio\n", "0 -0.1957977 -1.3006177 0.1505420\n", "1 1.5043693 1.8548645 0.8110400\n", "2 -2.1327051 -0.9218325 2.3135495\n", "3 0.4580521 -1.6185442 -0.2830025\n", "4 -1.4260360 0.2567198 -5.5548338" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "df = pd.DataFrame({'data1' : np.random.randn(5),\n", " 'data2' : np.random.randn(5)})\n", "\n", "df.assign(ratio = df['data1'] / df['data2'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Text to Columns Example" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namenumber
0John Doe12
1Jane Doe234
2Someone Else235
\n", "
" ], "text/plain": [ " name number\n", "0 John Doe 12\n", "1 Jane Doe 234\n", "2 Someone Else 235" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame({'name': ['John Doe','Jane Doe','Someone Else'], \n", " 'number': [12, 234, 235]\n", " })\n", "df" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "df[['first_name', 'last_name']] = df['name'].str.split(' ', expand=True)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namenumberfirst_namelast_name
0John Doe12JohnDoe
1Jane Doe234JaneDoe
2Someone Else235SomeoneElse
\n", "
" ], "text/plain": [ " name number first_name last_name\n", "0 John Doe 12 John Doe\n", "1 Jane Doe 234 Jane Doe\n", "2 Someone Else 235 Someone Else" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Removing or dropping a column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back top top](#top)]" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
foodounces
0bacon4.0
1pulled pork3.0
2bacon12.0
3pastrami6.0
4corned beef7.5
5bacon8.0
6pastrami3.0
7honey ham5.0
8nova lox6.0
\n", "
" ], "text/plain": [ " food ounces\n", "0 bacon 4.0\n", "1 pulled pork 3.0\n", "2 bacon 12.0\n", "3 pastrami 6.0\n", "4 corned beef 7.5\n", "5 bacon 8.0\n", "6 pastrami 3.0\n", "7 honey ham 5.0\n", "8 nova lox 6.0" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'pastrami','corned beef', 'bacon', 'pastrami', 'honey ham','nova lox'],\n", " 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})\n", "data" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ounces
04.0
13.0
212.0
36.0
47.5
58.0
63.0
75.0
86.0
\n", "
" ], "text/plain": [ " ounces\n", "0 4.0\n", "1 3.0\n", "2 12.0\n", "3 6.0\n", "4 7.5\n", "5 8.0\n", "6 3.0\n", "7 5.0\n", "8 6.0" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.drop(columns=['food'], inplace=True)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Replacing Values in a Series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 -999.0\n", "2 2.0\n", "3 -999.0\n", "4 -1000.0\n", "5 3.0\n", "dtype: float64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "data = pd.Series([1., -999., 2., -999., -1000., 3.])\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### If you want to replace -999 with NaN:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 NaN\n", "2 2.0\n", "3 NaN\n", "4 -1000.0\n", "5 3.0\n", "dtype: float64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.replace(-999, np.nan, inplace=True)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### If you want to replace multiple values at once, you instead pass a list then the substitute value:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 -999.0\n", "2 2.0\n", "3 -999.0\n", "4 -1000.0\n", "5 3.0\n", "dtype: float64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "data = pd.Series([1., -999., 2., -999., -1000., 3.])\n", "data" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 NaN\n", "2 2.0\n", "3 NaN\n", "4 NaN\n", "5 3.0\n", "dtype: float64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.replace([-999, -1000], np.nan, inplace=True)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filling or Dropping NaN (\"Not a Number\") Values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Dropping rows that have NaN:**" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "2 2.0\n", "5 3.0\n", "dtype: float64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Filling/replacing NaN values with something else (replace NaN with 0/zero):**" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 0.0\n", "2 2.0\n", "3 0.0\n", "4 0.0\n", "5 3.0\n", "dtype: float64" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.fillna(value=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using ```converters``` parameter to transform data during data ingestion" ] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from datetime import datetime" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateGBP_to_USD
0Wednesday 15 November 20171 GBP = 1.317 USD
1Tuesday 14 November 20171 GBP = 1.3158 USD
2Monday 13 November 20171 GBP = 1.312 USD
3Sunday 12 November 20171 GBP = 1.32 USD
4Saturday 11 November 20171 GBP = 1.32 USD
\n", "
" ], "text/plain": [ " Date GBP_to_USD\n", "0 Wednesday 15 November 2017 1 GBP = 1.317 USD\n", "1 Tuesday 14 November 2017 1 GBP = 1.3158 USD\n", "2 Monday 13 November 2017 1 GBP = 1.312 USD\n", "3 Sunday 12 November 2017 1 GBP = 1.32 USD\n", "4 Saturday 11 November 2017 1 GBP = 1.32 USD" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_clipboard()\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### OBJECTIVE: You want to make the dates actual dates and you want to extract the USD amounts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Need to convert the string dates into actual Python dates and extract just the numeric values from the ```GBP_to_USD``` columns" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateGBP_to_USD
02017-11-151.317
12017-11-141.3158
22017-11-131.312
32017-11-121.32
42017-11-111.32
\n", "
" ], "text/plain": [ " Date GBP_to_USD\n", "0 2017-11-15 1.317\n", "1 2017-11-14 1.3158\n", "2 2017-11-13 1.312\n", "3 2017-11-12 1.32\n", "4 2017-11-11 1.32" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_clipboard(converters={'Date': lambda x: datetime.strptime(x, \"%A %d %B %Y\"),\n", " 'GBP_to_USD': lambda x: x.split(\" \")[3]\n", " })\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Renaming Index or Columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE:** In pandas land, index is equivalent to your row labels or row numbers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### rename() can be used in conjunction with a dict-like object providing new values for a subset of the axis labels:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
Ohio0123
Colorado4567
New York891011
\n", "
" ], "text/plain": [ " one two three four\n", "Ohio 0 1 2 3\n", "Colorado 4 5 6 7\n", "New York 8 9 10 11" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "data = pd.DataFrame(np.arange(12).reshape((3, 4)),index=['Ohio', 'Colorado', 'New York'],columns=['one', 'two', 'three', 'four'])\n", "data" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwopeekaboofour
INDIANA0123
Colorado4567
New York891011
\n", "
" ], "text/plain": [ " one two peekaboo four\n", "INDIANA 0 1 2 3\n", "Colorado 4 5 6 7\n", "New York 8 9 10 11" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.rename(index={'Ohio': 'INDIANA'},columns={'three': 'peekaboo'},inplace=True)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also apply str functions to modify the index or column labels" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ONETWOPEEKABOOFOUR
Indiana0123
Colorado4567
New York891011
\n", "
" ], "text/plain": [ " ONE TWO PEEKABOO FOUR\n", "Indiana 0 1 2 3\n", "Colorado 4 5 6 7\n", "New York 8 9 10 11" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.rename(index=str.title, columns=str.upper, inplace=True) # str.title means to make the 1st letter capitalized only\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Binning Data Into Intervals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Let’s divide these into bins of 18 to 25, 26 to 35, 35 to 60, and finally 60 and older. To do so, you have to use `cut()` function:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "bins = [18, 25, 35, 60, 100]" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]\n", "Length: 12\n", "Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cats = pd.cut(ages, bins)\n", "cats" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(18, 25] 5\n", "(35, 60] 3\n", "(25, 35] 3\n", "(60, 100] 1\n", "dtype: int64" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.value_counts(cats)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### If you want the right value to be exclusive in the intervals:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]\n", "Length: 12\n", "Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.cut(ages, [18, 26, 36, 61, 100], right=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### You can also pass your own bin names by passing a list or array to the labels option:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Youth 5\n", "MiddleAged 3\n", "YoungAdult 3\n", "Senior 1\n", "dtype: int64" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']\n", "pd.cut(ages, bins, labels=group_names)\n", "pd.value_counts(pd.cut(ages, bins, labels=group_names))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sub-selecting or Slicing a Data Frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Filtering by label name: [[loc](#loc)]
\n", "Filtering by index row and/or column: [[iloc](#iloc)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
2013-01-01-0.6096394-1.4504565-0.4148328-0.4758915
2013-01-021.6101875-0.7751663-0.29171611.1975733
2013-01-03-2.7636373-0.12625120.2035659-1.0004354
2013-01-041.73856110.00707010.04794910.8737716
2013-01-050.13616470.8550785-0.34414741.5101716
2013-01-061.33216680.5266501-0.35357081.8317850
\n", "
" ], "text/plain": [ " A B C D\n", "2013-01-01 -0.6096394 -1.4504565 -0.4148328 -0.4758915\n", "2013-01-02 1.6101875 -0.7751663 -0.2917161 1.1975733\n", "2013-01-03 -2.7636373 -0.1262512 0.2035659 -1.0004354\n", "2013-01-04 1.7385611 0.0070701 0.0479491 0.8737716\n", "2013-01-05 0.1361647 0.8550785 -0.3441474 1.5101716\n", "2013-01-06 1.3321668 0.5266501 -0.3535708 1.8317850" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "dates = pd.date_range('20130101',periods=6)\n", "df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting first n rows of data frame using index slicing syntax" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
2013-01-01-0.6096394-1.4504565-0.4148328-0.4758915
2013-01-021.6101875-0.7751663-0.29171611.1975733
2013-01-03-2.7636373-0.12625120.2035659-1.0004354
\n", "
" ], "text/plain": [ " A B C D\n", "2013-01-01 -0.6096394 -1.4504565 -0.4148328 -0.4758915\n", "2013-01-02 1.6101875 -0.7751663 -0.2917161 1.1975733\n", "2013-01-03 -2.7636373 -0.1262512 0.2035659 -1.0004354" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[0:3] # get first 3 rows of the data frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Slicing based on data frame's index range" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
2013-01-021.6101875-0.7751663-0.29171611.1975733
2013-01-03-2.7636373-0.12625120.2035659-1.0004354
2013-01-041.73856110.00707010.04794910.8737716
\n", "
" ], "text/plain": [ " A B C D\n", "2013-01-02 1.6101875 -0.7751663 -0.2917161 1.1975733\n", "2013-01-03 -2.7636373 -0.1262512 0.2035659 -1.0004354\n", "2013-01-04 1.7385611 0.0070701 0.0479491 0.8737716" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['20130102':'20130104'] # get rows by index range" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Slicing based on column labels/names using loc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[view df](#slicing)]" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
2013-01-01-0.6096394-1.4504565
2013-01-021.6101875-0.7751663
2013-01-03-2.7636373-0.1262512
2013-01-041.73856110.0070701
2013-01-050.13616470.8550785
2013-01-061.33216680.5266501
\n", "
" ], "text/plain": [ " A B\n", "2013-01-01 -0.6096394 -1.4504565\n", "2013-01-02 1.6101875 -0.7751663\n", "2013-01-03 -2.7636373 -0.1262512\n", "2013-01-04 1.7385611 0.0070701\n", "2013-01-05 0.1361647 0.8550785\n", "2013-01-06 1.3321668 0.5266501" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[:,['A','B']] # syntax is: df.loc[rows_index, cols_index]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Slicing based on row index label and column label combined using loc" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
2013-01-021.6101875-0.7751663
2013-01-03-2.7636373-0.1262512
2013-01-041.73856110.0070701
\n", "
" ], "text/plain": [ " A B\n", "2013-01-02 1.6101875 -0.7751663\n", "2013-01-03 -2.7636373 -0.1262512\n", "2013-01-04 1.7385611 0.0070701" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['20130102':'20130104',['A','B']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Resampling Time Series Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For complete list of resampling time periods - [link](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
A
2013-01-011.0478521
2013-01-02-0.8181170
2013-01-03-0.6467086
2013-01-040.6722718
2013-01-050.6782096
\n", "
" ], "text/plain": [ " A\n", "2013-01-01 1.0478521\n", "2013-01-02 -0.8181170\n", "2013-01-03 -0.6467086\n", "2013-01-04 0.6722718\n", "2013-01-05 0.6782096" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "dates = pd.date_range('20130101',periods=180)\n", "df = pd.DataFrame(np.random.randn(180,1),index=dates,columns=list('A'))\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Partial date filtering using loc, get just January's data:**" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
A
2013-01-011.0478521
2013-01-02-0.8181170
2013-01-03-0.6467086
2013-01-040.6722718
2013-01-050.6782096
2013-01-06-0.8074374
2013-01-071.5643882
2013-01-080.3723588
2013-01-090.4197943
2013-01-101.5196560
2013-01-11-0.4932106
2013-01-120.4736382
2013-01-130.2886090
2013-01-140.8174301
2013-01-150.0395030
2013-01-161.5106605
2013-01-17-2.0664942
2013-01-180.8735501
2013-01-19-0.8793671
2013-01-201.0978811
2013-01-21-0.5022753
2013-01-22-1.1285848
2013-01-23-1.2105503
2013-01-24-0.3503544
2013-01-250.0748389
2013-01-260.8998185
2013-01-27-0.0481116
2013-01-28-1.7208055
2013-01-29-0.1856816
2013-01-30-0.0957440
2013-01-310.0110788
\n", "
" ], "text/plain": [ " A\n", "2013-01-01 1.0478521\n", "2013-01-02 -0.8181170\n", "2013-01-03 -0.6467086\n", "2013-01-04 0.6722718\n", "2013-01-05 0.6782096\n", "2013-01-06 -0.8074374\n", "2013-01-07 1.5643882\n", "2013-01-08 0.3723588\n", "2013-01-09 0.4197943\n", "2013-01-10 1.5196560\n", "2013-01-11 -0.4932106\n", "2013-01-12 0.4736382\n", "2013-01-13 0.2886090\n", "2013-01-14 0.8174301\n", "2013-01-15 0.0395030\n", "2013-01-16 1.5106605\n", "2013-01-17 -2.0664942\n", "2013-01-18 0.8735501\n", "2013-01-19 -0.8793671\n", "2013-01-20 1.0978811\n", "2013-01-21 -0.5022753\n", "2013-01-22 -1.1285848\n", "2013-01-23 -1.2105503\n", "2013-01-24 -0.3503544\n", "2013-01-25 0.0748389\n", "2013-01-26 0.8998185\n", "2013-01-27 -0.0481116\n", "2013-01-28 -1.7208055\n", "2013-01-29 -0.1856816\n", "2013-01-30 -0.0957440\n", "2013-01-31 0.0110788" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['2013-01']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**UPDATE:** You no longer need to use the .loc() function if your index is already of type ```DateTime```:" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
A
2013-01-011.0478521
2013-01-02-0.8181170
2013-01-03-0.6467086
2013-01-040.6722718
2013-01-050.6782096
2013-01-06-0.8074374
2013-01-071.5643882
2013-01-080.3723588
2013-01-090.4197943
2013-01-101.5196560
2013-01-11-0.4932106
2013-01-120.4736382
2013-01-130.2886090
2013-01-140.8174301
2013-01-150.0395030
2013-01-161.5106605
2013-01-17-2.0664942
2013-01-180.8735501
2013-01-19-0.8793671
2013-01-201.0978811
2013-01-21-0.5022753
2013-01-22-1.1285848
2013-01-23-1.2105503
2013-01-24-0.3503544
2013-01-250.0748389
2013-01-260.8998185
2013-01-27-0.0481116
2013-01-28-1.7208055
2013-01-29-0.1856816
2013-01-30-0.0957440
2013-01-310.0110788
\n", "
" ], "text/plain": [ " A\n", "2013-01-01 1.0478521\n", "2013-01-02 -0.8181170\n", "2013-01-03 -0.6467086\n", "2013-01-04 0.6722718\n", "2013-01-05 0.6782096\n", "2013-01-06 -0.8074374\n", "2013-01-07 1.5643882\n", "2013-01-08 0.3723588\n", "2013-01-09 0.4197943\n", "2013-01-10 1.5196560\n", "2013-01-11 -0.4932106\n", "2013-01-12 0.4736382\n", "2013-01-13 0.2886090\n", "2013-01-14 0.8174301\n", "2013-01-15 0.0395030\n", "2013-01-16 1.5106605\n", "2013-01-17 -2.0664942\n", "2013-01-18 0.8735501\n", "2013-01-19 -0.8793671\n", "2013-01-20 1.0978811\n", "2013-01-21 -0.5022753\n", "2013-01-22 -1.1285848\n", "2013-01-23 -1.2105503\n", "2013-01-24 -0.3503544\n", "2013-01-25 0.0748389\n", "2013-01-26 0.8998185\n", "2013-01-27 -0.0481116\n", "2013-01-28 -1.7208055\n", "2013-01-29 -0.1856816\n", "2013-01-30 -0.0957440\n", "2013-01-31 0.0110788" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['2013-01']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Resample daily data to monthly data**" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
A
2013-01-311.4080967
2013-02-28-3.1467609
2013-03-31-4.7020674
2013-04-30-10.9058725
2013-05-312.5918750
2013-06-3011.2540062
\n", "
" ], "text/plain": [ " A\n", "2013-01-31 1.4080967\n", "2013-02-28 -3.1467609\n", "2013-03-31 -4.7020674\n", "2013-04-30 -10.9058725\n", "2013-05-31 2.5918750\n", "2013-06-30 11.2540062" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "monthly = df.resample('M').sum()\n", "monthly.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**But now, I prefer to see YYYY-MM format - for complete list of date formats - [link](http://strftime.org/)**" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
A
2013-011.4080967
2013-02-3.1467609
2013-03-4.7020674
2013-04-10.9058725
2013-052.5918750
2013-0611.2540062
\n", "
" ], "text/plain": [ " A\n", "2013-01 1.4080967\n", "2013-02 -3.1467609\n", "2013-03 -4.7020674\n", "2013-04 -10.9058725\n", "2013-05 2.5918750\n", "2013-06 11.2540062" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "monthly.index = monthly.index.strftime('%Y-%m')\n", "monthly.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Slicing based on index / numerical position of the row or column using iloc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[view df](#slicing)]" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
2013-01-01-0.75944731.19399951.08311080.3719371
2013-01-020.9382271-0.73387310.53631711.4589336
2013-01-03-0.2829617-0.9869213-0.7907918-3.3821940
2013-01-040.12835170.10851600.57319050.3059759
2013-01-05-0.8193987-0.5878911-0.73041312.0117699
2013-01-06-0.1047669-1.6513622-0.75486270.1238924
\n", "
" ], "text/plain": [ " A B C D\n", "2013-01-01 -0.7594473 1.1939995 1.0831108 0.3719371\n", "2013-01-02 0.9382271 -0.7338731 0.5363171 1.4589336\n", "2013-01-03 -0.2829617 -0.9869213 -0.7907918 -3.3821940\n", "2013-01-04 0.1283517 0.1085160 0.5731905 0.3059759\n", "2013-01-05 -0.8193987 -0.5878911 -0.7304131 2.0117699\n", "2013-01-06 -0.1047669 -1.6513622 -0.7548627 0.1238924" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dates = pd.date_range('20130101',periods=6)\n", "df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))\n", "df" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 0.1283517\n", "B 0.1085160\n", "C 0.5731905\n", "D 0.3059759\n", "Name: 2013-01-04 00:00:00, dtype: float64" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[3] # returns 4th row (index=3) of the data frame" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
2013-01-040.12835170.1085160
2013-01-05-0.8193987-0.5878911
\n", "
" ], "text/plain": [ " A B\n", "2013-01-04 0.1283517 0.1085160\n", "2013-01-05 -0.8193987 -0.5878911" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[3:5,0:2] # returns specific range of rows and columns of the data frame" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
2013-01-020.9382271-0.73387310.53631711.4589336
2013-01-03-0.2829617-0.9869213-0.7907918-3.3821940
\n", "
" ], "text/plain": [ " A B C D\n", "2013-01-02 0.9382271 -0.7338731 0.5363171 1.4589336\n", "2013-01-03 -0.2829617 -0.9869213 -0.7907918 -3.3821940" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[1:3,:] # returning specific rows and returning all columns" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BC
2013-01-011.19399951.0831108
2013-01-02-0.73387310.5363171
2013-01-03-0.9869213-0.7907918
2013-01-040.10851600.5731905
2013-01-05-0.5878911-0.7304131
2013-01-06-1.6513622-0.7548627
\n", "
" ], "text/plain": [ " B C\n", "2013-01-01 1.1939995 1.0831108\n", "2013-01-02 -0.7338731 0.5363171\n", "2013-01-03 -0.9869213 -0.7907918\n", "2013-01-04 0.1085160 0.5731905\n", "2013-01-05 -0.5878911 -0.7304131\n", "2013-01-06 -1.6513622 -0.7548627" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[:,1:3] # returning all rows and specific columns" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-0.7338730678633689" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[1,1] # getting secific scalar/single value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Boolean Indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[view df](#df)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The boolean operators are: **|** for or, & for and, and **~** for not. These **must** be grouped by using parentheses." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**By the way, there are 2 ways to access a dataframe column. Using bracket syntax (df[\"column_name\"]) or dot notation syntax (df.column_name). It is recommended that you use bracket syntax since you could in theory create a column name that happens to be an attribute of the dataframe. Another reason to use bracket syntax is if your column name has a space in it. Then in this case, the dot notation syntax would not work.**" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
2013-01-020.9382271-0.73387310.53631711.4589336
2013-01-03-0.2829617-0.9869213-0.7907918-3.3821940
2013-01-040.12835170.10851600.57319050.3059759
2013-01-06-0.1047669-1.6513622-0.75486270.1238924
\n", "
" ], "text/plain": [ " A B C D\n", "2013-01-02 0.9382271 -0.7338731 0.5363171 1.4589336\n", "2013-01-03 -0.2829617 -0.9869213 -0.7907918 -3.3821940\n", "2013-01-04 0.1283517 0.1085160 0.5731905 0.3059759\n", "2013-01-06 -0.1047669 -1.6513622 -0.7548627 0.1238924" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.A > -0.5] # or df[df[\"A\"] > -0.5], this syntax works when there is a space in the column name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or I've seen some people do boolean filtering by passing \"criteria\" variable to the data frame like so:" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
2013-01-020.9382271-0.73387310.53631711.4589336
2013-01-03-0.2829617-0.9869213-0.7907918-3.3821940
2013-01-040.12835170.10851600.57319050.3059759
2013-01-06-0.1047669-1.6513622-0.75486270.1238924
\n", "
" ], "text/plain": [ " A B C D\n", "2013-01-02 0.9382271 -0.7338731 0.5363171 1.4589336\n", "2013-01-03 -0.2829617 -0.9869213 -0.7907918 -3.3821940\n", "2013-01-04 0.1283517 0.1085160 0.5731905 0.3059759\n", "2013-01-06 -0.1047669 -1.6513622 -0.7548627 0.1238924" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "criteria = df['A'] > -0.5\n", "df[criteria]" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCDE
2013-01-01-0.75944731.19399951.08311080.3719371one
2013-01-020.9382271-0.73387310.53631711.4589336one
2013-01-03-0.2829617-0.9869213-0.7907918-3.3821940two
2013-01-040.12835170.10851600.57319050.3059759three
2013-01-05-0.8193987-0.5878911-0.73041312.0117699four
2013-01-06-0.1047669-1.6513622-0.75486270.1238924three
\n", "
" ], "text/plain": [ " A B C D E\n", "2013-01-01 -0.7594473 1.1939995 1.0831108 0.3719371 one\n", "2013-01-02 0.9382271 -0.7338731 0.5363171 1.4589336 one\n", "2013-01-03 -0.2829617 -0.9869213 -0.7907918 -3.3821940 two\n", "2013-01-04 0.1283517 0.1085160 0.5731905 0.3059759 three\n", "2013-01-05 -0.8193987 -0.5878911 -0.7304131 2.0117699 four\n", "2013-01-06 -0.1047669 -1.6513622 -0.7548627 0.1238924 three" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = df.copy()\n", "df2['E']=['one', 'one','two','three','four','three']\n", "df2" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCDE
2013-01-03-0.2829617-0.9869213-0.7907918-3.3821940two
2013-01-05-0.8193987-0.5878911-0.73041312.0117699four
\n", "
" ], "text/plain": [ " A B C D E\n", "2013-01-03 -0.2829617 -0.9869213 -0.7907918 -3.3821940 two\n", "2013-01-05 -0.8193987 -0.5878911 -0.7304131 2.0117699 four" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2[df2['E'].isin(['two','four'])] # read as \"return rows where column E contains two or four\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### We can also do partial string matching. So let's say you don't know the exact spelling a word is you want to match, you can do this:" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCDE
2013-01-03-0.2829617-0.9869213-0.7907918-3.3821940two
2013-01-05-0.8193987-0.5878911-0.73041312.0117699four
\n", "
" ], "text/plain": [ " A B C D E\n", "2013-01-03 -0.2829617 -0.9869213 -0.7907918 -3.3821940 two\n", "2013-01-05 -0.8193987 -0.5878911 -0.7304131 2.0117699 four" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2[df2.E.str.contains(\"tw|ou\")]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using ~ to perform a \"NOT\" filtering" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCDE
2013-01-01-0.75944731.19399951.08311080.3719371one
2013-01-020.9382271-0.73387310.53631711.4589336one
2013-01-040.12835170.10851600.57319050.3059759three
2013-01-06-0.1047669-1.6513622-0.75486270.1238924three
\n", "
" ], "text/plain": [ " A B C D E\n", "2013-01-01 -0.7594473 1.1939995 1.0831108 0.3719371 one\n", "2013-01-02 0.9382271 -0.7338731 0.5363171 1.4589336 one\n", "2013-01-04 0.1283517 0.1085160 0.5731905 0.3059759 three\n", "2013-01-06 -0.1047669 -1.6513622 -0.7548627 0.1238924 three" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2[~df2['E'].isin(['two','four'])] # column E containing values not in two or four" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Filtering using ```query()``` method. With version 0.25, we can use ```query()``` / ```eval()``` with column names with spaces using back ticks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is equivalent to using SQL's `WHERE` clause" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
2013-01-01-0.02382120.1475556-1.02458121.6811354
2013-01-020.0652982-0.02319360.5774753-0.6419250
2013-01-03-0.95311441.0518699-0.0242550-0.3607456
2013-01-041.6691680-1.1965801-0.25353122.0905688
2013-01-050.65645642.34301371.80940840.6435664
2013-01-06-0.45178360.7971626-0.68576411.6924577
\n", "
" ], "text/plain": [ " A B C D\n", "2013-01-01 -0.0238212 0.1475556 -1.0245812 1.6811354\n", "2013-01-02 0.0652982 -0.0231936 0.5774753 -0.6419250\n", "2013-01-03 -0.9531144 1.0518699 -0.0242550 -0.3607456\n", "2013-01-04 1.6691680 -1.1965801 -0.2535312 2.0905688\n", "2013-01-05 0.6564564 2.3430137 1.8094084 0.6435664\n", "2013-01-06 -0.4517836 0.7971626 -0.6857641 1.6924577" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "dates = pd.date_range('20130101',periods=6)\n", "df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))\n", "df" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
2013-01-01-0.02382120.1475556-1.02458121.6811354
2013-01-041.6691680-1.1965801-0.25353122.0905688
2013-01-06-0.45178360.7971626-0.68576411.6924577
\n", "
" ], "text/plain": [ " A B C D\n", "2013-01-01 -0.0238212 0.1475556 -1.0245812 1.6811354\n", "2013-01-04 1.6691680 -1.1965801 -0.2535312 2.0905688\n", "2013-01-06 -0.4517836 0.7971626 -0.6857641 1.6924577" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query(\"A > C\")" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
2013-01-020.0652982-0.02319360.5774753-0.6419250
2013-01-041.6691680-1.1965801-0.25353122.0905688
2013-01-050.65645642.34301371.80940840.6435664
\n", "
" ], "text/plain": [ " A B C D\n", "2013-01-02 0.0652982 -0.0231936 0.5774753 -0.6419250\n", "2013-01-04 1.6691680 -1.1965801 -0.2535312 2.0905688\n", "2013-01-05 0.6564564 2.3430137 1.8094084 0.6435664" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query(\"A > 0\")" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
2013-01-020.0652982-0.02319360.5774753-0.6419250
2013-01-050.65645642.34301371.80940840.6435664
\n", "
" ], "text/plain": [ " A B C D\n", "2013-01-02 0.0652982 -0.0231936 0.5774753 -0.6419250\n", "2013-01-05 0.6564564 2.3430137 1.8094084 0.6435664" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query(\"A > 0 & A < 1\")" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
2013-01-01-0.02382120.1475556-1.02458121.6811354
2013-01-020.0652982-0.02319360.5774753-0.6419250
2013-01-041.6691680-1.1965801-0.25353122.0905688
2013-01-06-0.45178360.7971626-0.68576411.6924577
\n", "
" ], "text/plain": [ " A B C D\n", "2013-01-01 -0.0238212 0.1475556 -1.0245812 1.6811354\n", "2013-01-02 0.0652982 -0.0231936 0.5774753 -0.6419250\n", "2013-01-04 1.6691680 -1.1965801 -0.2535312 2.0905688\n", "2013-01-06 -0.4517836 0.7971626 -0.6857641 1.6924577" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query(\"A > B | A > C\") # where A is greater than B or A is greater than C" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accessors and their methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top]](#top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You've already seen an example usage of the ```.str.``` accessor above. There are 2 other accessors that help you perform date, and categorical functions." ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'cat', 'dt', 'sparse', 'str'}" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series._accessors" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['__annotations__',\n", " '__class__',\n", " '__delattr__',\n", " '__dict__',\n", " '__dir__',\n", " '__doc__',\n", " '__eq__',\n", " '__format__',\n", " '__ge__',\n", " '__getattribute__',\n", " '__gt__',\n", " '__hash__',\n", " '__init__',\n", " '__init_subclass__',\n", " '__le__',\n", " '__lt__',\n", " '__module__',\n", " '__ne__',\n", " '__new__',\n", " '__reduce__',\n", " '__reduce_ex__',\n", " '__repr__',\n", " '__setattr__',\n", " '__sizeof__',\n", " '__str__',\n", " '__subclasshook__',\n", " '__weakref__',\n", " '_accessors',\n", " '_add_delegate_accessors',\n", " '_constructor',\n", " '_delegate_method',\n", " '_delegate_property_get',\n", " '_delegate_property_set',\n", " '_deprecations',\n", " '_dir_additions',\n", " '_dir_deletions',\n", " '_freeze',\n", " '_get_values',\n", " '_reset_cache',\n", " 'asfreq',\n", " 'ceil',\n", " 'components',\n", " 'date',\n", " 'day',\n", " 'day_name',\n", " 'dayofweek',\n", " 'dayofyear',\n", " 'days',\n", " 'days_in_month',\n", " 'daysinmonth',\n", " 'end_time',\n", " 'floor',\n", " 'freq',\n", " 'hour',\n", " 'is_leap_year',\n", " 'is_month_end',\n", " 'is_month_start',\n", " 'is_quarter_end',\n", " 'is_quarter_start',\n", " 'is_year_end',\n", " 'is_year_start',\n", " 'isocalendar',\n", " 'microsecond',\n", " 'microseconds',\n", " 'minute',\n", " 'month',\n", " 'month_name',\n", " 'nanosecond',\n", " 'nanoseconds',\n", " 'normalize',\n", " 'quarter',\n", " 'qyear',\n", " 'round',\n", " 'second',\n", " 'seconds',\n", " 'start_time',\n", " 'strftime',\n", " 'time',\n", " 'timetz',\n", " 'to_period',\n", " 'to_pydatetime',\n", " 'to_pytimedelta',\n", " 'to_timestamp',\n", " 'total_seconds',\n", " 'tz',\n", " 'tz_convert',\n", " 'tz_localize',\n", " 'week',\n", " 'weekday',\n", " 'weekofyear',\n", " 'year']" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dir(pd.Series.dt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the `dir()` output above, we see that there are some useful date-related attributes that we can leverage" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [], "source": [ "daterng = 
pd.Series(pd.date_range('2017', periods=9, freq='Q'))" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2017-03-31\n", "1 2017-06-30\n", "2 2017-09-30\n", "3 2017-12-31\n", "4 2018-03-31\n", "5 2018-06-30\n", "6 2018-09-30\n", "7 2018-12-31\n", "8 2019-03-31\n", "dtype: datetime64[ns]" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daterng" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Friday\n", "1 Friday\n", "2 Saturday\n", "3 Sunday\n", "4 Saturday\n", "5 Saturday\n", "6 Sunday\n", "7 Monday\n", "8 Sunday\n", "dtype: object" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daterng.dt.day_name()" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3 2017-12-31\n", "7 2018-12-31\n", "dtype: datetime64[ns]" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daterng[daterng.dt.is_year_end]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Obtaining columns with partial column labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
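{ "cell_type": "markdown", "metadata": {}, "source": [ "Before the next example, one more accessor from the section above: a minimal ```.cat``` sketch (the series ```s``` is invented for illustration):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s = pd.Series(['a', 'b', 'a'], dtype='category')\n", "s.cat.categories # the distinct categories\n", "s.cat.codes # the integer codes backing each value" ] },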
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
key1key2data1data2
0aone-0.5009150-1.1067497
1atwo-0.45328280.3262502
2bone-0.24582760.2946161
3btwo-0.3467547-0.6925296
4aone0.1528616-0.0112956
\n", "
" ], "text/plain": [ " key1 key2 data1 data2\n", "0 a one -0.5009150 -1.1067497\n", "1 a two -0.4532828 0.3262502\n", "2 b one -0.2458276 0.2946161\n", "3 b two -0.3467547 -0.6925296\n", "4 a one 0.1528616 -0.0112956" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],\n", " 'key2' : ['one', 'two', 'one', 'two', 'one'],\n", " 'data1' : np.random.randn(5),\n", " 'data2' : np.random.randn(5)})\n", "df" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1data2
0-0.5009150-1.1067497
1-0.45328280.3262502
2-0.24582760.2946161
3-0.3467547-0.6925296
40.1528616-0.0112956
\n", "
" ], "text/plain": [ " data1 data2\n", "0 -0.5009150 -1.1067497\n", "1 -0.4532828 0.3262502\n", "2 -0.2458276 0.2946161\n", "3 -0.3467547 -0.6925296\n", "4 0.1528616 -0.0112956" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.filter(like='data')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting Value Counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is so tempting to use the groupby() function or pivot_table, but most of the time, value_counts() function is all we need." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupounces
0a4.0
1a3.0
2a12.0
3b6.0
4b7.5
5b8.0
6c3.0
7c5.0
8c6.0
\n", "
" ], "text/plain": [ " group ounces\n", "0 a 4.0\n", "1 a 3.0\n", "2 a 12.0\n", "3 b 6.0\n", "4 b 7.5\n", "5 b 8.0\n", "6 c 3.0\n", "7 c 5.0\n", "8 c 6.0" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "data = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],\n", " 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})\n", "data" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 3\n", "b 3\n", "c 3\n", "Name: group, dtype: int64" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['group'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**There is also a handy ```normalize``` option:**" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 0.3333333\n", "b 0.3333333\n", "c 0.3333333\n", "Name: group, dtype: float64" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['group'].value_counts(normalize=True) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting Cumulative Sum" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 3\n", "b 6\n", "c 9\n", "Name: group, dtype: int64" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['group'].value_counts().cumsum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grouping Data ( see also [value_count()](#value_counts) and [pivot_table()](#pivot_table) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### New in version 0.20 - agg() function. Now aggregating is less verbose!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Now for some simple aggregations:" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
2013-01-010.95961461.99727820.60171561.4154062
2013-01-02-1.4137538-1.3745631-0.32153560.7472984
2013-01-030.76324991.67429771.30213260.1888954
2013-01-040.8005111-0.14627940.34365990.1310278
2013-01-050.90663430.5098032-0.1090932-0.6649942
2013-01-06-0.5543259-0.75123751.7489706-1.6296998
2013-01-070.29703821.26248041.01159710.2765671
2013-01-08-0.98453970.85089580.54002080.6634031
2013-01-09-0.11880550.4146147-0.85129620.1770277
2013-01-10-1.36457580.0008273-0.81914290.3820062
\n", "
" ], "text/plain": [ " A B C D\n", "2013-01-01 0.9596146 1.9972782 0.6017156 1.4154062\n", "2013-01-02 -1.4137538 -1.3745631 -0.3215356 0.7472984\n", "2013-01-03 0.7632499 1.6742977 1.3021326 0.1888954\n", "2013-01-04 0.8005111 -0.1462794 0.3436599 0.1310278\n", "2013-01-05 0.9066343 0.5098032 -0.1090932 -0.6649942\n", "2013-01-06 -0.5543259 -0.7512375 1.7489706 -1.6296998\n", "2013-01-07 0.2970382 1.2624804 1.0115971 0.2765671\n", "2013-01-08 -0.9845397 0.8508958 0.5400208 0.6634031\n", "2013-01-09 -0.1188055 0.4146147 -0.8512962 0.1770277\n", "2013-01-10 -1.3645758 0.0008273 -0.8191429 0.3820062" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "dates = pd.date_range('20130101',periods=10)\n", "df = pd.DataFrame(np.random.randn(10,4),index=dates,columns=list('ABCD'))\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Sum for each column:**" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A -0.7089525\n", "B 4.4381173\n", "C 3.4470286\n", "D 1.6869378\n", "dtype: float64" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.agg('sum')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Sum and min of each column:**" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
sum-0.70895254.43811733.44702861.6869378
min-1.4137538-1.3745631-0.8512962-1.6296998
\n", "
" ], "text/plain": [ " A B C D\n", "sum -0.7089525 4.4381173 3.4470286 1.6869378\n", "min -1.4137538 -1.3745631 -0.8512962 -1.6296998" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.agg(['sum', 'min'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Sum of just one or more columns:**" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-0.7089524679704087" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['A'].agg('sum')" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A -0.7089525\n", "C 3.4470286\n", "dtype: float64" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['A','C']].agg('sum')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Now for some groupby() examples:" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
key1key2data1data2
0aone0.1564853-0.6441757
1atwo1.34460950.2890500
2bone0.4096645-1.2109906
3btwo-0.18759621.3129774
4aone-0.77949650.4034388
\n", "
" ], "text/plain": [ " key1 key2 data1 data2\n", "0 a one 0.1564853 -0.6441757\n", "1 a two 1.3446095 0.2890500\n", "2 b one 0.4096645 -1.2109906\n", "3 b two -0.1875962 1.3129774\n", "4 a one -0.7794965 0.4034388" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],\n", " 'key2' : ['one', 'two', 'one', 'two', 'one'],\n", " 'data1' : np.random.randn(5),\n", " 'data2' : np.random.randn(5)})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using a combination of ```groupby()``` and ```agg()```:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a lot of cases, instead of ```pivot_table()```, we can use the ```groupby()``` \\ ```agg()``` combination." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "General preferred syntax: **```df.groupby('grouping_column').agg({'aggregating_column': 'aggregating_functions'})```**" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
key1key2data1data2
0aone0.1564853-0.6441757
1atwo1.34460950.2890500
2bone0.4096645-1.2109906
3btwo-0.18759621.3129774
4aone-0.77949650.4034388
\n", "
" ], "text/plain": [ " key1 key2 data1 data2\n", "0 a one 0.1564853 -0.6441757\n", "1 a two 1.3446095 0.2890500\n", "2 b one 0.4096645 -1.2109906\n", "3 b two -0.1875962 1.3129774\n", "4 a one -0.7794965 0.4034388" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### NOTE: You can pass a list of aggregating functions:" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1
summax
key1
a0.72159831.3446095
b0.22206830.4096645
\n", "
" ], "text/plain": [ " data1 \n", " sum max\n", "key1 \n", "a 0.7215983 1.3446095\n", "b 0.2220683 0.4096645" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('key1').agg({'data1': ['sum','max']})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pivot table example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See also this excellent [article](http://pbpython.com/pandas-pivot-table-explained.html) on pivot tables by Chris Moffitt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupounces
0a4.0
1a3.0
2a12.0
3b6.0
4b7.5
5b8.0
6c3.0
7c5.0
8c6.0
\n", "
" ], "text/plain": [ " group ounces\n", "0 a 4.0\n", "1 a 3.0\n", "2 a 12.0\n", "3 b 6.0\n", "4 b 7.5\n", "5 b 8.0\n", "6 c 3.0\n", "7 c 5.0\n", "8 c 6.0" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "data = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],\n", " 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Calculating the means of each group" ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ounces
group
a6.3333333
b7.1666667
c4.6666667
\n", "
" ], "text/plain": [ " ounces\n", "group \n", "a 6.3333333\n", "b 7.1666667\n", "c 4.6666667" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.pivot_table(values='ounces',index='group',aggfunc=np.mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Getting counts by group" ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ounces
group
a3
b3
c3
\n", "
" ], "text/plain": [ " ounces\n", "group \n", "a 3\n", "b 3\n", "c 3" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.pivot_table(values='ounces',index='group',aggfunc='count')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Getting cumulative sum or running total of the group counts" ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ounces
group
a3
b6
c9
\n", "
" ], "text/plain": [ " ounces\n", "group \n", "a 3\n", "b 6\n", "c 9" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.pivot_table(values='ounces',index='group',aggfunc='count').cumsum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**You can pass a dictionary to the ```aggfunc=``` parameter to specify specific columns to have specific aggregate function applied to them** " ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupounces
group
a319.0
b321.5
c314.0
\n", "
" ], "text/plain": [ " group ounces\n", "group \n", "a 3 19.0\n", "b 3 21.5\n", "c 3 14.0" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.pivot_table(values='ounces', index='group', aggfunc={'group': len, 'ounces': np.sum})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Named aggregations (v0.25)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With named aggregations, we can create new columns based on aggregations on other columns." ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
kindheightweight
0cat9.17.9
1dog6.07.5
2cat9.59.9
3dog34.0198.0
\n", "
" ], "text/plain": [ " kind height weight\n", "0 cat 9.1 7.9\n", "1 dog 6.0 7.5\n", "2 cat 9.5 9.9\n", "3 dog 34.0 198.0" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],\n", " 'height': [9.1, 6.0, 9.5, 34.0],\n", " 'weight': [7.9, 7.5, 9.9, 198.0]})\n", "animals" ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
min_heightmax_heightaverage_weight
kind
cat9.19.58.90
dog6.034.0102.75
\n", "
" ], "text/plain": [ " min_height max_height average_weight\n", "kind \n", "cat 9.1 9.5 8.90\n", "dog 6.0 34.0 102.75" ] }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "animals.groupby(\"kind\").agg(\n", " min_height=pd.NamedAgg(column='height', aggfunc='min'),\n", " max_height=pd.NamedAgg(column='height', aggfunc='max'),\n", " average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is also a less verbose syntax using tuples instead of ```NamedAgg()``` function:" ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
min_heightmax_heightaverage_weight
kind
cat9.19.58.90
dog6.034.0102.75
\n", "
" ], "text/plain": [ " min_height max_height average_weight\n", "kind \n", "cat 9.1 9.5 8.90\n", "dog 6.0 34.0 102.75" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "animals.groupby(\"kind\").agg(\n", " min_height=('height', 'min'),\n", " max_height=('height', 'max'),\n", " average_weight=('weight', np.mean)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Percent of rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupouncessize
08020100
17030100
27525100
37525100
\n", "
" ], "text/plain": [ " group ounces size\n", "0 80 20 100\n", "1 70 30 100\n", "2 75 25 100\n", "3 75 25 100" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "data = pd.DataFrame({'group': [80, 70, 75, 75],\n", " 'ounces': [20, 30, 25, 25],\n", " 'size': [100, 100, 100, 100]}\n", " )\n", "data" ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupouncessize
040.010.050.0
135.015.050.0
237.512.550.0
337.512.550.0
\n", "
" ], "text/plain": [ " group ounces size\n", "0 40.0 10.0 50.0\n", "1 35.0 15.0 50.0\n", "2 37.5 12.5 50.0\n", "3 37.5 12.5 50.0" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "perc_of_rows = data.apply(lambda x : x / x.sum() * 100, axis='columns') # or axis=1\n", "perc_of_rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Percent of columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupouncessize
026.666666720.025.0
123.333333330.025.0
225.000000025.025.0
325.000000025.025.0
\n", "
" ], "text/plain": [ " group ounces size\n", "0 26.6666667 20.0 25.0\n", "1 23.3333333 30.0 25.0\n", "2 25.0000000 25.0 25.0\n", "3 25.0000000 25.0 25.0" ] }, "execution_count": 126, "metadata": {}, "output_type": "execute_result" } ], "source": [ "perc_of_columns = data.apply(lambda x : x / x.sum() * 100, axis='index') # or axis=0\n", "perc_of_columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**or percent of a specific column:**" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupouncessizePerc_of_Group_Column
0802010026.6666667
1703010023.3333333
2752510025.0000000
3752510025.0000000
\n", "
" ], "text/plain": [ " group ounces size Perc_of_Group_Column\n", "0 80 20 100 26.6666667\n", "1 70 30 100 23.3333333\n", "2 75 25 100 25.0000000\n", "3 75 25 100 25.0000000" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['Perc_of_Group_Column'] = data['group'] / data['group'].sum() * 100\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transpose a data frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say you have a data frame with several columns and don't want to horizontally scroll to see the columns, but want to vertically scroll instead. Then that case, transpose the data frame." ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
key1key2data1data2
0aone1.25707381.2159738
1atwo0.6660120-1.4213658
2bone0.6255760-0.0309261
3btwo0.13502551.4271483
4aone1.4709446-0.3034248
\n", "
" ], "text/plain": [ " key1 key2 data1 data2\n", "0 a one 1.2570738 1.2159738\n", "1 a two 0.6660120 -1.4213658\n", "2 b one 0.6255760 -0.0309261\n", "3 b two 0.1350255 1.4271483\n", "4 a one 1.4709446 -0.3034248" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],\n", " 'key2' : ['one', 'two', 'one', 'two', 'one'],\n", " 'data1' : np.random.randn(5),\n", " 'data2' : np.random.randn(5)})\n", "df" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
key1a
key2one
data11.257074
data21.215974
\n", "
" ], "text/plain": [ " 0\n", "key1 a\n", "key2 one\n", "data1 1.257074\n", "data2 1.215974" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(1).transpose()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting a data frame index to a column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupounces
0a4.0
1a3.0
2a12.0
3b6.0
4b7.5
5b8.0
6c3.0
7c5.0
8c6.0
\n", "
" ], "text/plain": [ " group ounces\n", "0 a 4.0\n", "1 a 3.0\n", "2 a 12.0\n", "3 b 6.0\n", "4 b 7.5\n", "5 b 8.0\n", "6 c 3.0\n", "7 c 5.0\n", "8 c 6.0" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "data = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],\n", " 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})\n", "data" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
indexgroupounces
00a4.0
11a3.0
22a12.0
33b6.0
44b7.5
55b8.0
66c3.0
77c5.0
88c6.0
\n", "
" ], "text/plain": [ " index group ounces\n", "0 0 a 4.0\n", "1 1 a 3.0\n", "2 2 a 12.0\n", "3 3 b 6.0\n", "4 4 b 7.5\n", "5 5 b 8.0\n", "6 6 c 3.0\n", "7 7 c 5.0\n", "8 8 c 6.0" ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.reset_index(level=0, inplace=True)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to set one of the dataframe's columns to be the index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top]](#top)" ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
0123
1101112
2202122
\n", "
" ], "text/plain": [ " a b c\n", "0 1 2 3\n", "1 10 11 12\n", "2 20 21 22" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "col = ['a','b','c']\n", "data = pd.DataFrame([[1,2,3],[10,11,12],[20,21,22]],columns=col)\n", "data" ] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bc
a
123
101112
202122
\n", "
" ], "text/plain": [ " b c\n", "a \n", "1 2 3\n", "10 11 12\n", "20 21 22" ] }, "execution_count": 134, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = data.set_index('a')\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To remove the index name (\"a\"), do:" ] }, { "cell_type": "code", "execution_count": 135, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bc
123
101112
202122
\n", "
" ], "text/plain": [ " b c\n", "1 2 3\n", "10 11 12\n", "20 21 22" ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.index.name = None\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to add or fill in missing dates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top]](#top)" ] }, { "cell_type": "code", "execution_count": 136, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "09-02-2013 2\n", "09-03-2013 10\n", "09-06-2013 5\n", "09-07-2013 1\n", "dtype: int64" ] }, "execution_count": 136, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "idx = pd.date_range('09-01-2013', '09-30-2013')\n", "\n", "s = pd.Series({'09-02-2013': 2,\n", " '09-03-2013': 10,\n", " '09-06-2013': 5,\n", " '09-07-2013': 1})\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### From above, we have holes in our data. Let's fill in those missing holes using the idx date_range we specified above using `reindex` function." ] }, { "cell_type": "code", "execution_count": 137, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2013-09-01 0\n", "2013-09-02 2\n", "2013-09-03 10\n", "2013-09-04 0\n", "2013-09-05 0\n", "2013-09-06 5\n", "2013-09-07 1\n", "2013-09-08 0\n", "2013-09-09 0\n", "2013-09-10 0\n", "2013-09-11 0\n", "2013-09-12 0\n", "2013-09-13 0\n", "2013-09-14 0\n", "2013-09-15 0\n", "2013-09-16 0\n", "2013-09-17 0\n", "2013-09-18 0\n", "2013-09-19 0\n", "2013-09-20 0\n", "2013-09-21 0\n", "2013-09-22 0\n", "2013-09-23 0\n", "2013-09-24 0\n", "2013-09-25 0\n", "2013-09-26 0\n", "2013-09-27 0\n", "2013-09-28 0\n", "2013-09-29 0\n", "2013-09-30 0\n", "Freq: D, dtype: int64" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# First need to convert the dataframe index type as type DateTime\n", "s.index = pd.DatetimeIndex(s.index)\n", "\n", "s = s.reindex(idx, fill_value=0)\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to connect and query against a database registered as an ODBC data source (Windows)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top]](#top)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pyodbc\n", "import pandas as pd\n", "from getpass import getpass # Module that will create a text input widget AND mask your password\n", "\n", "userid = 'your_userid'\n", "pw = getpass(prompt='Enter your password: ')\n", "\n", "cnxn_string = 'DSN=your_dsn;UID=' + userid + ';PWD=' + pw\n", "\n", "sql = \"\"\"\n", "SELECT * from your_table...\n", "\"\"\"\n", "\n", "with pyodbc.connect(cnxn_string) as conn:\n", " df = pd.read_sql(sql, conn, index_col=None, parse_dates={'some_column':\"%Y-%m-%d\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to convert data in wide format to long format using pd.melt()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Certain charting libaries like expect the data to be in long format. Let's get some data in wide format." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "from pandas_datareader import data as pdr\n", "from datetime import datetime\n", "import pandas as pd\n", "\n", "stocks = {'tech':['GOOGL', 'MSFT', 'LNKD', 'YHOO', 'FB','HPQ','AMZN'],\n", " 'auto':['TM','F','GM','HMC','NSANY','HYMTF'],\n", " 'housing':['HD','WMT','LOW']\n", " }\n", "\n", "start_date = datetime(2016,1,1)\n", "end_date = datetime(2016, 4, 21)\n", "\n", "def get_px(stock, start, end):\n", " return web.get_data_yahoo(stock, start, end)['Adj Close']\n", "\n", "df = pd.DataFrame({n: get_px(n, start_date, end_date) for n in stocks['tech']})" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateAMZNFBGOOGLHPQLNKDMSFTYHOO
01/4/2016636.989990102.220001759.44000211.471342225.55000354.40934631.400000
11/5/2016633.789978102.730003761.53002911.580122229.00000054.65756332.200001
21/6/2016632.650024102.970001759.33001711.164780226.08000253.66469232.160000
31/7/2016607.94000297.919998741.00000010.650548217.00000051.79809330.160000
41/8/2016607.04998897.330002730.90997310.462655215.89999451.95695630.629999
51/11/2016617.73999097.510002733.07000710.531878210.92999351.92716730.170000
61/12/2016617.89001599.370003745.34002710.699993210.42999352.40374530.690001
71/13/2016581.80999895.440002719.57000710.462655203.13000551.27187329.440001
81/14/2016593.00000098.370003731.39001510.502211203.21000752.73139530.320000
91/15/2016570.17999394.970001710.4899909.997867196.32000750.62650829.139999
\n", "
" ], "text/plain": [ " Date AMZN FB GOOGL HPQ LNKD \\\n", "0 1/4/2016 636.989990 102.220001 759.440002 11.471342 225.550003 \n", "1 1/5/2016 633.789978 102.730003 761.530029 11.580122 229.000000 \n", "2 1/6/2016 632.650024 102.970001 759.330017 11.164780 226.080002 \n", "3 1/7/2016 607.940002 97.919998 741.000000 10.650548 217.000000 \n", "4 1/8/2016 607.049988 97.330002 730.909973 10.462655 215.899994 \n", "5 1/11/2016 617.739990 97.510002 733.070007 10.531878 210.929993 \n", "6 1/12/2016 617.890015 99.370003 745.340027 10.699993 210.429993 \n", "7 1/13/2016 581.809998 95.440002 719.570007 10.462655 203.130005 \n", "8 1/14/2016 593.000000 98.370003 731.390015 10.502211 203.210007 \n", "9 1/15/2016 570.179993 94.970001 710.489990 9.997867 196.320007 \n", "\n", " MSFT YHOO \n", "0 54.409346 31.400000 \n", "1 54.657563 32.200001 \n", "2 53.664692 32.160000 \n", "3 51.798093 30.160000 \n", "4 51.956956 30.629999 \n", "5 51.927167 30.170000 \n", "6 52.403745 30.690001 \n", "7 51.271873 29.440001 \n", "8 52.731395 30.320000 \n", "9 50.626508 29.139999 " ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_clipboard()\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Below is what data looks like in 'wide' format:" ] }, { "cell_type": "code", "execution_count": 146, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateAMZNFBGOOGLHPQLNKDMSFTYHOO
01/4/2016636.989990102.220001759.44000211.471342225.55000354.40934631.400000
11/5/2016633.789978102.730003761.53002911.580122229.00000054.65756332.200001
21/6/2016632.650024102.970001759.33001711.164780226.08000253.66469232.160000
31/7/2016607.94000297.919998741.00000010.650548217.00000051.79809330.160000
41/8/2016607.04998897.330002730.90997310.462655215.89999451.95695630.629999
51/11/2016617.73999097.510002733.07000710.531878210.92999351.92716730.170000
61/12/2016617.89001599.370003745.34002710.699993210.42999352.40374530.690001
71/13/2016581.80999895.440002719.57000710.462655203.13000551.27187329.440001
81/14/2016593.00000098.370003731.39001510.502211203.21000752.73139530.320000
91/15/2016570.17999394.970001710.4899909.997867196.32000750.62650829.139999
\n", "
" ], "text/plain": [ " Date AMZN FB GOOGL HPQ LNKD \\\n", "0 1/4/2016 636.989990 102.220001 759.440002 11.471342 225.550003 \n", "1 1/5/2016 633.789978 102.730003 761.530029 11.580122 229.000000 \n", "2 1/6/2016 632.650024 102.970001 759.330017 11.164780 226.080002 \n", "3 1/7/2016 607.940002 97.919998 741.000000 10.650548 217.000000 \n", "4 1/8/2016 607.049988 97.330002 730.909973 10.462655 215.899994 \n", "5 1/11/2016 617.739990 97.510002 733.070007 10.531878 210.929993 \n", "6 1/12/2016 617.890015 99.370003 745.340027 10.699993 210.429993 \n", "7 1/13/2016 581.809998 95.440002 719.570007 10.462655 203.130005 \n", "8 1/14/2016 593.000000 98.370003 731.390015 10.502211 203.210007 \n", "9 1/15/2016 570.179993 94.970001 710.489990 9.997867 196.320007 \n", "\n", " MSFT YHOO \n", "0 54.409346 31.400000 \n", "1 54.657563 32.200001 \n", "2 53.664692 32.160000 \n", "3 51.798093 30.160000 \n", "4 51.956956 30.629999 \n", "5 51.927167 30.170000 \n", "6 52.403745 30.690001 \n", "7 51.271873 29.440001 \n", "8 52.731395 30.320000 \n", "9 50.626508 29.139999 " ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_clipboard()\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### This is what data looks like in 'long' format:" ] }, { "cell_type": "code", "execution_count": 147, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Datevariablevalue
01/4/2016AMZN636.989990
11/5/2016AMZN633.789978
21/6/2016AMZN632.650024
31/7/2016AMZN607.940002
41/8/2016AMZN607.049988
51/11/2016AMZN617.739990
61/12/2016AMZN617.890015
71/13/2016AMZN581.809998
81/14/2016AMZN593.000000
91/15/2016AMZN570.179993
101/4/2016FB102.220001
111/5/2016FB102.730003
121/6/2016FB102.970001
131/7/2016FB97.919998
141/8/2016FB97.330002
151/11/2016FB97.510002
161/12/2016FB99.370003
171/13/2016FB95.440002
181/14/2016FB98.370003
191/15/2016FB94.970001
201/4/2016GOOGL759.440002
211/5/2016GOOGL761.530029
221/6/2016GOOGL759.330017
231/7/2016GOOGL741.000000
241/8/2016GOOGL730.909973
251/11/2016GOOGL733.070007
261/12/2016GOOGL745.340027
271/13/2016GOOGL719.570007
281/14/2016GOOGL731.390015
291/15/2016GOOGL710.489990
301/4/2016HPQ11.471342
311/5/2016HPQ11.580122
321/6/2016HPQ11.164780
331/7/2016HPQ10.650548
341/8/2016HPQ10.462655
351/11/2016HPQ10.531878
361/12/2016HPQ10.699993
371/13/2016HPQ10.462655
381/14/2016HPQ10.502211
391/15/2016HPQ9.997867
401/4/2016LNKD225.550003
411/5/2016LNKD229.000000
421/6/2016LNKD226.080002
431/7/2016LNKD217.000000
441/8/2016LNKD215.899994
451/11/2016LNKD210.929993
461/12/2016LNKD210.429993
471/13/2016LNKD203.130005
481/14/2016LNKD203.210007
491/15/2016LNKD196.320007
501/4/2016MSFT54.409346
511/5/2016MSFT54.657563
521/6/2016MSFT53.664692
531/7/2016MSFT51.798093
541/8/2016MSFT51.956956
551/11/2016MSFT51.927167
561/12/2016MSFT52.403745
571/13/2016MSFT51.271873
581/14/2016MSFT52.731395
591/15/2016MSFT50.626508
601/4/2016YHOO31.400000
611/5/2016YHOO32.200001
621/6/2016YHOO32.160000
631/7/2016YHOO30.160000
641/8/2016YHOO30.629999
651/11/2016YHOO30.170000
661/12/2016YHOO30.690001
671/13/2016YHOO29.440001
681/14/2016YHOO30.320000
691/15/2016YHOO29.139999
\n", "
" ], "text/plain": [ " Date variable value\n", "0 1/4/2016 AMZN 636.989990\n", "1 1/5/2016 AMZN 633.789978\n", "2 1/6/2016 AMZN 632.650024\n", "3 1/7/2016 AMZN 607.940002\n", "4 1/8/2016 AMZN 607.049988\n", "5 1/11/2016 AMZN 617.739990\n", "6 1/12/2016 AMZN 617.890015\n", "7 1/13/2016 AMZN 581.809998\n", "8 1/14/2016 AMZN 593.000000\n", "9 1/15/2016 AMZN 570.179993\n", "10 1/4/2016 FB 102.220001\n", "11 1/5/2016 FB 102.730003\n", "12 1/6/2016 FB 102.970001\n", "13 1/7/2016 FB 97.919998\n", "14 1/8/2016 FB 97.330002\n", "15 1/11/2016 FB 97.510002\n", "16 1/12/2016 FB 99.370003\n", "17 1/13/2016 FB 95.440002\n", "18 1/14/2016 FB 98.370003\n", "19 1/15/2016 FB 94.970001\n", "20 1/4/2016 GOOGL 759.440002\n", "21 1/5/2016 GOOGL 761.530029\n", "22 1/6/2016 GOOGL 759.330017\n", "23 1/7/2016 GOOGL 741.000000\n", "24 1/8/2016 GOOGL 730.909973\n", "25 1/11/2016 GOOGL 733.070007\n", "26 1/12/2016 GOOGL 745.340027\n", "27 1/13/2016 GOOGL 719.570007\n", "28 1/14/2016 GOOGL 731.390015\n", "29 1/15/2016 GOOGL 710.489990\n", "30 1/4/2016 HPQ 11.471342\n", "31 1/5/2016 HPQ 11.580122\n", "32 1/6/2016 HPQ 11.164780\n", "33 1/7/2016 HPQ 10.650548\n", "34 1/8/2016 HPQ 10.462655\n", "35 1/11/2016 HPQ 10.531878\n", "36 1/12/2016 HPQ 10.699993\n", "37 1/13/2016 HPQ 10.462655\n", "38 1/14/2016 HPQ 10.502211\n", "39 1/15/2016 HPQ 9.997867\n", "40 1/4/2016 LNKD 225.550003\n", "41 1/5/2016 LNKD 229.000000\n", "42 1/6/2016 LNKD 226.080002\n", "43 1/7/2016 LNKD 217.000000\n", "44 1/8/2016 LNKD 215.899994\n", "45 1/11/2016 LNKD 210.929993\n", "46 1/12/2016 LNKD 210.429993\n", "47 1/13/2016 LNKD 203.130005\n", "48 1/14/2016 LNKD 203.210007\n", "49 1/15/2016 LNKD 196.320007\n", "50 1/4/2016 MSFT 54.409346\n", "51 1/5/2016 MSFT 54.657563\n", "52 1/6/2016 MSFT 53.664692\n", "53 1/7/2016 MSFT 51.798093\n", "54 1/8/2016 MSFT 51.956956\n", "55 1/11/2016 MSFT 51.927167\n", "56 1/12/2016 MSFT 52.403745\n", "57 1/13/2016 MSFT 51.271873\n", "58 1/14/2016 MSFT 52.731395\n", "59 1/15/2016 MSFT 50.626508\n", "60 1/4/2016 YHOO 31.400000\n", "61 1/5/2016 YHOO 32.200001\n", "62 1/6/2016 YHOO 32.160000\n", "63 1/7/2016 YHOO 30.160000\n", "64 1/8/2016 YHOO 30.629999\n", "65 1/11/2016 YHOO 30.170000\n", "66 1/12/2016 YHOO 30.690001\n", "67 1/13/2016 YHOO 29.440001\n", "68 1/14/2016 YHOO 30.320000\n", "69 1/15/2016 YHOO 29.139999" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_long = pd.melt(df, id_vars=['Date']).dropna()\n", "df_long" ] }, { "cell_type": "code", "execution_count": 148, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateStockPrice
01/4/2016AMZN636.989990
11/5/2016AMZN633.789978
21/6/2016AMZN632.650024
31/7/2016AMZN607.940002
41/8/2016AMZN607.049988
51/11/2016AMZN617.739990
61/12/2016AMZN617.890015
71/13/2016AMZN581.809998
81/14/2016AMZN593.000000
91/15/2016AMZN570.179993
101/4/2016FB102.220001
111/5/2016FB102.730003
121/6/2016FB102.970001
131/7/2016FB97.919998
141/8/2016FB97.330002
151/11/2016FB97.510002
161/12/2016FB99.370003
171/13/2016FB95.440002
181/14/2016FB98.370003
191/15/2016FB94.970001
201/4/2016GOOGL759.440002
211/5/2016GOOGL761.530029
221/6/2016GOOGL759.330017
231/7/2016GOOGL741.000000
241/8/2016GOOGL730.909973
251/11/2016GOOGL733.070007
261/12/2016GOOGL745.340027
271/13/2016GOOGL719.570007
281/14/2016GOOGL731.390015
291/15/2016GOOGL710.489990
301/4/2016HPQ11.471342
311/5/2016HPQ11.580122
321/6/2016HPQ11.164780
331/7/2016HPQ10.650548
341/8/2016HPQ10.462655
351/11/2016HPQ10.531878
361/12/2016HPQ10.699993
371/13/2016HPQ10.462655
381/14/2016HPQ10.502211
391/15/2016HPQ9.997867
401/4/2016LNKD225.550003
411/5/2016LNKD229.000000
421/6/2016LNKD226.080002
431/7/2016LNKD217.000000
441/8/2016LNKD215.899994
451/11/2016LNKD210.929993
461/12/2016LNKD210.429993
471/13/2016LNKD203.130005
481/14/2016LNKD203.210007
491/15/2016LNKD196.320007
501/4/2016MSFT54.409346
511/5/2016MSFT54.657563
521/6/2016MSFT53.664692
531/7/2016MSFT51.798093
541/8/2016MSFT51.956956
551/11/2016MSFT51.927167
561/12/2016MSFT52.403745
571/13/2016MSFT51.271873
581/14/2016MSFT52.731395
591/15/2016MSFT50.626508
601/4/2016YHOO31.400000
611/5/2016YHOO32.200001
621/6/2016YHOO32.160000
631/7/2016YHOO30.160000
641/8/2016YHOO30.629999
651/11/2016YHOO30.170000
661/12/2016YHOO30.690001
671/13/2016YHOO29.440001
681/14/2016YHOO30.320000
691/15/2016YHOO29.139999
\n", "
" ], "text/plain": [ " Date Stock Price\n", "0 1/4/2016 AMZN 636.989990\n", "1 1/5/2016 AMZN 633.789978\n", "2 1/6/2016 AMZN 632.650024\n", "3 1/7/2016 AMZN 607.940002\n", "4 1/8/2016 AMZN 607.049988\n", "5 1/11/2016 AMZN 617.739990\n", "6 1/12/2016 AMZN 617.890015\n", "7 1/13/2016 AMZN 581.809998\n", "8 1/14/2016 AMZN 593.000000\n", "9 1/15/2016 AMZN 570.179993\n", "10 1/4/2016 FB 102.220001\n", "11 1/5/2016 FB 102.730003\n", "12 1/6/2016 FB 102.970001\n", "13 1/7/2016 FB 97.919998\n", "14 1/8/2016 FB 97.330002\n", "15 1/11/2016 FB 97.510002\n", "16 1/12/2016 FB 99.370003\n", "17 1/13/2016 FB 95.440002\n", "18 1/14/2016 FB 98.370003\n", "19 1/15/2016 FB 94.970001\n", "20 1/4/2016 GOOGL 759.440002\n", "21 1/5/2016 GOOGL 761.530029\n", "22 1/6/2016 GOOGL 759.330017\n", "23 1/7/2016 GOOGL 741.000000\n", "24 1/8/2016 GOOGL 730.909973\n", "25 1/11/2016 GOOGL 733.070007\n", "26 1/12/2016 GOOGL 745.340027\n", "27 1/13/2016 GOOGL 719.570007\n", "28 1/14/2016 GOOGL 731.390015\n", "29 1/15/2016 GOOGL 710.489990\n", "30 1/4/2016 HPQ 11.471342\n", "31 1/5/2016 HPQ 11.580122\n", "32 1/6/2016 HPQ 11.164780\n", "33 1/7/2016 HPQ 10.650548\n", "34 1/8/2016 HPQ 10.462655\n", "35 1/11/2016 HPQ 10.531878\n", "36 1/12/2016 HPQ 10.699993\n", "37 1/13/2016 HPQ 10.462655\n", "38 1/14/2016 HPQ 10.502211\n", "39 1/15/2016 HPQ 9.997867\n", "40 1/4/2016 LNKD 225.550003\n", "41 1/5/2016 LNKD 229.000000\n", "42 1/6/2016 LNKD 226.080002\n", "43 1/7/2016 LNKD 217.000000\n", "44 1/8/2016 LNKD 215.899994\n", "45 1/11/2016 LNKD 210.929993\n", "46 1/12/2016 LNKD 210.429993\n", "47 1/13/2016 LNKD 203.130005\n", "48 1/14/2016 LNKD 203.210007\n", "49 1/15/2016 LNKD 196.320007\n", "50 1/4/2016 MSFT 54.409346\n", "51 1/5/2016 MSFT 54.657563\n", "52 1/6/2016 MSFT 53.664692\n", "53 1/7/2016 MSFT 51.798093\n", "54 1/8/2016 MSFT 51.956956\n", "55 1/11/2016 MSFT 51.927167\n", "56 1/12/2016 MSFT 52.403745\n", "57 1/13/2016 MSFT 51.271873\n", "58 1/14/2016 MSFT 52.731395\n", "59 1/15/2016 MSFT 50.626508\n", "60 1/4/2016 YHOO 31.400000\n", "61 1/5/2016 YHOO 32.200001\n", "62 1/6/2016 YHOO 32.160000\n", "63 1/7/2016 YHOO 30.160000\n", "64 1/8/2016 YHOO 30.629999\n", "65 1/11/2016 YHOO 30.170000\n", "66 1/12/2016 YHOO 30.690001\n", "67 1/13/2016 YHOO 29.440001\n", "68 1/14/2016 YHOO 30.320000\n", "69 1/15/2016 YHOO 29.139999" ] }, "execution_count": 148, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_long.rename(columns={'variable': 'Stock', 'value':'Price'},inplace=True)\n", "df_long" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to convert data in long format to wide format using dataframe.pivot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 149, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StockAMZNFBGOOGLHPQLNKDMSFTYHOO
Date
1/11/2016617.73999097.510002733.07000710.531878210.92999351.92716730.170000
1/12/2016617.89001599.370003745.34002710.699993210.42999352.40374530.690001
1/13/2016581.80999895.440002719.57000710.462655203.13000551.27187329.440001
1/14/2016593.00000098.370003731.39001510.502211203.21000752.73139530.320000
1/15/2016570.17999394.970001710.4899909.997867196.32000750.62650829.139999
1/4/2016636.989990102.220001759.44000211.471342225.55000354.40934631.400000
1/5/2016633.789978102.730003761.53002911.580122229.00000054.65756332.200001
1/6/2016632.650024102.970001759.33001711.164780226.08000253.66469232.160000
1/7/2016607.94000297.919998741.00000010.650548217.00000051.79809330.160000
1/8/2016607.04998897.330002730.90997310.462655215.89999451.95695630.629999
\n", "
" ], "text/plain": [ "Stock AMZN FB GOOGL HPQ LNKD \\\n", "Date \n", "1/11/2016 617.739990 97.510002 733.070007 10.531878 210.929993 \n", "1/12/2016 617.890015 99.370003 745.340027 10.699993 210.429993 \n", "1/13/2016 581.809998 95.440002 719.570007 10.462655 203.130005 \n", "1/14/2016 593.000000 98.370003 731.390015 10.502211 203.210007 \n", "1/15/2016 570.179993 94.970001 710.489990 9.997867 196.320007 \n", "1/4/2016 636.989990 102.220001 759.440002 11.471342 225.550003 \n", "1/5/2016 633.789978 102.730003 761.530029 11.580122 229.000000 \n", "1/6/2016 632.650024 102.970001 759.330017 11.164780 226.080002 \n", "1/7/2016 607.940002 97.919998 741.000000 10.650548 217.000000 \n", "1/8/2016 607.049988 97.330002 730.909973 10.462655 215.899994 \n", "\n", "Stock MSFT YHOO \n", "Date \n", "1/11/2016 51.927167 30.170000 \n", "1/12/2016 52.403745 30.690001 \n", "1/13/2016 51.271873 29.440001 \n", "1/14/2016 52.731395 30.320000 \n", "1/15/2016 50.626508 29.139999 \n", "1/4/2016 54.409346 31.400000 \n", "1/5/2016 54.657563 32.200001 \n", "1/6/2016 53.664692 32.160000 \n", "1/7/2016 51.798093 30.160000 \n", "1/8/2016 51.956956 30.629999 " ] }, "execution_count": 149, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pivoted = df_long.pivot(index='Date', columns='Stock', values='Price')\n", "pivoted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using category data type to control sort order" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 150, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
medalounces
0bronze4.0
1silver3.0
2silver12.0
3gold6.0
4bronze7.5
5bronze8.0
6gold3.0
7gold5.0
8gold6.0
\n", "
" ], "text/plain": [ " medal ounces\n", "0 bronze 4.0\n", "1 silver 3.0\n", "2 silver 12.0\n", "3 gold 6.0\n", "4 bronze 7.5\n", "5 bronze 8.0\n", "6 gold 3.0\n", "7 gold 5.0\n", "8 gold 6.0" ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "data = pd.DataFrame({'medal': ['bronze', 'silver', 'silver', 'gold','bronze', 'bronze', 'gold', 'gold','gold'],\n", " 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})\n", "data" ] }, { "cell_type": "code", "execution_count": 151, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "medal object\n", "ounces float64\n", "dtype: object" ] }, "execution_count": 151, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dtypes" ] }, { "cell_type": "code", "execution_count": 152, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "medal category\n", "ounces float64\n", "dtype: object" ] }, "execution_count": 152, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[\"medal\"] = data[\"medal\"].astype(\"category\")\n", "data.dtypes" ] }, { "cell_type": "code", "execution_count": 153, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
medalounces
0bronze4.0
4bronze7.5
5bronze8.0
3gold6.0
6gold3.0
7gold5.0
8gold6.0
1silver3.0
2silver12.0
\n", "
" ], "text/plain": [ " medal ounces\n", "0 bronze 4.0\n", "4 bronze 7.5\n", "5 bronze 8.0\n", "3 gold 6.0\n", "6 gold 3.0\n", "7 gold 5.0\n", "8 gold 6.0\n", "1 silver 3.0\n", "2 silver 12.0" ] }, "execution_count": 153, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.sort_values(by=[\"medal\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Now define the order of the catgorical data using set_categories()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)] [[back to section](#category)]" ] }, { "cell_type": "code", "execution_count": 154, "metadata": {}, "outputs": [], "source": [ "data[\"medal\"].cat.set_categories([ \"gold\",\"silver\",\"bronze\"], inplace=True)" ] }, { "cell_type": "code", "execution_count": 155, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
medalounces
3gold6.0
6gold3.0
7gold5.0
8gold6.0
1silver3.0
2silver12.0
0bronze4.0
4bronze7.5
5bronze8.0
\n", "
" ], "text/plain": [ " medal ounces\n", "3 gold 6.0\n", "6 gold 3.0\n", "7 gold 5.0\n", "8 gold 6.0\n", "1 silver 3.0\n", "2 silver 12.0\n", "0 bronze 4.0\n", "4 bronze 7.5\n", "5 bronze 8.0" ] }, "execution_count": 155, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.sort_values(by='medal')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a column of type categorical that is also ordered" ] }, { "cell_type": "code", "execution_count": 156, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDquality
0100good
1101very good
2102good
3103excellent
\n", "
" ], "text/plain": [ " ID quality\n", "0 100 good\n", "1 101 very good\n", "2 102 good\n", "3 103 excellent" ] }, "execution_count": 156, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'ID': [100, 101, 102, 103],\n", " 'quality': ['good', 'very good', 'good', 'excellent']})\n", "df" ] }, { "cell_type": "code", "execution_count": 159, "metadata": {}, "outputs": [], "source": [ "cat_dtype = pd.api.types.CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True)" ] }, { "cell_type": "code", "execution_count": 161, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDquality
0100good
2102good
1101very good
3103excellent
\n", "
" ], "text/plain": [ " ID quality\n", "0 100 good\n", "2 102 good\n", "1 101 very good\n", "3 103 excellent" ] }, "execution_count": 161, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['quality'] = df.quality.astype(cat_dtype)\n", "df.sort_values(by='quality')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merging 2 data frames using merge()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 162, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupounces
0a4.0
1a3.0
2a12.0
3b6.0
4b7.5
5b8.0
6c3.0
7c5.0
8c6.0
\n", "
" ], "text/plain": [ " group ounces\n", "0 a 4.0\n", "1 a 3.0\n", "2 a 12.0\n", "3 b 6.0\n", "4 b 7.5\n", "5 b 8.0\n", "6 c 3.0\n", "7 c 5.0\n", "8 c 6.0" ] }, "execution_count": 162, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "left = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],\n", " 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})\n", "left" ] }, { "cell_type": "code", "execution_count": 163, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelvalue
0aalpha
1bbeta
2ccharlie
\n", "
" ], "text/plain": [ " label value\n", "0 a alpha\n", "1 b beta\n", "2 c charlie" ] }, "execution_count": 163, "metadata": {}, "output_type": "execute_result" } ], "source": [ "right = pd.DataFrame({'label': ['a','b','c'],\n", " 'value': ['alpha','beta','charlie']})\n", "right" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### By default, merge() does an inner join, but you can specify 'left', 'right', or 'outer' joins." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These types of joins are familiar if you have SQL background. Here's a good SQL joins [visualizer](http://sql-joins.leopard.in.ua/)." ] }, { "cell_type": "code", "execution_count": 164, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupounceslabelvalue
0a4.0aalpha
1a3.0aalpha
2a12.0aalpha
3b6.0bbeta
4b7.5bbeta
5b8.0bbeta
6c3.0ccharlie
7c5.0ccharlie
8c6.0ccharlie
\n", "
" ], "text/plain": [ " group ounces label value\n", "0 a 4.0 a alpha\n", "1 a 3.0 a alpha\n", "2 a 12.0 a alpha\n", "3 b 6.0 b beta\n", "4 b 7.5 b beta\n", "5 b 8.0 b beta\n", "6 c 3.0 c charlie\n", "7 c 5.0 c charlie\n", "8 c 6.0 c charlie" ] }, "execution_count": 164, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inner_joined = pd.merge(left, right, how='inner', left_on='group', right_on='label')\n", "inner_joined" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE:** To merge 2 Series, you have to use [pd.concat()](http://stackoverflow.com/questions/18062135/combining-two-series-into-a-dataframe-in-pandas) instead" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### For more on merge, join, and concatenate, see the official [docs](http://pandas.pydata.org/pandas-docs/stable/merging.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### YouTube [video](https://www.youtube.com/watch?v=8K8Bs7z6d2M&feature=youtu.be) on joining data sets using pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finding rows of a data frame containing missing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 165, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
001.02.0
10NaN1.0
211.0NaN
301.02.0
401.02.0
\n", "
" ], "text/plain": [ " A B C\n", "0 0 1.0 2.0\n", "1 0 NaN 1.0\n", "2 1 1.0 NaN\n", "3 0 1.0 2.0\n", "4 0 1.0 2.0" ] }, "execution_count": 165, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "df = pd.DataFrame([range(3), [0, np.NaN, 1], [1, 1, np.NaN], range(3), range(3)])\n", "df.columns = ['A','B','C']\n", "df" ] }, { "cell_type": "code", "execution_count": 166, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 True\n", "3 False\n", "4 False\n", "dtype: bool" ] }, "execution_count": 166, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isna().any(axis=1)" ] }, { "cell_type": "code", "execution_count": 167, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
10NaN1.0
211.0NaN
\n", "
" ], "text/plain": [ " A B C\n", "1 0 NaN 1.0\n", "2 1 1.0 NaN" ] }, "execution_count": 167, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.isna().any(axis=1)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### To filter a column for na/null values and also filter another column:" ] }, { "cell_type": "code", "execution_count": 168, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
10NaN1.0
\n", "
" ], "text/plain": [ " A B C\n", "1 0 NaN 1.0" ] }, "execution_count": 168, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[(df['C']==1) & (df['B'].isna())]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Converting a data type of a column in a data frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 169, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothree
0a1.24.2
1b700.03
2x50
\n", "
" ], "text/plain": [ " one two three\n", "0 a 1.2 4.2\n", "1 b 70 0.03\n", "2 x 5 0" ] }, "execution_count": 169, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]\n", "df = pd.DataFrame(a, columns=['one', 'two', 'three'])\n", "df" ] }, { "cell_type": "code", "execution_count": 170, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "one object\n", "two object\n", "three object\n", "dtype: object" ] }, "execution_count": 170, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "code", "execution_count": 171, "metadata": {}, "outputs": [], "source": [ "df[['two', 'three']] = df[['two', 'three']].astype(float)" ] }, { "cell_type": "code", "execution_count": 172, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "one object\n", "two float64\n", "three float64\n", "dtype: object" ] }, "execution_count": 172, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting data frames using MATPLOTLIB version 1.5 and up" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Starting with version 1.5, you can now pass a \"data\" parameter inside MATPLOTLIB's plot() function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Prior to version 1.5:**" ] }, { "cell_type": "code", "execution_count": 180, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWoAAAD4CAYAAADFAawfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAeZklEQVR4nO3deXhV9aH18e8vJAESSJgChCEkEMaQIBBAwBEHVBxA6rXWqpVabG9t9W2vEHBCRQW1VmudcLZqvZYEUBAHigJOKCiczBBCIGFKGDKQkPH83j/I29dalADnZJ9hfZ6Hh0AOJ+swrGdns/eKsdYiIiK+K8TpACIi8uNU1CIiPk5FLSLi41TUIiI+TkUtIuLjQr3xpN26dbPx8fHeeGoRkYC0cePG/dbamGO9zytFHR8fz4YNG7zx1CIiAckYs+OH3qdTHyIiPk5FLSLi41TUIiI+TkUtIuLjVNQiIj6uRUVtjOlkjFlsjMkzxuQaY8Z7O5iIiBzV0svzngDet9b+xBgTDkR4MZOIiHzHcY+ojTHRwFnAiwDW2nprbbmXc4mI+JWviw7y7JptXnnulpz6SADKgJeNMd8aY14wxkR+/0HGmJnGmA3GmA1lZWUeDyoi4osO1zVy97Isrnr2C95cv5Oa+kaPf4yWFHUoMAp4xlo7EqgG0r7/IGvtImttqrU2NSbmmHdBiogElDVbypj857X87csd3DgxnpW3nklEuOdv+G7JM5YAJdba9c0/XswxilpEJFgcqq7n/hU5ZHyzi8TuHVj86wmM7tfZax/vuEVtrd1rjCk2xgy21uYD5wE5XkskIuKjrLWszNrL3cuyKK9p4HeTErllUiJtQ9t49eO29Bj9d8AbzVd8FAI3ei+SiIjvKa2s5a5lWXyQvY/k3tG8NmMcw3pFtcrHblFRW2s3AanejSIi4nustfxjYwnzl+dQ1+gm7eIh3HRGAqFtWu9+Qa/MnIqIBILigzXMycjk04L9jE3owoIrk+kf06HVc6ioRUS+p8ltefXzIh75IJ82IYb5U4fzs7FxhIQYR/KoqEVEvmPrvipmp7v4Zmc55wyO4cFpyfTq1N7RTCpqERGgocnNs59s48nVBUS2bcPjV5/GFaf1whhnjqK/S0UtIkEvs6SC2xdvJm9vFZemxDLv8iS6dWjrdKx/UVGLSNCqbWjiz6u28PzaQmI6tmXRdaO5MKmn07H+g4paRILSl4UHSEt3UXSghmvG9iXt4qFEtw9zOtYxqahFJKhU1TawYGUeb6zfSVyXCN68aRwTErs5HetHqahFJGh8nFfK3CWZ7Kus5aYzEvjDhYO8MqLkab6fUETkFB2srue+d7NZumk3A7t34OnfTGBknPdGlDxNRS0iActay3LXHua9k01lbQO3njeQ/z53gNdHlDxNRS0iAWlvRS13Ls1iVe4+RvSJZuFPxjGkZ+uMKHmailpEAoq1lre+LubBFbk0uN3ccclQZpyRQBuHbv/2BBW1iASMHQeqSUvP5IvCA5zevwsLrkwhvtt/fOVAv6OiFhG/1+S2vPzZdh79MJ+wkBAeujKZq1P7Ojai5GkqahHxa/l7q5iV7mJzcTnnD+3O/KnJ9Ixu53Qsj1JRi4hfqm908/QnBTz1cQEd24Xxl2tGcllKrE+MKHmailpE/M6m4nJmL3aRv6+KK07rxT2XJdElMtzpWF6johYRv3GkvonHPsrnxU+3071jO168IZXzhvZwOpbXqahFxC98vm0/aemZ7DxYw7Xj4ph98RCi2vnmiJKnqahFxKdV1jbw0Ht5/P2rncR3jeCtmadzev+uTsdqVSpqEfFZq3L2ccfSTMqq6rj5rP7cdv4g2of71+3fnqCiFhGfc+BwHfPezeHdzbsZ0rMjz1+fSkqfTk7HcoyKWkR8hrWWdzbvZt472Ryua+QPFwzi12cPIDw0xOlojlJRi4hP2F1+hDuXZrE6r5SRcZ1YOD2FQT06Oh3LJ6ioRcRRbrflza92smBlHk1uy92XDuOGCfF+PaLkaSpqEXHM9v3VpKW7WL/9IBMTu/LQtBTiukY4HcvnqKhFpNU1Nrl58dPtPPbRFsJDQ3h4egpXpfYJyNu/PUFFLSKtKndPJbPTXbhKKrhgWA/mTx1Oj6jAGlHyNBW1iLSKusYmnlpdwNOfbKNTRBhP/WwUlyT31FF0C6ioRcTrNu44xOx0FwWlh7lyVG/umjKMzgE8ouRpLSpqY0wRUAU0AY3W2lRvhhKRwFBT38gjH+TzyudFxEa14+Ubx3Du4O5Ox/I7J3JEfa61dr/XkohIQPl0637SMlyUHDrC9eP7MeuiIXRoq0/iT4Z+10TEoyqONPDgilz+d0MxCd0iefvm8YxN6OJ0LL/W0qK2wIfGGAs8Z61d9P0HGGNmAjMB4uLiPJdQRPzGB9l7uWtpFgeq6/n12QO47fyBtAsLvhElT2tpUZ9hrd1ljOkOfGSMybPWrv3uA5rLexFAamqq9XBOEfFhZVV1zHsnmxWZexgWG8VLvxjD8N7RTscKGC0qamvtrubvS40xS4CxwNof/1UiEuistSz5dhf3Lc+hpq6J2ycPZuZZ/QlrE9wjSp523KI2xkQCIdbaqua3LwTu83oyEfFpu8qPMDcjkzVbyhjdrzMLp6eQ2L2D07ECUkuOqHsAS5ovSg8F3rTWvu/VVCLis9xuy+vrd7BwZR4WmHfZMK4fH0+IRpS85rhFba0tBEa0QhYR8XHbyg6Tlu7i66JDnDmwGw9OS6ZvF40oeZsuzxOR42pscrNoXSGPr9pKu9AQHvlJCj8ZrRGl1qKiFpEflb27gtnpLrJ2VXLx8J7ce0US3TtqRKk1qahF5JhqG5p4cvVWnl1TSOeIcJ65dhQXJ8c6HSsoqahF5D9sKDrIrHQXhWXV/GR0H+6cMpROERpRcoqKWkT+pbru6IjSq18U0Su6Pa/NGMtZg2KcjhX0VNQiAsDaLWXMychkd8URbhgfz+2TBxOp
ESWfoD8FkSBXXlPP/BW5LN5YQv+YSP5x83hS4zWi5EtU1CJBbGXmHu5als2hmnpuOTeRWyYlakTJB6moRYJQaVUt9yzLZmXWXpJ6RfHqjDEk9dKIkq9SUYsEEWstizeWMH9FLkcamph90RB+dWYCoRpR8mkqapEgUXywhrlLMlm3dT9j4juzYHoKA2I0ouQPVNQiAc7ttrz2RREPf5CPAe6/Iolrx/XTiJIfUVGLBLCC0ipmp2eyccchzh4UwwPThtOns0aU/I2KWiQANTS5WbS2kCdWbSWibRse+68RTBvZWyNKfkpFLRJgsnZVMGuxi5w9lUxJiWXeZUnEdGzrdCw5BSpqkQBR29DEE//cyqK1hXSJDOe560YzOamn07HEA1TUIgHgq+0HSUt3Ubi/mqtT+zL3kqFER4Q5HUs8REUt4scO1zWycGUef/tyB306t+f1X47jjIHdnI4lHqaiFvFTH+eXckdGJnsqa5kxMYH/mTyIiHD9kw5E+lMV8TOHquu5f3kOGd/uIrF7Bxb/egKj+3V2OpZ4kYpaxE9Ya3kvcy/3vJNFeU0Dv5+UyG8nJdI2VCNKgU5FLeIH9lXWctfSLD7M2Udy72j+9stxDI2NcjqWtBIVtYgPs9by9oZi5q/Ipb7RzZyLh/DLMzSiFGxU1CI+aueBGuYscfFZwQHGJnRh4fQUErpFOh1LHKCiFvExTW7LK58X8egH+bQJMcyfOpyfjY3TiFIQU1GL+JCt+6qYle7i253lnDs4hgemJdOrU3unY4nDVNQiPqC+0c2za7bx19UFRLZtwxM/PY3LR/TSiJIAKmoRx7lKypm12EXe3iouG9GLeZcNo2sHjSjJ/6eiFnHIkfomHl+1hefXFRLTsS3PX5/KBcN6OB1LfJCKWsQBXxYeIC3dRdGBGq4Z25c5lwwlqp1GlOTYWlzUxpg2wAZgl7X2Uu9FEglcVbUNLFiZxxvrdxLXJYI3bxrHhESNKMmPO5Ej6luBXEC3Q4mchNV5+7hjSRb7Kmu56YwE/njhYNqH6/ZvOb4WFbUxpg8wBXgA+INXE4kEmIPV9dz3bjZLN+1mUI8OPH3tBEbGaURJWq6lR9SPA7OAjj/0AGPMTGAmQFxc3CkHE/F31lrede1h3jvZVNU2cOt5A/ntuYmEh+r2bzkxxy1qY8ylQKm1dqMx5pwfepy1dhGwCCA1NdV6KqCIP9pbUcudS7NYlbuPEX078fD0FAb3/MHjHJEf1ZIj6onA5caYS4B2QJQx5nVr7c+9G03E/1hreevrYh5ckUuD282dU4Zy48QE2uj2bzkFxy1qa+0cYA5A8xH1/6ikRf7TjgPVpKVn8kXhAcb378qC6cn066oRJTl1uo5a5BQ1uS0vf7adRz/MJywkhIeuTOanY/rq9m/xmBMqamvtJ8AnXkki4ofy9x4dUdpcXM75Q7szf2oyPaPbOR1LAoyOqEVOQn2jm6c/KeCpjwuIahfGk9eM5NKUWB1Fi1eoqEVO0KbicmYvdpG/r4qpp/Xi7suS6BIZ7nQsCWAqapEWOlLfxJ8+zOelz7bTI6odL/0ilUlDNKIk3qeiFmmBz7ftJy09k50Ha7h2XBxpFw+ho0aUpJWoqEV+RGVtAw+9l8vfvyomvmsEb808ndP7d3U6lgQZFbXID1iVs487lmZSVlXHzWf157bzB2lESRyhohb5nv2H67j33Rze3bybIT078vz1qaT06eR0LAliKmqRZtZalm3azb3vZlNd18QfLxjEzWcP0IiSOE5FLQLsLj/CnUuzWJ1Xysi4oyNKA3toREl8g4pagprbbXnzq50sWJlHk9ty96XDuGFCvEaUxKeoqCVobd9fTVq6i/XbDzIxsSsPTUshrmuE07FE/oOKWoJOY5ObFz/dzmMfbSE8NISHp6dwVWof3f4tPktFLUElZ3cls9NdZO6q4MJhPbh/6nB6RGlESXybilqCQl1jE39dXcAzn2yjU0QYT/1sFJck99RRtPgFFbUEvI07DjE73UVB6WGuHNWbu6YMo7NGlMSPqKglYNXUN/LIB/m88nkRsVHtePnGMZw7uLvTsUROmIpaAtKnW/eTluGi5NARrh/fj1kXDaFDW/11F/+kv7kSUCpqGnjgvRze3lBC/26RvH3zeMYmdHE6lsgpUVFLwHg/ay93LcviYHU9vzlnALeeN5B2YRpREv+noha/V1ZVx7x3slmRuYdhsVG8/IsxDO8d7XQsEY9RUYvfstaS8c0u7luew5H6Jm6fPJiZZ/UnrI1GlCSwqKjFL+0qP8LcjEzWbCljdL/OLJyeQmL3Dk7HEvEKFbX4Fbfb8vr6HSxcmYcF7r08ietO70eIRpQkgKmoxW9sKztMWrqLr4sOcebAbjw4LZm+XTSiJIFPRS0+r6HJzfPrCnl81Vbah7Xh0atGMH1Ub93+LUFDRS0+LWtXBbPTXWTvruTi4T2594okunfUiJIEFxW1+KTahiaeXL2VZ9cU0jkinGeuHcXFybFOxxJxhIpafM6GooPMSndRWFbNVaP7cMeUoXSK0IiSBC8VtfiM6rqjI0qvflFEr+j2vDZjLGcNinE6lojjVNTiE9ZsKWNuRia7K45ww/h4bp88mEiNKIkAKmpxWHlNPfcvzyX9mxIGxETyj5vHkxqvESWR7zpuURtj2gFrgbbNj19srb3H28Ek8K3M3MNdy7I5VFPPLecmcsukRI0oiRxDS46o64BJ1trDxpgw4FNjzEpr7ZdeziYBqrSylruXZfN+9l6SekXx6owxJPXSiJLIDzluUVtrLXC4+Ydhzd+sN0NJYLLWsnhjCfcvz6G20c3si4bwqzMTCNWIksiPatE5amNMG2AjkAg8Za1df4zHzARmAsTFxXkyowSA4oM1zF2Sybqt+xkTf3REqX+MRpREWqJFRW2tbQJOM8Z0ApYYY4Zba7O+95hFwCKA1NRUHXELAE1uy2tfFPHIB/kY4P4rkrh2nEaURE7ECV31Ya0tN8Z8DFwEZB3v8RLcCkqrmJ2eycYdhzh7UAwPXplM707tnY4l4ndactVHDNDQXNLtgQuAhV5PJn6rocnNc2u28Zd/FhDRtg2P/dcIpo3UiJLIyWrJEXUs8GrzeeoQ4G1r7XLvxhJ/lbWrgtsXu8jdU8mUlFjmXZZETMe2TscS8WstuerDBYxshSzix2obmnh81VaeX1dI18hwnrtuNJOTejodSyQg6M5EOWVfbT9IWrqLwv3VXJ3al7lThhLdPszpWCIBQ0UtJ62qtoGH38/nb1/uoG+X9rxx0zgmJnZzOpZIwFFRy0n5OL+UOzIy2VNZy4yJCfzP5EFEhOuvk4g36F+WnJBD1fXcvzyHjG93MbB7B9J/M4FRcZ2djiUS0FTU0iLWWlZk7uGeZdlUHGng95MS+e2kRNqGakRJxNtU1HJc+ypruWtpFh/m7COlTzSv3zSOobFRTscSCRoqavlB1lre3lDM/BW51De6mXvJEGZM1IiSSGtTUcsx7TxQw5wlLj4rOMC4hC4snJ5CfLdIp2OJBCUVtfybJrfllc+LePSDfNqEGB6
YNpxrxsRpREnEQSpq+Zct+6qYtdjFpuJyJg3pzgPThhMbrRElEaepqIX6RjfPrtnGk6u30qFtKE/89DQuH9FLI0oiPkJFHeQ2F5czO91F3t4qLhvRi3mXDaNrB40oifgSFXWQOlLfxOOrtvD8ukJiOrbl+etTuWBYD6djicgxqKiD0BfbDjAnw0XRgRquGRvHnEuGENVOI0oivkpFHUQqaxtYsDKPN9fvpF/XCN781TgmDNCIkoivU1EHidV5+5ibkUVpVS2/OjOBP1wwmPbhuv1bxB+oqAPcgcN13Lc8h2WbdjO4R0eevW40p/Xt5HQsETkBKuoAZa3lXdce5r2TTVVtA7edP5D/PieR8FDd/i3ib1TUAWhvRS13Ls1kVW4pI/p24uHpKQzu2dHpWCJyklTUAcRay1tfF/Pgilwa3G7unDKUGycm0Ea3f4v4NRV1gCjaX82cjEy+KDzA+P5dWTA9mX5dNaIkEghU1H6uyW156dPt/OmjfMJCQlhwZTJXj+mr279FAoiK2o/l761i1uLNbC6p4Pyh3Zk/NZme0e2cjiUiHqai9kP1jW6e+riApz8pIKpdGE9eM5JLU2J1FC0SoFTUfmZTcTmzFm9my77DTD2tF3dflkSXyHCnY4mIF6mo/cSR+ib+9GE+L322nR5R7XjpF6lMGqIRJZFgoKL2A58X7CctI5OdB2v4+elxzL5oCB01oiQSNFTUPqziSAMPvZfLW18XE981grdmns7p/bs6HUtEWpmK2kd9lLOPO5dmUlZVx81n9+f/nD+IdmEaURIJRipqH7P/cB3z3slmuWsPQ3p25PnrU0np08npWCLioOMWtTGmL/Aa0AOwwCJr7RPeDhZsrLUs27Sbe9/NprquiT9eMIibzx6gESURadERdSPwR2vtN8aYjsBGY8xH1tocL2cLGrvLj3Dn0ixW55UyMu7oiNLAHhpREpGjjlvU1to9wJ7mt6uMMblAb0BFfYrcbsubX+1kwco8mtyWuy8dxg0T4jWiJCL/5oTOURtj4oGRwPpjvG8mMBMgLi7OE9kC2vb91cxOd/HV9oOckdiNh65Mpm+XCKdjiYgPanFRG2M6AOnAbdbayu+/31q7CFgEkJqaaj2WMMA0Nrl54dPt/PmjLYSHhvDw9BSuSu2j279F5Ae1qKiNMWEcLek3rLUZ3o0UuHJ2VzI73UXmrgouHNaD+6cOp0eURpRE5Me15KoPA7wI5FprH/N+pMBT19jEX1cX8Mwn2+gUEcZTPxvFJck9dRQtIi3SkiPqicB1QKYxZlPzz8211r7ntVQBZOOOQ8xOd1FQepgrR/XmrinD6KwRJRE5AS256uNTQId+J6i6rpFHP8znlc+L6BXdnlduHMM5g7s7HUtE/JDuTPSCdVvLmJORScmhI1w/vh+zLhpCh7b6rRaRk6P28KCKmgYeeC+HtzeU0L9bJG/fPJ6xCV2cjiUifk5F7SHvZ+3lrmVZHKyu5zfnDODW8wZqRElEPEJFfYrKqo6OKK3I3MOw2Che/sUYhveOdjqWiAQQFfVJstaS8c0u7luew5GGJm6fPJiZZ/UnrI1GlETEs1TUJ6HkUA1zl2SxdksZo/t1ZuH0FBK7d3A6logEKBX1CXC7La+v38HClXlY4N7Lk7ju9H6EaERJRLxIRd1C28oOk5bu4uuiQ5w5sBsPTtOIkoi0DhX1cTQ0uXl+XSGPr9pK+7A2PHrVCKaP6q3bv0Wk1aiof0TWrgpmp7vI3l3JJck9mXd5Et07akRJRFqXivoYahua+Ms/t/Lc2kI6R4Tz7M9HcdHwWKdjiUiQUlF/z4aig8xKd1FYVs1Vo/tw55RhREeEOR1LRIKYirrZ4bpGHnk/j9e+3EGv6Pa8NmMsZw2KcTqWiIiKGmDNljLmZmSyu+IIN4yP5/bJg4nUiJKI+IigbqPymnruW55Dxje7GBATyT9uHk9qvEaURMS3BG1Rv5e5h7uXZVFe08At5yZyy6REjSiJiE8KuqIurazl7mXZvJ+9l+G9o3h1xliSemlESUR8V9AUtbWWf2wsYf7yHGob3cy+aAi/OjOBUI0oiYiPC4qiLj5Yw9wlmazbup+x8V1YMD2Z/jEaURIR/xDQRd3ktrz2RREPv59PiIH7r0ji2nEaURIR/xKwRV1QWsWsxS6+2VnO2YNiePDKZHp3au90LBGRExZwRd3Q5Oa5Ndv4yz8LiGjbhj9fPYKpp2lESUT8V0AVdWZJBbcv3kze3iqmpMRy7+VJdOvQ1ulYIiKnJCCKurahiT+v2sIL67bTNTKc564bzeSknk7HEhHxCL8v6vWFB0jLyGT7/mquTu3L3ClDiW6vESURCRx+W9RVtQ0sfD+P17/cSd8u7XnjpnFMTOzmdCwREY/zy6L+OK+UO5Zksqeyll+ekcAfLxxERLhfvhQRkePyq3Y7WF3P/ctzWPLtLgZ270D6byYwKq6z07FERLzKL4raWsty1x7mvZNNxZEGfn/eQH577gDahmpESUQCn88X9b7KWu5YksWq3H2k9Inm9ZvGMTQ2yulYIiKtxmeL2lrL/35dzAPv5VLf6GbuJUOYMVEjSiISfI5b1MaYl4BLgVJr7XDvR4KdB2pIy3Dx+bYDjEvowsLpKcR3i2yNDy0i4nNackT9CvBX4DXvRjk6ovTyZ9t59MN8QkNCeGDacK4ZE6cRJREJasctamvtWmNMvLeDVNQ0cMPLX7GpuJxJQ7rzwLThxEZrRElExGPnqI0xM4GZAHFxcSf866Pah9KvawQ3Tozn8hG9NKIkItLMWGuP/6CjR9TLW3qOOjU11W7YsOEUo4mIBA9jzEZrbeqx3qdLKEREfJyKWkTExx23qI0xfwe+AAYbY0qMMb/0fiwREfl/WnLVxzWtEURERI5Npz5ERHycilpExMepqEVEfJyKWkTEx7XohpcTflJjyoAdJ/nLuwH7PRjHH+g1B75ge72g13yi+llrY471Dq8U9akwxmz4obtzApVec+ALttcLes2epFMfIiI+TkUtIuLjfLGoFzkdwAF6zYEv2F4v6DV7jM+doxYRkX/ni0fUIiLyHSpqEREf5zNFbYx5yRhTaozJcjpLazDG9DXGfGyMyTHGZBtjbnU6k7cZY9oZY74yxmxufs33Op2ptRhj2hhjvjXGLHc6S2swxhQZYzKNMZuMMUHxVUSMMZ2MMYuNMXnGmFxjzHiPPbevnKM2xpwFHAZea62vdu4kY0wsEGut/cYY0xHYCEy11uY4HM1rzNGvrxZprT1sjAkDPgVutdZ+6XA0rzPG/AFIBaKstZc6ncfbjDFFQKq1NmhueDHGvAqss9a+YIwJByKsteWeeG6fOaK21q4FDjqdo7VYa/dYa79pfrsKyAV6O5vKu+xRh5t/GNb8zTeOFLzIGNMHmAK84HQW8Q5jTDRwFvAigLW23lMlDT5U1MGs+WtSjgTWOxzF65pPAWwCSoGPrLUB/5qBx4FZgNvhHK3JAh8aYzY2f+HrQJcAlAEvN5/iesEYE+mpJ1dRO8wY0wFIB26z1lY6ncfbrLVN1trTgD7AWG
NMQJ/mMsZcCpRaazc6naWVnWGtHQVcDPy2+dRmIAsFRgHPWGtHAtVAmqeeXEXtoObztOnAG9baDKfztKbmTws/Bi5yOIq3TQQubz5n+xYwyRjzurORvM9au6v5+1JgCTDW2UReVwKUfOczxMUcLW6PUFE7pPk/1l4Ecq21jzmdpzUYY2KMMZ2a324PXADkORrKy6y1c6y1fay18cBPgdXW2p87HMurjDGRzf9BTvOn/xcCAX01l7V2L1BsjBnc/FPnAR67MOC4XzOxtTR/Ed1zgG7GmBLgHmvti86m8qqJwHVAZvM5W4C51tr3nIvkdbHAq8aYNhw9SHjbWhsUl6sFmR7AkqPHIoQCb1pr33c2Uqv4HfBG8xUfhcCNnnpin7k8T0REjk2nPkREfJyKWkTEx6moRUR8nIpaRMTHqahFRHycilpExMepqEVEfNz/BcHtvKh0gTY8AAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "df = pd.DataFrame({\"var1\":[1,2,3,4,5,6], \"var2\":[1,2,3,4,5,6]})\n", "plt.plot(df[\"var1\"], df[\"var2\"])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**With version 1.5+**" ] }, { "cell_type": "code", "execution_count": 174, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWoAAAD4CAYAAADFAawfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAeZklEQVR4nO3deXhV9aH18e8vJAESSJgChCEkEMaQIBBAwBEHVBxA6rXWqpVabG9t9W2vEHBCRQW1VmudcLZqvZYEUBAHigJOKCiczBBCIGFKGDKQkPH83j/I29dalADnZJ9hfZ6Hh0AOJ+swrGdns/eKsdYiIiK+K8TpACIi8uNU1CIiPk5FLSLi41TUIiI+TkUtIuLjQr3xpN26dbPx8fHeeGoRkYC0cePG/dbamGO9zytFHR8fz4YNG7zx1CIiAckYs+OH3qdTHyIiPk5FLSLi41TUIiI+TkUtIuLjVNQiIj6uRUVtjOlkjFlsjMkzxuQaY8Z7O5iIiBzV0svzngDet9b+xBgTDkR4MZOIiHzHcY+ojTHRwFnAiwDW2nprbbmXc4mI+JWviw7y7JptXnnulpz6SADKgJeNMd8aY14wxkR+/0HGmJnGmA3GmA1lZWUeDyoi4osO1zVy97Isrnr2C95cv5Oa+kaPf4yWFHUoMAp4xlo7EqgG0r7/IGvtImttqrU2NSbmmHdBiogElDVbypj857X87csd3DgxnpW3nklEuOdv+G7JM5YAJdba9c0/XswxilpEJFgcqq7n/hU5ZHyzi8TuHVj86wmM7tfZax/vuEVtrd1rjCk2xgy21uYD5wE5XkskIuKjrLWszNrL3cuyKK9p4HeTErllUiJtQ9t49eO29Bj9d8AbzVd8FAI3ei+SiIjvKa2s5a5lWXyQvY/k3tG8NmMcw3pFtcrHblFRW2s3AanejSIi4nustfxjYwnzl+dQ1+gm7eIh3HRGAqFtWu9+Qa/MnIqIBILigzXMycjk04L9jE3owoIrk+kf06HVc6ioRUS+p8ltefXzIh75IJ82IYb5U4fzs7FxhIQYR/KoqEVEvmPrvipmp7v4Zmc55wyO4cFpyfTq1N7RTCpqERGgocnNs59s48nVBUS2bcPjV5/GFaf1whhnjqK/S0UtIkEvs6SC2xdvJm9vFZemxDLv8iS6dWjrdKx/UVGLSNCqbWjiz6u28PzaQmI6tmXRdaO5MKmn07H+g4paRILSl4UHSEt3UXSghmvG9iXt4qFEtw9zOtYxqahFJKhU1TawYGUeb6zfSVyXCN68aRwTErs5HetHqahFJGh8nFfK3CWZ7Kus5aYzEvjDhYO8MqLkab6fUETkFB2srue+d7NZumk3A7t34OnfTGBknPdGlDxNRS0iActay3LXHua9k01lbQO3njeQ/z53gNdHlDxNRS0iAWlvRS13Ls1iVe4+RvSJZuFPxjGkZ+uMKHmailpEAoq1lre+LubBFbk0uN3ccclQZpyRQBuHbv/2BBW1iASMHQeqSUvP5IvCA5zevwsLrkwhvtt/fOVAv6OiFhG/1+S2vPzZdh79MJ+wkBAeujKZq1P7Ojai5GkqahHxa/l7q5iV7mJzcTnnD+3O/KnJ9Ixu53Qsj1JRi4hfqm908/QnBTz1cQEd24Xxl2tGcllKrE+MKHmailpE/M6m4nJmL3aRv6+KK07rxT2XJdElMtzpWF6johYRv3GkvonHPsrnxU+3071jO168IZXzhvZwOpbXqahFxC98vm0/aemZ7DxYw7Xj4ph98RCi2vnmiJKnqahFxKdV1jbw0Ht5/P2rncR3jeCtmadzev+uTsdqVSpqEfFZq3L2ccfSTMqq6rj5rP7cdv4g2of71+3fnqCiFhGfc+BwHfPezeHdzbsZ0rMjz1+fSkqfTk7HcoyKWkR8hrWWdzbvZt472Ryua+QPFwzi12cPIDw0xOlojlJRi4hP2F1+hDuXZrE6r5SRcZ1YOD2FQT06Oh3LJ6ioRcRRbrflza92smBlHk1uy92XDuOGCfF+PaLkaSpqEXHM9v3VpKW7WL/9IBMTu/LQtBTiukY4HcvnqKhFpNU1Nrl58dPtPPbRFsJDQ3h4egpXpfYJyNu/PUFFLSKtKndPJbPTXbhKKrhgWA/mTx1Oj6jAGlHyNBW1iLSKusYmnlpdwNOfbKNTRBhP/WwUlyT31FF0C6ioRcTrNu44xOx0FwWlh7lyVG/umjKMzgE8ouRpLSpqY0wRUAU0AY3W2lRvhhKRwFBT38gjH+TzyudFxEa14+Ubx3Du4O5Ox/I7J3JEfa61dr/XkohIQPl0637SMlyUHDrC9eP7MeuiIXRoq0/iT4Z+10TEoyqONPDgilz+d0MxCd0iefvm8YxN6OJ0LL/W0qK2wIfGGAs8Z61d9P0HGGNmAjMB4uLiPJdQRPzGB9l7uWtpFgeq6/n12QO47fyBtAsLvhElT2tpUZ9hrd1ljOkOfGSMybPWrv3uA5rLexFAamqq9XBOEfFhZVV1zHsnmxWZexgWG8VLvxjD8N7RTscKGC0qamvtrubvS40xS4CxwNof/1UiEuistSz5dhf3Lc+hpq6J2ycPZuZZ/QlrE9wjSp523KI2xkQCIdbaqua3LwTu83oyEfFpu8qPMDcjkzVbyhjdrzMLp6eQ2L2D07ECUkuOqHsAS5ovSg8F3rTWvu/VVCLis9xuy+vrd7BwZR4WmHfZMK4fH0+IRpS85rhFba0tBEa0QhYR8XHbyg6Tlu7i66JDnDmwGw9OS6ZvF40oeZsuzxOR42pscrNoXSGPr9pKu9AQHvlJCj8ZrRGl1qKiFpEflb27gtnpLrJ2VXLx8J7ce0US3TtqRKk1qahF5JhqG5p4cvVWnl1TSOeIcJ65dhQXJ8c6HSsoqahF5D9sKDrIrHQXhWXV/GR0H+6cMpROERpRcoqKWkT+pbru6IjSq18U0Su6Pa/NGMtZg2KcjhX0VNQiAsDaLWXMychkd8URbhgfz+2TBxOpESWfoD8FkSBXXlPP/BW5LN5YQv+YSP5x83hS4zWi5EtU1CJBbGXmHu5als2hmnpuOTeRWyYlakTJB6moR
YJQaVUt9yzLZmXWXpJ6RfHqjDEk9dKIkq9SUYsEEWstizeWMH9FLkcamph90RB+dWYCoRpR8mkqapEgUXywhrlLMlm3dT9j4juzYHoKA2I0ouQPVNQiAc7ttrz2RREPf5CPAe6/Iolrx/XTiJIfUVGLBLCC0ipmp2eyccchzh4UwwPThtOns0aU/I2KWiQANTS5WbS2kCdWbSWibRse+68RTBvZWyNKfkpFLRJgsnZVMGuxi5w9lUxJiWXeZUnEdGzrdCw5BSpqkQBR29DEE//cyqK1hXSJDOe560YzOamn07HEA1TUIgHgq+0HSUt3Ubi/mqtT+zL3kqFER4Q5HUs8REUt4scO1zWycGUef/tyB306t+f1X47jjIHdnI4lHqaiFvFTH+eXckdGJnsqa5kxMYH/mTyIiHD9kw5E+lMV8TOHquu5f3kOGd/uIrF7Bxb/egKj+3V2OpZ4kYpaxE9Ya3kvcy/3vJNFeU0Dv5+UyG8nJdI2VCNKgU5FLeIH9lXWctfSLD7M2Udy72j+9stxDI2NcjqWtBIVtYgPs9by9oZi5q/Ipb7RzZyLh/DLMzSiFGxU1CI+aueBGuYscfFZwQHGJnRh4fQUErpFOh1LHKCiFvExTW7LK58X8egH+bQJMcyfOpyfjY3TiFIQU1GL+JCt+6qYle7i253lnDs4hgemJdOrU3unY4nDVNQiPqC+0c2za7bx19UFRLZtwxM/PY3LR/TSiJIAKmoRx7lKypm12EXe3iouG9GLeZcNo2sHjSjJ/6eiFnHIkfomHl+1hefXFRLTsS3PX5/KBcN6OB1LfJCKWsQBXxYeIC3dRdGBGq4Z25c5lwwlqp1GlOTYWlzUxpg2wAZgl7X2Uu9FEglcVbUNLFiZxxvrdxLXJYI3bxrHhESNKMmPO5Ej6luBXEC3Q4mchNV5+7hjSRb7Kmu56YwE/njhYNqH6/ZvOb4WFbUxpg8wBXgA+INXE4kEmIPV9dz3bjZLN+1mUI8OPH3tBEbGaURJWq6lR9SPA7OAjj/0AGPMTGAmQFxc3CkHE/F31lrede1h3jvZVNU2cOt5A/ntuYmEh+r2bzkxxy1qY8ylQKm1dqMx5pwfepy1dhGwCCA1NdV6KqCIP9pbUcudS7NYlbuPEX078fD0FAb3/MHjHJEf1ZIj6onA5caYS4B2QJQx5nVr7c+9G03E/1hreevrYh5ckUuD282dU4Zy48QE2uj2bzkFxy1qa+0cYA5A8xH1/6ikRf7TjgPVpKVn8kXhAcb378qC6cn066oRJTl1uo5a5BQ1uS0vf7adRz/MJywkhIeuTOanY/rq9m/xmBMqamvtJ8AnXkki4ofy9x4dUdpcXM75Q7szf2oyPaPbOR1LAoyOqEVOQn2jm6c/KeCpjwuIahfGk9eM5NKUWB1Fi1eoqEVO0KbicmYvdpG/r4qpp/Xi7suS6BIZ7nQsCWAqapEWOlLfxJ8+zOelz7bTI6odL/0ilUlDNKIk3qeiFmmBz7ftJy09k50Ha7h2XBxpFw+ho0aUpJWoqEV+RGVtAw+9l8vfvyomvmsEb808ndP7d3U6lgQZFbXID1iVs487lmZSVlXHzWf157bzB2lESRyhohb5nv2H67j33Rze3bybIT078vz1qaT06eR0LAliKmqRZtZalm3azb3vZlNd18QfLxjEzWcP0IiSOE5FLQLsLj/CnUuzWJ1Xysi4oyNKA3toREl8g4pagprbbXnzq50sWJlHk9ty96XDuGFCvEaUxKeoqCVobd9fTVq6i/XbDzIxsSsPTUshrmuE07FE/oOKWoJOY5ObFz/dzmMfbSE8NISHp6dwVWof3f4tPktFLUElZ3cls9NdZO6q4MJhPbh/6nB6RGlESXybilqCQl1jE39dXcAzn2yjU0QYT/1sFJck99RRtPgFFbUEvI07DjE73UVB6WGuHNWbu6YMo7NGlMSPqKglYNXUN/LIB/m88nkRsVHtePnGMZw7uLvTsUROmIpaAtKnW/eTluGi5NARrh/fj1kXDaFDW/11F/+kv7kSUCpqGnjgvRze3lBC/26RvH3zeMYmdHE6lsgpUVFLwHg/ay93LcviYHU9vzlnALeeN5B2YRpREv+noha/V1ZVx7x3slmRuYdhsVG8/IsxDO8d7XQsEY9RUYvfstaS8c0u7luew5H6Jm6fPJiZZ/UnrI1GlCSwqKjFL+0qP8LcjEzWbCljdL/OLJyeQmL3Dk7HEvEKFbX4Fbfb8vr6HSxcmYcF7r08ietO70eIRpQkgKmoxW9sKztMWrqLr4sOcebAbjw4LZm+XTSiJIFPRS0+r6HJzfPrCnl81Vbah7Xh0atGMH1Ub93+LUFDRS0+LWtXBbPTXWTvruTi4T2594okunfUiJIEFxW1+KTahiaeXL2VZ9cU0jkinGeuHcXFybFOxxJxhIpafM6GooPMSndRWFbNVaP7cMeUoXSK0IiSBC8VtfiM6rqjI0qvflFEr+j2vDZjLGcNinE6lojjVNTiE9ZsKWNuRia7K45ww/h4bp88mEiNKIkAKmpxWHlNPfcvzyX9mxIGxETyj5vHkxqvESWR7zpuURtj2gFrgbbNj19srb3H28Ek8K3M3MNdy7I5VFPPLecmcsukRI0oiRxDS46o64BJ1trDxpgw4FNjzEpr7ZdeziYBqrSylruXZfN+9l6SekXx6owxJPXSiJLIDzluUVtrLXC4+Ydhzd+sN0NJYLLWsnhjCfcvz6G20c3si4bwqzMTCNWIksiPatE5amNMG2AjkAg8Za1df4zHzARmAsTFxXkyowSA4oM1zF2Sybqt+xkTf3REqX+MRpREWqJFRW2tbQJOM8Z0ApYYY4Zba7O+95hFwCKA1NRUHXELAE1uy2tfFPHIB/kY4P4rkrh2nEaURE7ECV31Ya0tN8Z8DFwEZB3v8RLcCkqrmJ2eycYdhzh7UAwPXplM707tnY4l4ndactVHDNDQXNLtgQuAhV5PJn6rocnNc2u28Zd/FhDRtg2P/dcIpo3UiJLIyWrJEXUs8GrzeeoQ4G1r7XLvxhJ/lbWrgtsXu8jdU8mUlFjmXZZETMe2TscS8WstuerDBYxshSzix2obmnh81VaeX1dI18hwnrtuNJOTejodSyQg6M5EOWVfbT9IWrqLwv3VXJ3al7lThhLdPszpWCIBQ0UtJ62qtoGH38/nb1/uoG+X9rxx0zgmJnZzOpZIwFFRy0n5OL+UOzIy2VNZy4yJCfzP5EFEhOuvk4g36F+WnJBD1fXcvzyHjG93MbB7B9J/M4FRcZ2djiUS0FTU0iLWWlZk7uGeZdlUHGng95MS+e2kRNqGakRJxNtU1HJc+ypruWtpFh/m7COlTzSv3zSOobFRTscSCRoqavlB1lre3lDM/BW51De6mXvJEGZM1IiSSGtTUcsx7TxQw5wlLj4rOMC4hC4snJ5CfLdIp2OJBCUVtfybJrfllc+LePSDfNqEGB6YNpxrxsRpREnEQSpq+Zct+6qYtdjFpuJyJg3pzgPThhMbrRElEaepqIX6RjfPrtnGk6u30qFtKE/89DQu
H9FLI0oiPkJFHeQ2F5czO91F3t4qLhvRi3mXDaNrB40oifgSFXWQOlLfxOOrtvD8ukJiOrbl+etTuWBYD6djicgxqKiD0BfbDjAnw0XRgRquGRvHnEuGENVOI0oivkpFHUQqaxtYsDKPN9fvpF/XCN781TgmDNCIkoivU1EHidV5+5ibkUVpVS2/OjOBP1wwmPbhuv1bxB+oqAPcgcN13Lc8h2WbdjO4R0eevW40p/Xt5HQsETkBKuoAZa3lXdce5r2TTVVtA7edP5D/PieR8FDd/i3ib1TUAWhvRS13Ls1kVW4pI/p24uHpKQzu2dHpWCJyklTUAcRay1tfF/Pgilwa3G7unDKUGycm0Ea3f4v4NRV1gCjaX82cjEy+KDzA+P5dWTA9mX5dNaIkEghU1H6uyW156dPt/OmjfMJCQlhwZTJXj+mr279FAoiK2o/l761i1uLNbC6p4Pyh3Zk/NZme0e2cjiUiHqai9kP1jW6e+riApz8pIKpdGE9eM5JLU2J1FC0SoFTUfmZTcTmzFm9my77DTD2tF3dflkSXyHCnY4mIF6mo/cSR+ib+9GE+L322nR5R7XjpF6lMGqIRJZFgoKL2A58X7CctI5OdB2v4+elxzL5oCB01oiQSNFTUPqziSAMPvZfLW18XE981grdmns7p/bs6HUtEWpmK2kd9lLOPO5dmUlZVx81n9+f/nD+IdmEaURIJRipqH7P/cB3z3slmuWsPQ3p25PnrU0np08npWCLioOMWtTGmL/Aa0AOwwCJr7RPeDhZsrLUs27Sbe9/NprquiT9eMIibzx6gESURadERdSPwR2vtN8aYjsBGY8xH1tocL2cLGrvLj3Dn0ixW55UyMu7oiNLAHhpREpGjjlvU1to9wJ7mt6uMMblAb0BFfYrcbsubX+1kwco8mtyWuy8dxg0T4jWiJCL/5oTOURtj4oGRwPpjvG8mMBMgLi7OE9kC2vb91cxOd/HV9oOckdiNh65Mpm+XCKdjiYgPanFRG2M6AOnAbdbayu+/31q7CFgEkJqaaj2WMMA0Nrl54dPt/PmjLYSHhvDw9BSuSu2j279F5Ae1qKiNMWEcLek3rLUZ3o0UuHJ2VzI73UXmrgouHNaD+6cOp0eURpRE5Me15KoPA7wI5FprH/N+pMBT19jEX1cX8Mwn2+gUEcZTPxvFJck9dRQtIi3SkiPqicB1QKYxZlPzz8211r7ntVQBZOOOQ8xOd1FQepgrR/XmrinD6KwRJRE5AS256uNTQId+J6i6rpFHP8znlc+L6BXdnlduHMM5g7s7HUtE/JDuTPSCdVvLmJORScmhI1w/vh+zLhpCh7b6rRaRk6P28KCKmgYeeC+HtzeU0L9bJG/fPJ6xCV2cjiUifk5F7SHvZ+3lrmVZHKyu5zfnDODW8wZqRElEPEJFfYrKqo6OKK3I3MOw2Che/sUYhveOdjqWiAQQFfVJstaS8c0u7luew5GGJm6fPJiZZ/UnrI1GlETEs1TUJ6HkUA1zl2SxdksZo/t1ZuH0FBK7d3A6logEKBX1CXC7La+v38HClXlY4N7Lk7ju9H6EaERJRLxIRd1C28oOk5bu4uuiQ5w5sBsPTtOIkoi0DhX1cTQ0uXl+XSGPr9pK+7A2PHrVCKaP6q3bv0Wk1aiof0TWrgpmp7vI3l3JJck9mXd5Et07akRJRFqXivoYahua+Ms/t/Lc2kI6R4Tz7M9HcdHwWKdjiUiQUlF/z4aig8xKd1FYVs1Vo/tw55RhREeEOR1LRIKYirrZ4bpGHnk/j9e+3EGv6Pa8NmMsZw2KcTqWiIiKGmDNljLmZmSyu+IIN4yP5/bJg4nUiJKI+IigbqPymnruW55Dxje7GBATyT9uHk9qvEaURMS3BG1Rv5e5h7uXZVFe08At5yZyy6REjSiJiE8KuqIurazl7mXZvJ+9l+G9o3h1xliSemlESUR8V9AUtbWWf2wsYf7yHGob3cy+aAi/OjOBUI0oiYiPC4qiLj5Yw9wlmazbup+x8V1YMD2Z/jEaURIR/xDQRd3ktrz2RREPv59PiIH7r0ji2nEaURIR/xKwRV1QWsWsxS6+2VnO2YNiePDKZHp3au90LBGRExZwRd3Q5Oa5Ndv4yz8LiGjbhj9fPYKpp2lESUT8V0AVdWZJBbcv3kze3iqmpMRy7+VJdOvQ1ulYIiKnJCCKurahiT+v2sIL67bTNTKc564bzeSknk7HEhHxCL8v6vWFB0jLyGT7/mquTu3L3ClDiW6vESURCRx+W9RVtQ0sfD+P17/cSd8u7XnjpnFMTOzmdCwREY/zy6L+OK+UO5Zksqeyll+ekcAfLxxERLhfvhQRkePyq3Y7WF3P/ctzWPLtLgZ270D6byYwKq6z07FERLzKL4raWsty1x7mvZNNxZEGfn/eQH577gDahmpESUQCn88X9b7KWu5YksWq3H2k9Inm9ZvGMTQ2yulYIiKtxmeL2lrL/35dzAPv5VLf6GbuJUOYMVEjSiISfI5b1MaYl4BLgVJr7XDvR4KdB2pIy3Dx+bYDjEvowsLpKcR3i2yNDy0i4nNackT9CvBX4DXvRjk6ovTyZ9t59MN8QkNCeGDacK4ZE6cRJREJasctamvtWmNMvLeDVNQ0cMPLX7GpuJxJQ7rzwLThxEZrRElExGPnqI0xM4GZAHFxcSf866Pah9KvawQ3Tozn8hG9NKIkItLMWGuP/6CjR9TLW3qOOjU11W7YsOEUo4mIBA9jzEZrbeqx3qdLKEREfJyKWkTExx23qI0xfwe+AAYbY0qMMb/0fiwREfl/WnLVxzWtEURERI5Npz5ERHycilpExMepqEVEfJyKWkTEx7XohpcTflJjyoAdJ/nLuwH7PRjHH+g1B75ge72g13yi+llrY471Dq8U9akwxmz4obtzApVec+ALttcLes2epFMfIiI+TkUtIuLjfLGoFzkdwAF6zYEv2F4v6DV7jM+doxYRkX/ni0fUIiLyHSpqEREf5zNFbYx5yRhTaozJcjpLazDG9DXGfGyMyTHGZBtjbnU6k7cZY9oZY74yxmxufs33Op2ptRhj2hhjvjXGLHc6S2swxhQZYzKNMZuMMUHxVUSMMZ2MMYuNMXnGmFxjzHiPPbevnKM2xpwFHAZea62vdu4kY0wsEGut/cYY0xHYCEy11uY4HM1rzNGvrxZprT1sjAkDPgVutdZ+6XA0rzPG/AFIBaKstZc6ncfbjDFFQKq1NmhueDHGvAqss9a+YIwJByKsteWeeG6fOaK21q4FDjqdo7VYa/dYa79pfrsKyAV6O5vKu+xRh5t/GNb8zTeOFLzIGNMHmAK84HQW8Q5jTDRwFvAigLW23lMlDT5U1MGs+WtSjgTWOxzF65pPAWwCSoGPrLUB/5qBx4FZgNvhHK3JAh8aYzY2f+HrQJcAlAEvN5/iesEYE+mpJ1dRO8wY0wFIB26z1lY6ncfbrLVN1trTgD7AWGNMQJ/mMsZcCpRaazc6naWVnWGtHQVcDPy2+dRmIAsFRgHPWGtHAtVAmqeeXEXtoObztOnAG9baDKfztKb
mTws/Bi5yOIq3TQQubz5n+xYwyRjzurORvM9au6v5+1JgCTDW2UReVwKUfOczxMUcLW6PUFE7pPk/1l4Ecq21jzmdpzUYY2KMMZ2a324PXADkORrKy6y1c6y1fay18cBPgdXW2p87HMurjDGRzf9BTvOn/xcCAX01l7V2L1BsjBnc/FPnAR67MOC4XzOxtTR/Ed1zgG7GmBLgHmvti86m8qqJwHVAZvM5W4C51tr3nIvkdbHAq8aYNhw9SHjbWhsUl6sFmR7AkqPHIoQCb1pr33c2Uqv4HfBG8xUfhcCNnnpin7k8T0REjk2nPkREfJyKWkTEx6moRUR8nIpaRMTHqahFRHycilpExMepqEVEfNz/BcHtvKh0gTY8AAAAAElFTkSuQmCC\n", "text/plain": [ "
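{ "cell_type": "markdown", "metadata": {}, "source": [ "The `data` keyword also works with matplotlib's object-oriented interface. A minimal sketch, assuming the same `df` as above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# same plot via the object-oriented API; column names are looked up in df\n", "fig, ax = plt.subplots()\n", "ax.plot(\"var1\", \"var2\", data=df)\n", "ax.set_xlabel(\"var1\")\n", "ax.set_ylabel(\"var2\")\n", "plt.show()" ] },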
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.plot(\"var1\", \"var2\", data=df)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Method Chaining" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With method chaining, it eliminates the need for making intermediary variables. You can process your data frame in a series of method calls when you enclose your data frame with parenthesis. Here's a very contrived example where you are creating these intermediary variables between transformation steps:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('data.csv')\n", "df = df.fillna(...)\n", "df = df.query('some_condition')\n", "df['new_column'] = df.cut(...)\n", "df = df.pivot_table(...)\n", "df = df.rename(...)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of the above, we can chain the methods as long as you surround the code with parenthesis:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(pd.read_csv('data.csv')\n", " .fillna(...)\n", " .query('some_condition')\n", " .assign(new_column = df.cut(...))\n", " .pivot_table(...)\n", " .rename(...)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is recommended that you don't use method chaining if it will significantly impact readability." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Also check out [pipe() method](http://pandas.pydata.org/pandas-docs/stable/basics.html#tablewise-function-application) as it can also be used to better organize data transformations in discrete functions.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### pipe() example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A short series of video [tutorials](https://calmcode.io/pandas-pipe/introduction.html) on using `pipe()`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] }, { "cell_type": "code", "execution_count": 175, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],\n", " 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})" ] }, { "cell_type": "code", "execution_count": 176, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupounces
0a4.0
1a3.0
2a12.0
3b6.0
4b7.5
5b8.0
6c3.0
7c5.0
8c6.0
\n", "
" ], "text/plain": [ " group ounces\n", "0 a 4.0\n", "1 a 3.0\n", "2 a 12.0\n", "3 b 6.0\n", "4 b 7.5\n", "5 b 8.0\n", "6 c 3.0\n", "7 c 5.0\n", "8 c 6.0" ] }, "execution_count": 176, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 177, "metadata": {}, "outputs": [], "source": [ "def mean_age_by_group(df, col):\n", " return df.groupby(col).mean()" ] }, { "cell_type": "code", "execution_count": 178, "metadata": {}, "outputs": [], "source": [ "def uppercase_column_names(df):\n", " df.columns = df.columns.str.upper()\n", " return df" ] }, { "cell_type": "code", "execution_count": 179, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OUNCES
group
a6.3333333
b7.1666667
c4.6666667
\n", "
" ], "text/plain": [ " OUNCES\n", "group \n", "a 6.3333333\n", "b 7.1666667\n", "c 4.6666667" ] }, "execution_count": 179, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df.pipe(mean_age_by_group, col='group')\n", " .pipe(uppercase_column_names)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#top)]" ] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "toc": { "toc_cell": false, "toc_number_sections": true, "toc_threshold": 6, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }