{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Manipulation Techniques\n", "\n", "There are lots of ways to change the shape and data in your `DataFrame`.\n", "\n", "Let's explore the popular options." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((475, 7), (998, 4))" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Setup\n", "from datetime import datetime\n", "import os\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from utils import render\n", "\n", "users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)\n", "transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)\n", "# Pop out a quick sanity check\n", "(users.shape, transactions.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assigning values\n", "Let's assume that we work on the CashBox Customer Support team and the user with name of **`Adrian Yang`** called and told us about an error in his balance. It should be **`$35`**, but is currently **`$30.01`**\n", "\n", "### With Chaining (don't do this)\n", "Let's walk ourselves right into a common problem..." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
first_namelast_nameemailemail_verifiedsignup_datereferral_countbalance
adrianAdrianYangadrian.yang@teamtreehouse.comTrue2018-04-28330.01
\n", "
" ], "text/plain": [ " first_name last_name email email_verified \\\n", "adrian Adrian Yang adrian.yang@teamtreehouse.com True \n", "\n", " signup_date referral_count balance \n", "adrian 2018-04-28 3 30.01 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# First let's make sure there is only one Adrian Yang\n", "users[(users.first_name == \"Adrian\") & (users.last_name == \"Yang\")]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> [Yo, Adrian! I did it!](https://www.youtube.com/watch?v=_61IZp_RNYg)\n", "\n", "Our goal is to update the balance, so the common thought process here, usually leads for us to just chain off the returned `DataFrame` like so..." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "adrian 30.01\n", "Name: balance, dtype: float64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "users[(users.first_name == \"Adrian\") & (users.last_name == \"Yang\")]['balance']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... and since that appears to work, maybe we'll go ahead and set it to the new value." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/craig/miniconda3/envs/panduhs/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", " \"\"\"Entry point for launching an IPython kernel.\n" ] } ], "source": [ "users[(users.first_name == \"Adrian\") & (users.last_name == \"Yang\")]['balance'] = 35.00" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Uh oh! As you can see in the **SettingWithCopyWarning** above that you should not attempt to chain and assign this way.\n", "\n", "### Using `loc`\n", "The [solution](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy) is to use the `DataFrame.loc` object to locate the row and specific column to update." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "first_name Adrian\n", "last_name Yang\n", "email adrian.yang@teamtreehouse.com\n", "email_verified True\n", "signup_date 2018-04-28\n", "referral_count 3\n", "balance 35\n", "Name: adrian, dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "users.loc[(users.first_name == \"Adrian\") & (users.last_name == \"Yang\"), 'balance'] = 35.00\n", "# Display our updated user with the new value assigned\n", "users.loc['adrian']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using `at`\n", "\n", "You can also use the `DataFrame.at` method to quickly set scalar values" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "users.at['adrian', 'balance'] = 35.00" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adding Rows\n", "\n", "So we changed the **`balance`** variable for Adrian, and now we need to track that the transaction occurred.\n", "\n", "Let's take a quick peek at the **`transcactions`** DataFrame." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
senderreceiveramountsent_date
0steinsmoyer49.032018-01-24
1holden4580joshua.henry34.642018-02-06
2rose.eatonemily.lewis62.672018-02-15
3lmoorekallen1.942018-03-05
4scott3928lmoore27.822018-03-10
\n", "
" ], "text/plain": [ " sender receiver amount sent_date\n", "0 stein smoyer 49.03 2018-01-24\n", "1 holden4580 joshua.henry 34.64 2018-02-06\n", "2 rose.eaton emily.lewis 62.67 2018-02-15\n", "3 lmoore kallen 1.94 2018-03-05\n", "4 scott3928 lmoore 27.82 2018-03-10" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transactions.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Let's build a new record\n", "record = dict(sender=np.nan, receiver='adrian', amount=4.99, sent_date=datetime.now().date())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Appending with [`DataFrame.append`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html)\n", "\n", "There is a method on `DataFrame`s that allow a way to [`append`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html) a new row to a new dataset. This returns a copy of the DataFrame with the new row(s) appended.\n", "\n", "The index for our **`transactions`** is auto assigned, so we'll set ths **`ignore_index`** keyword argument to `True`, so it gets generated.\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
senderreceiveramountsent_date
994king3246john25.372018-09-25
995shernandezkristen158175.772018-09-25
996leah6255jholloway63.622018-09-25
997pamelamichelle42252.542018-09-25
998NaNadrian4.992018-10-08
\n", "
" ], "text/plain": [ " sender receiver amount sent_date\n", "994 king3246 john 25.37 2018-09-25\n", "995 shernandez kristen1581 75.77 2018-09-25\n", "996 leah6255 jholloway 63.62 2018-09-25\n", "997 pamela michelle4225 2.54 2018-09-25\n", "998 NaN adrian 4.99 2018-10-08" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Remember this is returning a copy...\n", "transactions.append(record, ignore_index=True).tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are appending multiple rows, the more effective way to get the job done is by using the [`pandas.concat`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.htm) method." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting With Enlargement\n", "\n", "If you assign to a non-existent index key, the DataFrame will be enlarged automatically, the row will just be added.\n", "\n", "There is a slight problem here, as the index in the **`transactions`** DataFrame is autogenerated. A popular workaround is to figure out the last used index, and increment it." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
senderreceiveramountsent_date
994king3246john25.372018-09-25
995shernandezkristen158175.772018-09-25
996leah6255jholloway63.622018-09-25
997pamelamichelle42252.542018-09-25
998NaNadrian4.992018-10-08
\n", "
" ], "text/plain": [ " sender receiver amount sent_date\n", "994 king3246 john 25.37 2018-09-25\n", "995 shernandez kristen1581 75.77 2018-09-25\n", "996 leah6255 jholloway 63.62 2018-09-25\n", "997 pamela michelle4225 2.54 2018-09-25\n", "998 NaN adrian 4.99 2018-10-08" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Largest current record, incremented\n", "next_key = transactions.index.max() + 1\n", "transactions.loc[next_key] = record \n", "# Make sure it got added\n", "transactions.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adding Columns\n", "\n", "You can add columns much like you do rows, missing values will be set to `np.nan`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting With Enlargement" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
senderreceiveramountsent_datenotes
994king3246john25.372018-09-25NaN
995shernandezkristen158175.772018-09-25NaN
996leah6255jholloway63.622018-09-25NaN
997pamelamichelle42252.542018-09-25NaN
998NaNadrian4.992018-10-08Adrian called customer support to report billi...
\n", "
" ], "text/plain": [ " sender receiver amount sent_date \\\n", "994 king3246 john 25.37 2018-09-25 \n", "995 shernandez kristen1581 75.77 2018-09-25 \n", "996 leah6255 jholloway 63.62 2018-09-25 \n", "997 pamela michelle4225 2.54 2018-09-25 \n", "998 NaN adrian 4.99 2018-10-08 \n", "\n", " notes \n", "994 NaN \n", "995 NaN \n", "996 NaN \n", "997 NaN \n", "998 Adrian called customer support to report billi... " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "latest_id = transactions.index.max()\n", "# Add a new column named notes\n", "transactions.at[latest_id, 'notes'] = 'Adrian called customer support to report billing error.'\n", "transactions.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The column can be added and assigned from an expression." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Add a new column called large. This is a bad name and use of a column ;)\n", "transactions['large'] = transactions.amount > 70" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
senderreceiveramountsent_datenoteslarge
0steinsmoyer49.032018-01-24NaNFalse
1holden4580joshua.henry34.642018-02-06NaNFalse
2rose.eatonemily.lewis62.672018-02-15NaNFalse
3lmoorekallen1.942018-03-05NaNFalse
4scott3928lmoore27.822018-03-10NaNFalse
\n", "
" ], "text/plain": [ " sender receiver amount sent_date notes large\n", "0 stein smoyer 49.03 2018-01-24 NaN False\n", "1 holden4580 joshua.henry 34.64 2018-02-06 NaN False\n", "2 rose.eaton emily.lewis 62.67 2018-02-15 NaN False\n", "3 lmoore kallen 1.94 2018-03-05 NaN False\n", "4 scott3928 lmoore 27.82 2018-03-10 NaN False" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transactions.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Renaming Columns\n", "Renaming columns can be acheived using the [`DataFrame.rename`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html) method. You specify the current name(s) as the key(s) and the new name(s) as the value(s).\n", "\n", "By default this returns a copy, but you can use the `inplace` command to change the existing `DataFrame`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
senderreceiveramountsent_datenotesbig_sender
0steinsmoyer49.032018-01-24NaNFalse
1holden4580joshua.henry34.642018-02-06NaNFalse
2rose.eatonemily.lewis62.672018-02-15NaNFalse
3lmoorekallen1.942018-03-05NaNFalse
4scott3928lmoore27.822018-03-10NaNFalse
\n", "
" ], "text/plain": [ " sender receiver amount sent_date notes big_sender\n", "0 stein smoyer 49.03 2018-01-24 NaN False\n", "1 holden4580 joshua.henry 34.64 2018-02-06 NaN False\n", "2 rose.eaton emily.lewis 62.67 2018-02-15 NaN False\n", "3 lmoore kallen 1.94 2018-03-05 NaN False\n", "4 scott3928 lmoore 27.82 2018-03-10 NaN False" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transactions.rename(columns={'large': 'big_sender'}, inplace=True)\n", "transactions.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deleting Columns\n", "\n", "In addition to slicing a `DataFrame` to simply not include a specific existing column. You can also drop columns by name.\n", "\n", "Let's remove the two that we added, in place." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
senderreceiveramountsent_datebig_sender
0steinsmoyer49.032018-01-24False
1holden4580joshua.henry34.642018-02-06False
2rose.eatonemily.lewis62.672018-02-15False
3lmoorekallen1.942018-03-05False
4scott3928lmoore27.822018-03-10False
\n", "
" ], "text/plain": [ " sender receiver amount sent_date big_sender\n", "0 stein smoyer 49.03 2018-01-24 False\n", "1 holden4580 joshua.henry 34.64 2018-02-06 False\n", "2 rose.eaton emily.lewis 62.67 2018-02-15 False\n", "3 lmoore kallen 1.94 2018-03-05 False\n", "4 scott3928 lmoore 27.82 2018-03-10 False" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transactions.drop(columns=['notes'], inplace=True)\n", "transactions.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might also seen this done using the `axis` parameter.\n", "\n", "Let's get rid of the oddly named **`big_sender`** column. Why'd you let me name it that way?" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
senderreceiveramountsent_date
0steinsmoyer49.032018-01-24
1holden4580joshua.henry34.642018-02-06
2rose.eatonemily.lewis62.672018-02-15
3lmoorekallen1.942018-03-05
4scott3928lmoore27.822018-03-10
\n", "
" ], "text/plain": [ " sender receiver amount sent_date\n", "0 stein smoyer 49.03 2018-01-24\n", "1 holden4580 joshua.henry 34.64 2018-02-06\n", "2 rose.eaton emily.lewis 62.67 2018-02-15\n", "3 lmoore kallen 1.94 2018-03-05\n", "4 scott3928 lmoore 27.82 2018-03-10" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transactions.drop(['big_sender'], axis='columns', inplace=True)\n", "transactions.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deleting Rows\n", "You can see also use the [`DataFrame.drop`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) method to remove row(s) by index." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
senderreceiveramountsent_date
993colemansarah.evans36.292018-09-25
994king3246john25.372018-09-25
995shernandezkristen158175.772018-09-25
996leah6255jholloway63.622018-09-25
997pamelamichelle42252.542018-09-25
\n", "
" ], "text/plain": [ " sender receiver amount sent_date\n", "993 coleman sarah.evans 36.29 2018-09-25\n", "994 king3246 john 25.37 2018-09-25\n", "995 shernandez kristen1581 75.77 2018-09-25\n", "996 leah6255 jholloway 63.62 2018-09-25\n", "997 pamela michelle4225 2.54 2018-09-25" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "last_key = transactions.index.max()\n", "transactions.drop(index=[last_key], inplace=True)\n", "transactions.tail()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }