{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Setting a column based on another one and multiple conditions in pandas\n", "\n", "This short notebook shows a way to set the value of one column in a CSV file, that satisfies multiple conditions, by extracting information from another column using regular expressions. The [pandas library](https://pandas.pydata.org/) is the best tool I know for programmatically working with CSV files. It offers many methods for modifying columns and supports different data types.\n", "\n", "While working on [ukealong.com](https://ukealong.com/), a website with more than 1,000 play along videos for the ukulele, I faced such a task. The meta data for the videos, which includes artists and track names is maintained in a CSV file. The source data comes from several YouTube channels, that create these play along videos. Let's look at a small subset of the data, that includes 10 records and only the relevant columns for demonstrating the column setting." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
artiststracktitlechannel_id
0NaNNaN\"Something\" (The Beatles) Ukulele Play-Along!UCDglnz22aXMzpug5HbD1bCA
1NaNNaNArthur Theme Song Ukulele Play-Along!UCDglnz22aXMzpug5HbD1bCA
2NaNNaN\"Black Magic Woman\" (Santana) Ukulele Play-Along!UCDglnz22aXMzpug5HbD1bCA
3Israel Kamakawiwo'oleWhite Sandy Beach of Hawaii\"White Sandy Beach of Hawaii\" (Braddah IZ) Uku...UCDglnz22aXMzpug5HbD1bCA
4NaNWhat Will We Do With a Drunken SailorWhat Will We Do With a Drunken SailorUC6jZYLoYuV1CxY4Stzm6mgg
5NaNFrere JacquesFrere JacquesUCZjDV_1UEbVsAQA_q9tyTWw
6NaNNaN\"What's Up?\" (4 Non Blondes) Ukulele Play Along!UCDglnz22aXMzpug5HbD1bCA
7NaNNaN\"Lucy in the Sky with Diamonds\" (The Beatles) ...UCDglnz22aXMzpug5HbD1bCA
8NaNNaNLava Ukulele Play-along // Cynthia Lin (Chords...UCD2q6i-C0ZLJUK-VCp49TJA
9WeezerIsland In The Sun\"Island In The Sun\" Ukulele Play-Along!UCbQn9nS2_W-dsmnr3h_Rpvg
\n", "
" ], "text/plain": [ " artists track \\\n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 Israel Kamakawiwo'ole White Sandy Beach of Hawaii \n", "4 NaN What Will We Do With a Drunken Sailor \n", "5 NaN Frere Jacques \n", "6 NaN NaN \n", "7 NaN NaN \n", "8 NaN NaN \n", "9 Weezer Island In The Sun \n", "\n", " title channel_id \n", "0 \"Something\" (The Beatles) Ukulele Play-Along! UCDglnz22aXMzpug5HbD1bCA \n", "1 Arthur Theme Song Ukulele Play-Along! UCDglnz22aXMzpug5HbD1bCA \n", "2 \"Black Magic Woman\" (Santana) Ukulele Play-Along! UCDglnz22aXMzpug5HbD1bCA \n", "3 \"White Sandy Beach of Hawaii\" (Braddah IZ) Uku... UCDglnz22aXMzpug5HbD1bCA \n", "4 What Will We Do With a Drunken Sailor UC6jZYLoYuV1CxY4Stzm6mgg \n", "5 Frere Jacques UCZjDV_1UEbVsAQA_q9tyTWw \n", "6 \"What's Up?\" (4 Non Blondes) Ukulele Play Along! UCDglnz22aXMzpug5HbD1bCA \n", "7 \"Lucy in the Sky with Diamonds\" (The Beatles) ... UCDglnz22aXMzpug5HbD1bCA \n", "8 Lava Ukulele Play-along // Cynthia Lin (Chords... UCD2q6i-C0ZLJUK-VCp49TJA \n", "9 \"Island In The Sun\" Ukulele Play-Along! UCbQn9nS2_W-dsmnr3h_Rpvg " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%load_ext signature\n", "import re\n", "import pandas as pd\n", "\n", "df = pd.read_csv('data/ukealong-column-setting-sample.csv')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After importing videos from [this playlist](https://www.youtube.com/playlist?list=PLCD6F28C13A0A9AC2) by [Ukulele Underground](https://ukuleleunderground.com/), I had more than 100 records with empty `artists` and `track` columns. I updated some of them manually and thought there must be a better way. Most of their YouTube titles follow the format _\"TRACK\" (ARTIST) Ukulele Play-Along!_, so the relevant information could be extracted using regular expressions. I only wanted to update records where the `artists` and `track` columns were not already set, which led to the following condition:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "condition = (df['channel_id'] == 'UCDglnz22aXMzpug5HbD1bCA') & df['artists'].isna() & df['track'].isna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code cell below shows the two statements for extracting the information from the `title` column, which contains the title used on YouTube, using two regular expression. This can be done comfortably using the [pandas.Series.str.extract](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html) method. The `expand` parameter has to be set to `False` so a `Series` object is returned." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
artiststracktitlechannel_id
0The BeatlesSomething\"Something\" (The Beatles) Ukulele Play-Along!UCDglnz22aXMzpug5HbD1bCA
1NaNNaNArthur Theme Song Ukulele Play-Along!UCDglnz22aXMzpug5HbD1bCA
2SantanaBlack Magic Woman\"Black Magic Woman\" (Santana) Ukulele Play-Along!UCDglnz22aXMzpug5HbD1bCA
3Israel Kamakawiwo'oleWhite Sandy Beach of Hawaii\"White Sandy Beach of Hawaii\" (Braddah IZ) Uku...UCDglnz22aXMzpug5HbD1bCA
4NaNWhat Will We Do With a Drunken SailorWhat Will We Do With a Drunken SailorUC6jZYLoYuV1CxY4Stzm6mgg
5NaNFrere JacquesFrere JacquesUCZjDV_1UEbVsAQA_q9tyTWw
64 Non BlondesWhat's Up?\"What's Up?\" (4 Non Blondes) Ukulele Play Along!UCDglnz22aXMzpug5HbD1bCA
7The BeatlesLucy in the Sky with Diamonds\"Lucy in the Sky with Diamonds\" (The Beatles) ...UCDglnz22aXMzpug5HbD1bCA
8NaNNaNLava Ukulele Play-along // Cynthia Lin (Chords...UCD2q6i-C0ZLJUK-VCp49TJA
9WeezerIsland In The Sun\"Island In The Sun\" Ukulele Play-Along!UCbQn9nS2_W-dsmnr3h_Rpvg
\n", "
" ], "text/plain": [ " artists track \\\n", "0 The Beatles Something \n", "1 NaN NaN \n", "2 Santana Black Magic Woman \n", "3 Israel Kamakawiwo'ole White Sandy Beach of Hawaii \n", "4 NaN What Will We Do With a Drunken Sailor \n", "5 NaN Frere Jacques \n", "6 4 Non Blondes What's Up? \n", "7 The Beatles Lucy in the Sky with Diamonds \n", "8 NaN NaN \n", "9 Weezer Island In The Sun \n", "\n", " title channel_id \n", "0 \"Something\" (The Beatles) Ukulele Play-Along! UCDglnz22aXMzpug5HbD1bCA \n", "1 Arthur Theme Song Ukulele Play-Along! UCDglnz22aXMzpug5HbD1bCA \n", "2 \"Black Magic Woman\" (Santana) Ukulele Play-Along! UCDglnz22aXMzpug5HbD1bCA \n", "3 \"White Sandy Beach of Hawaii\" (Braddah IZ) Uku... UCDglnz22aXMzpug5HbD1bCA \n", "4 What Will We Do With a Drunken Sailor UC6jZYLoYuV1CxY4Stzm6mgg \n", "5 Frere Jacques UCZjDV_1UEbVsAQA_q9tyTWw \n", "6 \"What's Up?\" (4 Non Blondes) Ukulele Play Along! UCDglnz22aXMzpug5HbD1bCA \n", "7 \"Lucy in the Sky with Diamonds\" (The Beatles) ... UCDglnz22aXMzpug5HbD1bCA \n", "8 Lava Ukulele Play-along // Cynthia Lin (Chords... UCD2q6i-C0ZLJUK-VCp49TJA \n", "9 \"Island In The Sun\" Ukulele Play-Along! UCbQn9nS2_W-dsmnr3h_Rpvg " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[condition, 'artists'] = df.title.str.extract(r'\\((.+?)\\)', expand=False)\n", "df.loc[condition, 'track'] = df.title.str.extract(r'^\"(.+?)\"', expand=False)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After running this code on the actual CSV file, most of the Ukulele Underground records showed the correct information in the two columns.\n", "\n", "In the [first published version](https://github.com/yaph/ipython-notebooks/blob/52dfed8d661e2e4b76abfea8a3b82c538a50ec5e/pandas-column-setting.ipynb) of this notebook I imported the standard library [re module](https://docs.python.org/3/library/re.html) and defined two custom functions using regular expressions to extract the information. Thanks to redditor [otterom](https://www.reddit.com/user/otterom/), who [made me aware](https://www.reddit.com/r/Python/comments/frpu36/setting_a_column_based_on_another_one_and/flyj66f/) of the `extract` method, the code is now more succinct and straightforward.\n", "\n", "Quite frankly, writing the initial version of the code probably took longer than updating 100 records manually would have, let alone documenting the process in this notebook und updating it. On the other hand I can integrate the code in the process of adding new videos. Moreover, I learned something new, now have a reference for similar tasks in the future and it hopefully helps other people who face a similar problem.\n", "\n", "Last but not least, if you play the ukulele or want to learn it, I hope you'll dig [ukealong.com](https://ukealong.com/), share the site with your friends and show the creators of the play along videos some love by subscribing to their channels and liking their videos on YouTube. Playing music is really a joyful way to spend your time and helps to clear the mind in these times of uncertainty." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "Author: Ramiro Gómez\n", " • Last edited: March 31, 2020
Linux 5.3.0-42-generic - CPython 3.7.6 - IPython 7.13.0 - matplotlib 3.2.1 - numpy 1.18.1 - pandas 1.0.3" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%signature" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }