{ "cells": [ { "cell_type": "markdown", "id": "abb58d0b", "metadata": {}, "source": [ "# Introduction to Jupyter Notebooks and Pandas\n", "\n", "## What is a Jupyter Notebook?\n", "\n", "A [Jupyter](https://jupyter.org/) notebook is a document that can contain live code with results, visualizations, and rich text. It is widely used in data science and analytics. The cell below is a *code* cell. It contains a block of executable code.\n", "\n", "Run the code below by selecting the cell and clicking the \"Run\" button at the top (▶)." ] }, { "cell_type": "code", "execution_count": 1, "id": "7cccbac8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "30\n" ] } ], "source": [ "print(10 + 20)" ] }, { "cell_type": "markdown", "id": "d085a2a1", "metadata": { "id": "DiznwS2xem7h" }, "source": [ "▶️ Run the code cell below to import `unittest`, a module used for **🧭 Check Your Work** sections and the autograder." ] }, { "cell_type": "code", "execution_count": 2, "id": "7c2705fe", "metadata": { "id": "xOFvIip0em7h" }, "outputs": [], "source": [ "import unittest\n", "tc = unittest.TestCase()" ] }, { "cell_type": "markdown", "id": "1f0a04f0", "metadata": {}, "source": [ "## Types of cells\n", "\n", "There are three different types of cells.\n", "\n", "1. Code cell\n", "2. Markdown cell\n", "3. Raw cell\n", "\n", "We will most frequently use the first two types of cells." ] }, { "cell_type": "markdown", "id": "1de8ab20", "metadata": { "id": "Dg-bBkF8fGaN" }, "source": [ "---\n", "\n", "### 🎯 Challenge 1: Find the sum of a list\n", "\n", "#### 👇 Tasks\n", "\n", "- ✔️ Complete the code cell below to find the sum of all values in `my_list`.\n", "- ✔️ Store the result in a new variable named `result`."
] }, { "cell_type": "code", "execution_count": 3, "id": "1b5e79e0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "478\n" ] } ], "source": [ "my_list = [11, 20, 52, 91, 90, 75, 74, 20, 21, 10, 14]\n", "\n", "### BEGIN SOLUTION\n", "result = 0\n", "\n", "for num in my_list:\n", "    result = result + num\n", "### END SOLUTION\n", "\n", "print(result)" ] }, { "cell_type": "markdown", "id": "fdd9c853", "metadata": { "id": "EdrK-mBsem7r" }, "source": [ "#### 🧭 Check Your Work\n", "\n", "- Once you're done, run the code cell below to test correctness.\n", "- ✔️ If the code cell runs without an error, you're good to move on.\n", "- ❌ If the code cell throws an error, go back and fix any incorrect parts." ] }, { "cell_type": "code", "execution_count": 4, "id": "b2d60d4c", "metadata": {}, "outputs": [], "source": [ "import unittest\n", "\n", "tc = unittest.TestCase()\n", "\n", "tc.assertEqual(result, 478)" ] }, { "cell_type": "markdown", "id": "fbb0c241", "metadata": {}, "source": [ "---\n", "\n", "## Introduction to Pandas\n", "\n", "Pandas is a Python *library* for data manipulation and analysis. Although it is now widely used across data-related applications, it was initially developed for financial analysis at [AQR Capital Management](https://www.aqr.com/).\n", "\n", "![Pandas logo](https://github.com/bdi475/notebooks/blob/main/images/pandas-logo.png?raw=true)\n", "\n", "Note: A *library* in the context of programming is a collection of functions (and other data) that others have already written for you.\n", "\n", "Pandas is popular for many reasons:\n", "\n", "1. 🏃🏿‍♀️ It's fast (in most cases where the dataset fits in memory).\n", "2. 🪢 It supports most of the features required for data manipulation.\n", "3. 💡 Write less code. Get more done."
] }, { "cell_type": "markdown", "id": "b2485fa9", "metadata": { "id": "Dg-bBkF8fGaN" }, "source": [ "---\n", "\n", "### 🎯 Challenge 2: Import packages\n", "\n", "#### 👇 Tasks\n", "\n", "- ✔️ Import the following Python packages.\n", "    1. `pandas`: Use alias `pd`.\n", "    2. `numpy`: Use alias `np`." ] }, { "cell_type": "code", "execution_count": 5, "id": "1a931be7", "metadata": { "id": "jfnhv8_Yem7j" }, "outputs": [], "source": [ "### BEGIN SOLUTION\n", "import pandas as pd\n", "import numpy as np\n", "### END SOLUTION" ] }, { "cell_type": "markdown", "id": "511e1493", "metadata": { "id": "WCOuwkrzem7j" }, "source": [ "#### 🧭 Check Your Work\n", "\n", "- Once you're done, run the code cell below to test correctness.\n", "- ✔️ If the code cell runs without an error, you're good to move on.\n", "- ❌ If the code cell throws an error, go back and fix any incorrect parts." ] }, { "cell_type": "code", "execution_count": 6, "id": "a8041cd8", "metadata": { "id": "WQ-COKL8em7k" }, "outputs": [], "source": [ "import sys\n", "tc.assertTrue(\"pd\" in globals(), \"Check whether you have correctly imported Pandas with an alias.\")\n", "tc.assertTrue(\"np\" in globals(), \"Check whether you have correctly imported NumPy with an alias.\")" ] }, { "cell_type": "markdown", "id": "49f78e8f", "metadata": {}, "source": [ "---\n", "\n", "### It all starts with a `Series`...\n", "\n", "The basic building block of Pandas is a `Series`. A `Series` is like a list, but with many more features.\n", "\n", "You can create a `Series` by passing a list of values to `pd.Series()`."
] }, { "cell_type": "code", "execution_count": 7, "id": "06b980fe", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 2.0\n", "2 3.0\n", "3 NaN\n", "4 5.0\n", "5 6.0\n", "dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = pd.Series([1, 2, 3, np.nan, 5, 6])\n", "\n", "s" ] }, { "cell_type": "markdown", "id": "50cc82bb", "metadata": {}, "source": [ "### A few things to note here\n", "\n", "1. A `Series` looks similar to a Python `list`.\n", "2. The last line of the printed output tells us the data type of the values in the `Series` (`dtype: float64`).\n", "3. What the heck is `np.nan`?\n", "    - It is used to indicate a \"missing value\".\n", "    - `np.nan` is NOT the same as `0`.\n", "    - Because `np.nan` is a floating-point value, the integers in this `Series` are stored as `float64`.\n", "\n", "### Differences between a list and a Series" ] }, { "cell_type": "code", "execution_count": 8, "id": "372f1887", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'list'>\n" ] }, { "data": { "text/plain": [ "[1, 2, 3, 4, 1, 2, 3, 4]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "my_list = [1, 2, 3, 4]\n", "\n", "print(type(my_list))\n", "display(my_list * 2)" ] }, { "cell_type": "code", "execution_count": 9, "id": "0b35e389", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.series.Series'>\n" ] }, { "data": { "text/plain": [ "0 2\n", "1 4\n", "2 6\n", "3 8\n", "dtype: int64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "my_series = pd.Series([1, 2, 3, 4])\n", "\n", "print(type(my_series))\n", "display(my_series * 2)" ] }, { "cell_type": "markdown", "id": "84ee75dd", "metadata": {}, "source": [ "What happens when you multiply a Python `list` by the number `2`? It repeats the elements.\n", "\n", "How about a `Series`? It multiplies each element by `2`!"
] }, { "cell_type": "markdown", "id": "ff91c121", "metadata": { "id": "Dg-bBkF8fGaN" }, "source": [ "---\n", "\n", "### 🎯 Challenge 3: Create a new `Series`\n", "\n", "#### 👇 Tasks\n", "\n", "- ✔️ Create a new Pandas `Series` named `my_series` with the following three values: `10`, `20`, `30`.\n", "\n", "#### 🚀 Hint\n", "\n", "The code below creates a new Pandas `Series` with the values `1` and `2`.\n", "\n", "```python\n", "my_new_series = pd.Series([1, 2])\n", "```" ] }, { "cell_type": "code", "execution_count": 10, "id": "aaa01f50", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 10\n", "1 20\n", "2 30\n", "dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### BEGIN SOLUTION\n", "my_series = pd.Series([10, 20, 30])\n", "### END SOLUTION\n", "\n", "my_series" ] }, { "cell_type": "markdown", "id": "ed26cf6d", "metadata": { "id": "EdrK-mBsem7r" }, "source": [ "#### 🧭 Check Your Work\n", "\n", "- Once you're done, run the code cell below to test correctness.\n", "- ✔️ If the code cell runs without an error, you're good to move on.\n", "- ❌ If the code cell throws an error, go back and fix any incorrect parts." ] }, { "cell_type": "code", "execution_count": 11, "id": "91ae33ae", "metadata": {}, "outputs": [], "source": [ "pd.testing.assert_series_equal(my_series, pd.Series([1, 2, 3]) * 10)" ] }, { "cell_type": "markdown", "id": "3870734e", "metadata": {}, "source": [ "---\n", "\n", "### Using `Series` methods\n", "\n", "A Pandas `Series` is similar to a Python `list`. However, a `Series` also provides many methods (functions that belong to the object) for you to use.\n", "\n", "As an example, `num_reviews.mean()` will return the average number of reviews."
] }, { "cell_type": "code", "execution_count": 12, "id": "1982fc6a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ellipsis" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reviews_count = [12715, 2274, 2771, 3952, 528, 2766, 724]\n", "num_reviews = pd.Series(reviews_count)\n", "\n", "# YOUR CODE HERE\n", "..." ] }, { "cell_type": "markdown", "id": "370ca8b3-368f-460a-80ff-f19a0a6be150", "metadata": { "id": "Dg-bBkF8fGaN" }, "source": [ "---\n", "\n", "### 🎯 Challenge 4: Create a Pandas DataFrame\n", "\n", "#### 👇 Tasks\n", "\n", "- ✔️ You are given two lists, `product_names` and `num_reviews`, that contain the names of make-up products and their number of reviews on Sephora.com.\n", "- ✔️ Using the two lists, create a new Pandas `DataFrame` named `df_top_products` that has the following two columns:\n", "    1. `product_name`: Names of the products\n", "    2. `num_review`: Number of reviews\n", "- ✔️ Note that the column names are singular.\n", "\n", "#### 🚀 Hint\n", "\n", "The code below creates a new Pandas `DataFrame` from two series.\n", "\n", "```python\n", "my_new_dataframe = pd.DataFrame({\n", "    \"column_one\": my_series1,\n", "    \"column_two\": my_series2\n", "})\n", "```" ] }, { "cell_type": "code", "execution_count": 13, "id": "9371b5a5-da1d-4ca5-9343-aa2040c996d6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>product_name</th>\n", "      <th>num_review</th>\n", "    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>Laneige Lip Sleeping Mask</td>\n", "      <td>12715</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>The Ordinary Hyaluronic Acid 2% + B5</td>\n", "      <td>2274</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>Laneige Lip Glowy Balm</td>\n", "      <td>2766</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>Chanel COCO MADEMOISELLE Eau de Parfum</td>\n", "      <td>724</td>\n", "    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " product_name num_review\n", "0 Laneige Lip Sleeping Mask 12715\n", "1 The Ordinary Hyaluronic Acid 2% + B5 2274\n", "2 Laneige Lip Glowy Balm 2766\n", "3 Chanel COCO MADEMOISELLE Eau de Parfum 724" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "product_names = [\n", "    \"Laneige Lip Sleeping Mask\",\n", "    \"The Ordinary Hyaluronic Acid 2% + B5\",\n", "    \"Laneige Lip Glowy Balm\",\n", "    \"Chanel COCO MADEMOISELLE Eau de Parfum\"\n", "]\n", "\n", "num_reviews = [\n", "    12715,\n", "    2274,\n", "    2766,\n", "    724\n", "]\n", "\n", "### BEGIN SOLUTION\n", "df_top_products = pd.DataFrame({\n", "    \"product_name\": product_names,\n", "    \"num_review\": num_reviews\n", "})\n", "### END SOLUTION\n", "\n", "display(df_top_products)" ] }, { "cell_type": "markdown", "id": "e5c7b7b4-b116-4e62-bcd8-c7d8b7334e77", "metadata": { "id": "EdrK-mBsem7r" }, "source": [ "#### 🧭 Check Your Work\n", "\n", "- Once you're done, run the code cell below to test correctness.\n", "- ✔️ If the code cell runs without an error, you're good to move on.\n", "- ❌ If the code cell throws an error, go back and fix any incorrect parts."
] }, { "cell_type": "code", "execution_count": 14, "id": "dba4d57c-2908-42d6-8f2e-6f4e3a88de0d", "metadata": { "id": "DXDG3nzpem7r", "nbgrader": { "grade": true, "grade_id": "challenge-02", "locked": true, "points": "1", "solution": false } }, "outputs": [], "source": [ "pd.testing.assert_frame_equal(\n", "    df_top_products.reset_index(drop=True),\n", "    pd.DataFrame({\"product_name\": {0: \"Laneige Lip Sleeping Mask\",\n", "        1: \"The Ordinary Hyaluronic Acid 2% + B5\",\n", "        2: \"Laneige Lip Glowy Balm\",\n", "        3: \"Chanel COCO MADEMOISELLE Eau de Parfum\"},\n", "        \"num_review\": {0: 12715, 1: 2274, 2: 2766, 3: 724}})\n", ")" ] }, { "cell_type": "markdown", "id": "c59589d5-8fca-42ac-b6eb-992d27206421", "metadata": { "id": "126Uw95nem7k" }, "source": [ "---\n", "\n", "### 📌 Load data" ] }, { "cell_type": "markdown", "id": "1f4c723f-5eec-4c05-a937-b7cd601c97c1", "metadata": { "id": "kH43B35Mem7l" }, "source": [ "The second part of today's lecture is all about **you**. 👻 Literally.\n", "\n", "▶️ Run the code cell below to create a new `DataFrame` named `df_you`." ] }, { "cell_type": "code", "execution_count": 15, "id": "a2cdcd9a-b41e-42f8-ad46-94a48f3628d2", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "CXJECgQiem7l", "outputId": "59ba80a2-892b-4e64-bbe9-1b8fc3a2880c" }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>name</th>\n", "      <th>major1</th>\n", "      <th>major2</th>\n", "      <th>city</th>\n", "      <th>country</th>\n", "      <th>fav_restaurant</th>\n", "      <th>fav_movie</th>\n", "      <th>has_iphone</th>\n", "    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>Dane Jacobsen</td>\n", "      <td>Agr &amp; Consumer Economics</td>\n", "      <td>Business</td>\n", "      <td>Chicago</td>\n", "      <td>USA</td>\n", "      <td>Sushi Man</td>\n", "      <td>The Iron Claw</td>\n", "      <td>True</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>Bingqing Li</td>\n", "      <td>Statistics</td>\n", "      <td>Econometrics &amp; Quant Econ</td>\n", "      <td>Shanghai</td>\n", "      <td>China</td>\n", "      <td>Yogi</td>\n", "      <td>Frozen</td>\n", "      <td>True</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>Kam Chiu Chong</td>\n", "      <td>Economics</td>\n", "      <td>Information Systems</td>\n", "      <td>Hong Kong</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>True</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>Jiaqi Zeng</td>\n", "      <td>Mathematics</td>\n", "      <td>Computer Science</td>\n", "      <td>Zhuhai</td>\n", "      <td>China</td>\n", "      <td>Northern Cuisine</td>\n", "      <td>Interstellar</td>\n", "      <td>True</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>Rishi Shah</td>\n", "      <td>Information Sciences + DS</td>\n", "      <td>Business</td>\n", "      <td>Chicago</td>\n", "      <td>USA</td>\n", "      <td>Taco Bell</td>\n", "      <td>Forrest Gump</td>\n", "      <td>True</td>\n", "    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " name major1 major2 \\\n", "0 Dane Jacobsen Agr & Consumer Economics Business \n", "1 Bingqing Li Statistics Econometrics & Quant Econ \n", "2 Kam Chiu Chong Economics Information Systems \n", "3 Jiaqi Zeng Mathematics Computer Science \n", "4 Rishi Shah Information Sciences + DS Business \n", "\n", " city country fav_restaurant fav_movie has_iphone \n", "0 Chicago USA Sushi Man The Iron Claw True \n", "1 Shanghai China Yogi Frozen True \n", "2 Hong Kong NaN NaN NaN True \n", "3 Zhuhai China Northern Cuisine Interstellar True \n", "4 Chicago USA Taco Bell Forrest Gump True " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_you = pd.read_csv(\"https://github.com/bdi475/datasets/raw/main/about-you.csv\")\n", "\n", "# Keep a clean copy of the original data\n", "df_you_backup = df_you.copy()\n", "\n", "# head() displays the first 5 rows of a DataFrame\n", "df_you.head()" ] }, { "cell_type": "markdown", "id": "93cb3aa1-6901-45ac-9355-734ace9759cd", "metadata": { "id": "t6vVg2dJem7m" }, "source": [ "☝️ **Hold on.** Didn't we always create `DataFrame`s using `pd.DataFrame()`?\n", "\n", "Yes, but we can also *import* existing data as a Pandas `DataFrame` using `pd.read_csv()`. Pandas offers many similar import functions (for example, `pd.read_excel()` and `pd.read_json()`). For now, we'll mostly use `pd.read_csv()`.\n", "\n", "The table below explains each column in `df_you`."
] }, { "cell_type": "markdown", "id": "babad61d-efee-49d2-b6b4-a521bb4f9ba4", "metadata": { "id": "LI33A8-jem7m" }, "source": [ "| Column Name | Description |\n", "|-------------------------|-----------------------------------------------------------|\n", "| name | Full name |\n", "| major1 | Major |\n", "| major2 | Second major OR minor (blank if no second major or minor) |\n", "| city | City the person is from |\n", "| country | Country the person is from |\n", "| fav_restaurant | Favorite restaurant (blank if no restaurant was given) |\n", "| fav_movie | Favorite movie (blank if no movie was given) |\n", "| has_iphone | Whether the person uses an iPhone |" ] }, { "cell_type": "markdown", "id": "930451c6-a001-4d5f-8537-6d7ca5002235", "metadata": { "id": "d5hE9oVSem7n" }, "source": [ "---\n", "\n", "### 📌 Concise summary of a `DataFrame`" ] }, { "cell_type": "markdown", "id": "664d9b88-6b5e-4d97-8848-9ab423dec6fb", "metadata": { "id": "lg5s3hW2em7n" }, "source": [ "👉 A common first step in working with a `DataFrame` is to use the `info()` method. `info()` prints a concise summary of a `DataFrame`, including:\n", "\n", "- Index data type\n", "- Column information: for each column, the following information is displayed:\n", "    - Number of non-missing values\n", "    - Data type of the column\n", "- Memory usage" ] }, { "cell_type": "markdown", "id": "c17d9d82-2330-4966-81d4-2c50dc6980df", "metadata": { "id": "v4myKhF8em7o" }, "source": [ "▶️ Run `df_you.info()` below to see the `info()` method in action."
] }, { "cell_type": "code", "execution_count": 16, "id": "3e566e47-1a11-4110-ac18-9ee1beab9590", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "UfyZRM7rem7o", "outputId": "891d9445-721f-4c60-a345-36b679815cf0", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 118 entries, 0 to 117\n", "Data columns (total 8 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 name 118 non-null object\n", " 1 major1 117 non-null object\n", " 2 major2 93 non-null object\n", " 3 city 104 non-null object\n", " 4 country 107 non-null object\n", " 5 fav_restaurant 97 non-null object\n", " 6 fav_movie 99 non-null object\n", " 7 has_iphone 118 non-null bool \n", "dtypes: bool(1), object(7)\n", "memory usage: 6.7+ KB\n" ] } ], "source": [ "### BEGIN SOLUTION\n", "df_you.info()\n", "### END SOLUTION" ] }, { "cell_type": "markdown", "id": "0e8abd55-1c5e-48e4-8029-c61805535a5f", "metadata": { "id": "prZjAYOUem7o" }, "source": [ "👉 From the result of `df_you.info()`, we can learn several things:\n", "\n", "- There are 8 columns.\n", "- 7 out of 8 columns have the `object` data type.\n", "    - In Pandas, a string data type is shown as `object`, not `str`.\n", "    - We will skip the technical discussion for now.\n", "- The second line of the output (`RangeIndex: 118 entries`) tells us the number of rows (i.e., observations).\n", "- Some columns contain one or more missing values. For example, `major2` has only 93 non-null values out of 118 rows.\n", "    - Missing values are displayed as `NaN`.\n", "    - To denote a missing value, use NumPy's `np.nan` (more on this later)." ] }, { "cell_type": "markdown", "id": "ec0a09ad-6180-4ea9-a8df-20c3aaf8e7a4", "metadata": { "id": "OJ4qqvKIem7q" }, "source": [ "---\n", "\n", "### 🎯 Challenge 5: Display first/last/random rows" ] }, { "cell_type": "markdown", "id": "caec58af-5dcc-42e5-932a-38703bc25a66", "metadata": { "id": "TwWyyaQPem7q" }, "source": [ "▶️ Run `df_you.head()` to print the first 5 rows of `df_you`."
] }, { "cell_type": "code", "execution_count": 17, "id": "5f802a73-5323-4184-827c-0bc5c9aa1469", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "OoYuVAyhem7q", "outputId": "8eea66d1-ee93-464c-9889-346fd512d07a", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>name</th>\n", "      <th>major1</th>\n", "      <th>major2</th>\n", "      <th>city</th>\n", "      <th>country</th>\n", "      <th>fav_restaurant</th>\n", "      <th>fav_movie</th>\n", "      <th>has_iphone</th>\n", "    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>Dane Jacobsen</td>\n", "      <td>Agr &amp; Consumer Economics</td>\n", "      <td>Business</td>\n", "      <td>Chicago</td>\n", "      <td>USA</td>\n", "      <td>Sushi Man</td>\n", "      <td>The Iron Claw</td>\n", "      <td>True</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>Bingqing Li</td>\n", "      <td>Statistics</td>\n", "      <td>Econometrics &amp; Quant Econ</td>\n", "      <td>Shanghai</td>\n", "      <td>China</td>\n", "      <td>Yogi</td>\n", "      <td>Frozen</td>\n", "      <td>True</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>Kam Chiu Chong</td>\n", "      <td>Economics</td>\n", "      <td>Information Systems</td>\n", "      <td>Hong Kong</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>True</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>Jiaqi Zeng</td>\n", "      <td>Mathematics</td>\n", "      <td>Computer Science</td>\n", "      <td>Zhuhai</td>\n", "      <td>China</td>\n", "      <td>Northern Cuisine</td>\n", "      <td>Interstellar</td>\n", "      <td>True</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>Rishi Shah</td>\n", "      <td>Information Sciences + DS</td>\n", "      <td>Business</td>\n", "      <td>Chicago</td>\n", "      <td>USA</td>\n", "      <td>Taco Bell</td>\n", "      <td>Forrest Gump</td>\n", "      <td>True</td>\n", "    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " name major1 major2 \\\n", "0 Dane Jacobsen Agr & Consumer Economics Business \n", "1 Bingqing Li Statistics Econometrics & Quant Econ \n", "2 Kam Chiu Chong Economics Information Systems \n", "3 Jiaqi Zeng Mathematics Computer Science \n", "4 Rishi Shah Information Sciences + DS Business \n", "\n", " city country fav_restaurant fav_movie has_iphone \n", "0 Chicago USA Sushi Man The Iron Claw True \n", "1 Shanghai China Yogi Frozen True \n", "2 Hong Kong NaN NaN NaN True \n", "3 Zhuhai China Northern Cuisine Interstellar True \n", "4 Chicago USA Taco Bell Forrest Gump True " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### BEGIN SOLUTION\n", "df_you.head()\n", "### END SOLUTION" ] }, { "cell_type": "markdown", "id": "1f572157-3d4d-46c6-b46e-89349c8f464a", "metadata": { "id": "TwWyyaQPem7q" }, "source": [ "▶️ Run `df_you.tail(4)` to print the last 4 rows of `df_you`." ] }, { "cell_type": "code", "execution_count": 18, "id": "d7b529c3-0972-4fe8-9106-4e56cf6b8315", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "OoYuVAyhem7q", "outputId": "8eea66d1-ee93-464c-9889-346fd512d07a" }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>name</th>\n", "      <th>major1</th>\n", "      <th>major2</th>\n", "      <th>city</th>\n", "      <th>country</th>\n", "      <th>fav_restaurant</th>\n", "      <th>fav_movie</th>\n", "      <th>has_iphone</th>\n", "    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n", "      <th>114</th>\n", "      <td>Renee Crawley</td>\n", "      <td>Interdisciplinary Health</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>True</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>115</th>\n", "      <td>Kashvi Panjolia</td>\n", "      <td>Computer Science</td>\n", "      <td>NaN</td>\n", "      <td>Austin</td>\n", "      <td>USA</td>\n", "      <td>Oozu Ramen</td>\n", "      <td>NaN</td>\n", "      <td>True</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>116</th>\n", "      <td>Connor Gordon</td>\n", "      <td>Aerospace Engineering</td>\n", "      <td>Business</td>\n", "      <td>Sussex</td>\n", "      <td>USA</td>\n", "      <td>Papa Del's</td>\n", "      <td>Interstellar</td>\n", "      <td>True</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>117</th>\n", "      <td>Dhruv Nambisan</td>\n", "      <td>Industrial Engineering</td>\n", "      <td>Business</td>\n", "      <td>Chicago</td>\n", "      <td>USA</td>\n", "      <td>Fernandos Tacos</td>\n", "      <td>Django Unchained</td>\n", "      <td>True</td>\n", "    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " name major1 major2 city country \\\n", "114 Renee Crawley Interdisciplinary Health NaN NaN NaN \n", "115 Kashvi Panjolia Computer Science NaN Austin USA \n", "116 Connor Gordon Aerospace Engineering Business Sussex USA \n", "117 Dhruv Nambisan Industrial Engineering Business Chicago USA \n", "\n", " fav_restaurant fav_movie has_iphone \n", "114 NaN NaN True \n", "115 Oozu Ramen NaN True \n", "116 Papa Del's Interstellar True \n", "117 Fernandos Tacos Django Unchained True " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### BEGIN SOLUTION\n", "df_you.tail(4)\n", "### END SOLUTION" ] }, { "cell_type": "markdown", "id": "5ae9815a-a5b7-4747-9175-9535b133e614", "metadata": { "id": "TwWyyaQPem7q" }, "source": [ "▶️ Run `df_you.sample(3)` to print 3 randomly sampled rows from `df_you`. Because the rows are chosen at random, your output will likely differ from the one shown." ] }, { "cell_type": "code", "execution_count": 19, "id": "ed993bdb-cb01-4a0e-8071-861107d96fda", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "OoYuVAyhem7q", "outputId": "8eea66d1-ee93-464c-9889-346fd512d07a" }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>name</th>\n", "      <th>major1</th>\n", "      <th>major2</th>\n", "      <th>city</th>\n", "      <th>country</th>\n", "      <th>fav_restaurant</th>\n", "      <th>fav_movie</th>\n", "      <th>has_iphone</th>\n", "    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n", "      <th>71</th>\n", "      <td>Youngjin Song</td>\n", "      <td>Economics</td>\n", "      <td>Business</td>\n", "      <td>Ridgewood</td>\n", "      <td>USA</td>\n", "      <td>Mia Za's</td>\n", "      <td>The Revenant</td>\n", "      <td>True</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>66</th>\n", "      <td>Andrew Jordan</td>\n", "      <td>Economics</td>\n", "      <td>Business</td>\n", "      <td>Chicago</td>\n", "      <td>USA</td>\n", "      <td>Mo's Burritos</td>\n", "      <td>Interstellar</td>\n", "      <td>True</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>106</th>\n", "      <td>Will Neff</td>\n", "      <td>Information Sciences + DS</td>\n", "      <td>Political Science</td>\n", "      <td>Baltimore</td>\n", "      <td>USA</td>\n", "      <td>Yogi</td>\n", "      <td>Babylon or Perfect Blue</td>\n", "      <td>True</td>\n", "    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " name major1 major2 city \\\n", "71 Youngjin Song Economics Business Ridgewood \n", "66 Andrew Jordan Economics Business Chicago \n", "106 Will Neff Information Sciences + DS Political Science Baltimore \n", "\n", " country fav_restaurant fav_movie has_iphone \n", "71 USA Mia Za's The Revenant True \n", "66 USA Mo's Burritos Interstellar True \n", "106 USA Yogi Babylon or Perfect Blue True " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### BEGIN SOLUTION\n", "df_you.sample(3)\n", "### END SOLUTION" ] }, { "cell_type": "code", "execution_count": 20, "id": "2ebbb254-bae3-462a-87a1-08b37c7e3441", "metadata": { "nbgrader": { "grade": true, "grade_id": "challenge-03", "locked": true, "points": "1", "solution": false } }, "outputs": [], "source": [ "# Autograder" ] }, { "cell_type": "markdown", "id": "5aec9daa-8aea-4981-a3ae-38733d6e3841", "metadata": { "id": "hgU6chDUem7o" }, "source": [ "---\n", "\n", "### 📌 Number of rows and columns in a `DataFrame`" ] }, { "cell_type": "markdown", "id": "ed9edce3-3563-47a9-9226-5d5a520b0e46", "metadata": { "id": "3w2j6dSFem7p" }, "source": [ "👉 How many rows and columns does `df_you` have?\n", "\n", "▶️ Run `df_you.shape` below to see the *shape* (number of rows and columns) of the `DataFrame`."
] }, { "cell_type": "code", "execution_count": 21, "id": "19fc92e6-1b9e-41eb-beeb-60a0a83804a8", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "go7OUMtTem7p", "outputId": "cfb21812-0eef-4bc8-e2a0-fc11756ce7a0" }, "outputs": [ { "data": { "text/plain": [ "(118, 8)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### BEGIN SOLUTION\n", "df_you.shape\n", "### END SOLUTION" ] }, { "cell_type": "markdown", "id": "1fc55735-9a55-4ea2-92fd-1940f15c9a0a", "metadata": { "id": "7J--UYFvem7p" }, "source": [ "👉 Can you store the number of rows and columns in variables?\n", "\n", "---\n", "\n", "- `df_you.shape` returns a `tuple` in `(num_rows, num_cols)` format.\n", "- What is a `tuple`? 🙀\n", "- A `tuple` is like a `list`, except that it cannot be modified once created.\n", "\n", "▶️ Run the code cell below to see how a `tuple` is nearly identical to a `list`." ] }, { "cell_type": "code", "execution_count": 22, "id": "59f5c6b5-3fb9-448b-9b96-1f6a0351ef89", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vAnEKnhUem7p", "outputId": "23c744fd-7028-496b-a243-a015ea486481" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "my_list[1]=20\n", "my_tuple[1]=20\n" ] } ], "source": [ "# These two are nearly identical.\n", "# The only difference is that my_tuple cannot be modified once created\n", "my_list = [10, 20]\n", "my_tuple = (10, 20)\n", "\n", "print(f\"my_list[1]={my_list[1]}\") # prints 20\n", "print(f\"my_tuple[1]={my_tuple[1]}\") # also prints 20" ] }, { "cell_type": "markdown", "id": "68fa4f8f-b6a5-49ca-91a8-b08a35dff0af", "metadata": { "id": "OJ4qqvKIem7q" }, "source": [ "---\n", "\n", "### 🎯 Challenge 6: Find the number of rows and columns in a `DataFrame`" ] }, { "cell_type": "markdown", "id": "c5a24363-6f2c-4573-8be3-576fafb7441f", "metadata": { "id": "TwWyyaQPem7q" }, "source": [ "#### 👇 Tasks\n", "\n", "- ✔️ Store the number of rows in `df_you` in a new variable named `num_rows`.\n", "- ✔️ Store the number of columns in `df_you` in a new variable named `num_cols`.\n", "- ✔️ Use `.shape`, not `len()`." ] }, { "cell_type": "code", "execution_count": 23, "id": "43c16cb2-af4e-4c46-9c24-ba087434b1da", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "OoYuVAyhem7q", "outputId": "8eea66d1-ee93-464c-9889-346fd512d07a", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "118\n", "8\n" ] } ], "source": [ "### BEGIN SOLUTION\n", "num_rows = df_you.shape[0]\n", "num_cols = df_you.shape[1]\n", "### END SOLUTION\n", "\n", "print(num_rows)\n", "print(num_cols)" ] }, { "cell_type": "markdown", "id": "2e863033-5619-47a1-9221-d8728e798c9c", "metadata": { "id": "EdrK-mBsem7r" }, "source": [ "#### 🧭 Check Your Work\n", "\n", "- Once you're done, run the code cell below to test correctness.\n", "- ✔️ If the code cell runs without an error, you're good to move on.\n", "- ❌ If the code cell throws an error, go back and fix any incorrect parts." ] }, { "cell_type": "code", "execution_count": 24, "id": "42e11cbf-924e-4679-8c9b-239a3049f796", "metadata": { "id": "DXDG3nzpem7r", "nbgrader": { "grade": true, "grade_id": "challenge-04", "locked": true, "points": "1", "solution": false }, "tags": [] }, "outputs": [], "source": [ "tc.assertEqual(num_rows, len(df_you.index), f\"Number of rows should be {len(df_you.index)}\")\n", "tc.assertEqual(num_cols, len(df_you.columns), f\"Number of columns should be {len(df_you.columns)}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }