{ "cells": [ { "cell_type": "markdown", "id": "abb58d0b", "metadata": {}, "source": [ "# Introduction to Jupyter Notebooks and Pandas\n", "\n", "## What is a Jupyter Notebook?\n", "\n", "A [Jupyter](https://jupyter.org/) notebook is a document that can contain live code with results, visualizations, and rich text. It is widely used in data science and analytics. The cell below is a *code* cell. It contains a block of executable code.\n", "\n", "Run the code by clicking on the cell below and then clicking the \"Run\" button at the top (▶)." ] }, { "cell_type": "code", "execution_count": 1, "id": "7cccbac8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "30\n" ] } ], "source": [ "print(10 + 20)" ] }, { "cell_type": "markdown", "id": "d085a2a1", "metadata": { "id": "DiznwS2xem7h" }, "source": [ "▶️ Run the code cell below to import `unittest`, a module used for **🧭 Check Your Work** sections and the autograder." ] }, { "cell_type": "code", "execution_count": 2, "id": "7c2705fe", "metadata": { "id": "xOFvIip0em7h" }, "outputs": [], "source": [ "import unittest\n", "tc = unittest.TestCase()" ] }, { "cell_type": "markdown", "id": "1f0a04f0", "metadata": {}, "source": [ "## Types of cells\n", "\n", "There are three different types of cells.\n", "\n", "1. Code cell\n", "2. Markdown cell\n", "3. Raw cell\n", "\n", "We will most frequently use the first two types of cells." ] }, { "cell_type": "markdown", "id": "1de8ab20", "metadata": { "id": "Dg-bBkF8fGaN" }, "source": [ "---\n", "\n", "### 🎯 Challenge 1: Find the sum of a list\n", "\n", "#### 👇 Tasks\n", "\n", "- ✔️ Complete the code cell below to find the sum of all values in `my_list`.\n", "- ✔️ Store the result in a new variable named `result`." 
] }, { "cell_type": "code", "execution_count": 3, "id": "1b5e79e0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "478\n" ] } ], "source": [ "my_list = [11, 20, 52, 91, 90, 75, 74, 20, 21, 10, 14]\n", "\n", "### BEGIN SOLUTION\n", "result = 0\n", "\n", "for num in my_list:\n", "    result = result + num\n", "### END SOLUTION\n", "\n", "print(result)" ] }, { "cell_type": "markdown", "id": "fdd9c853", "metadata": { "id": "EdrK-mBsem7r" }, "source": [ "#### 🧭 Check Your Work\n", "\n", "- Once you're done, run the code cell below to test correctness.\n", "- ✔️ If the code cell runs without an error, you're good to move on.\n", "- ❌ If the code cell throws an error, go back and fix any incorrect parts." ] }, { "cell_type": "code", "execution_count": 4, "id": "b2d60d4c", "metadata": {}, "outputs": [], "source": [ "import unittest\n", "\n", "tc = unittest.TestCase()\n", "\n", "tc.assertEqual(result, 478)" ] }, { "cell_type": "markdown", "id": "fbb0c241", "metadata": {}, "source": [ "---\n", "\n", "## Introduction to Pandas\n", "\n", "Pandas is a Python *library* for data manipulation and analysis. Although it is now used across a wide range of data-related programming applications, it was initially developed for financial analysis at [AQR Capital Management](https://www.aqr.com/).\n", "\n", "![Pandas logo](https://github.com/bdi475/notebooks/blob/main/images/pandas-logo.png?raw=true)\n", "\n", "Note: A *library* in the context of programming is a collection of functions (and other data) that others have already written for you.\n", "\n", "Pandas is popular for many reasons:\n", "\n", "1. 🏃🏿‍♀️ It's fast (in most cases where the dataset fits in your computer's memory).\n", "2. 🪒 It supports most of the features required for data manipulation.\n", "3. 💡 Write less code. Get more done." 
] }, { "cell_type": "markdown", "id": "b2485fa9", "metadata": { "id": "Dg-bBkF8fGaN" }, "source": [ "---\n", "\n", "### 🎯 Challenge 2: Import packages\n", "\n", "#### 👇 Tasks\n", "\n", "- ✔️ Import the following Python packages.\n", "    1. `pandas`: Use alias `pd`.\n", "    2. `numpy`: Use alias `np`." ] }, { "cell_type": "code", "execution_count": 5, "id": "1a931be7", "metadata": { "id": "jfnhv8_Yem7j" }, "outputs": [], "source": [ "### BEGIN SOLUTION\n", "import pandas as pd\n", "import numpy as np\n", "### END SOLUTION" ] }, { "cell_type": "markdown", "id": "511e1493", "metadata": { "id": "WCOuwkrzem7j" }, "source": [ "#### 🧭 Check Your Work\n", "\n", "- Once you're done, run the code cell below to test correctness.\n", "- ✔️ If the code cell runs without an error, you're good to move on.\n", "- ❌ If the code cell throws an error, go back and fix any incorrect parts." ] }, { "cell_type": "code", "execution_count": 6, "id": "a8041cd8", "metadata": { "id": "WQ-COKL8em7k" }, "outputs": [], "source": [ "import sys\n", "tc.assertTrue(\"pd\" in globals(), \"Check whether you have correctly imported Pandas with an alias.\")\n", "tc.assertTrue(\"np\" in globals(), \"Check whether you have correctly imported NumPy with an alias.\")" ] }, { "cell_type": "markdown", "id": "49f78e8f", "metadata": {}, "source": [ "---\n", "\n", "### It all starts with a `Series`...\n", "\n", "The basic building block of Pandas is a `Series`. A `Series` is like a list, but with many more features.\n", "\n", "You can create a `Series` by passing a list of values to `pd.Series()`." 
] }, { "cell_type": "code", "execution_count": 7, "id": "06b980fe", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0    1.0\n", "1    2.0\n", "2    3.0\n", "3    NaN\n", "4    5.0\n", "5    6.0\n", "dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = pd.Series([1, 2, 3, np.nan, 5, 6])\n", "\n", "s" ] }, { "cell_type": "markdown", "id": "50cc82bb", "metadata": {}, "source": [ "### A few things to note here\n", "\n", "1. A `Series` looks similar to a Python `list`.\n", "2. The last line of the printed output tells us the data type of the values in the `Series` (`dtype: float64`).\n", "3. What the heck is `np.nan`?\n", "    - It is used to indicate a \"missing value\".\n", "    - `np.nan` is NOT the same as `0`.\n", "\n", "### Differences between a list and a Series" ] }, { "cell_type": "code", "execution_count": 8, "id": "372f1887", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'list'>\n" ] }, { "data": { "text/plain": [ "[1, 2, 3, 4, 1, 2, 3, 4]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "my_list = [1, 2, 3, 4]\n", "\n", "print(type(my_list))\n", "display(my_list * 2)" ] }, { "cell_type": "code", "execution_count": 9, "id": "0b35e389", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.series.Series'>\n" ] }, { "data": { "text/plain": [ "0    2\n", "1    4\n", "2    6\n", "3    8\n", "dtype: int64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "my_series = pd.Series([1, 2, 3, 4])\n", "\n", "print(type(my_series))\n", "display(my_series * 2)" ] }, { "cell_type": "markdown", "id": "84ee75dd", "metadata": {}, "source": [ "What happens when you multiply a Python `list` by the number `2`? It repeats the elements.\n", "\n", "How about a `Series`? It multiplies each element by `2`!" 
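, "\n", "\n",
"The same element-wise behavior applies to other arithmetic operators. As a minimal sketch (the names `prices` and `prices_list` and their values are made up for illustration), adding a number to a `Series` adds it to every element, while a plain `list` needs a loop or a list comprehension to do the same thing.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical values, used only for this illustration\n",
"prices = pd.Series([1, 2, 3, 4])\n",
"prices_list = [1, 2, 3, 4]\n",
"\n",
"# Series: the operation is applied to each element\n",
"print((prices + 10).tolist())  # [11, 12, 13, 14]\n",
"\n",
"# list: each element must be handled explicitly\n",
"print([p + 10 for p in prices_list])  # [11, 12, 13, 14]\n",
"```"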
] }, { "cell_type": "markdown", "id": "ff91c121", "metadata": { "id": "Dg-bBkF8fGaN" }, "source": [ "---\n", "\n", "### 🎯 Challenge 3: Create a new `Series`\n", "\n", "#### 👇 Tasks\n", "\n", "- ✔️ Create a new Pandas `Series` named `my_series` with the following three values: `10`, `20`, `30`.\n", "\n", "#### 🚀 Hint\n", "\n", "The code below creates a new Pandas `Series` with the values `1` and `2`.\n", "\n", "```python\n", "my_new_series = pd.Series([1, 2])\n", "```" ] }, { "cell_type": "code", "execution_count": 10, "id": "aaa01f50", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0    10\n", "1    20\n", "2    30\n", "dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### BEGIN SOLUTION\n", "my_series = pd.Series([10, 20, 30])\n", "### END SOLUTION\n", "\n", "my_series" ] }, { "cell_type": "markdown", "id": "ed26cf6d", "metadata": { "id": "EdrK-mBsem7r" }, "source": [ "#### 🧭 Check Your Work\n", "\n", "- Once you're done, run the code cell below to test correctness.\n", "- ✔️ If the code cell runs without an error, you're good to move on.\n", "- ❌ If the code cell throws an error, go back and fix any incorrect parts." ] }, { "cell_type": "code", "execution_count": 11, "id": "91ae33ae", "metadata": {}, "outputs": [], "source": [ "pd.testing.assert_series_equal(my_series, pd.Series([1, 2, 3]) * 10)" ] }, { "cell_type": "markdown", "id": "3870734e", "metadata": {}, "source": [ "---\n", "\n", "### Using `Series` methods\n", "\n", "A Pandas `Series` is similar to a Python `list`. However, a `Series` also provides many methods (functions that belong to the object) for you to use.\n", "\n", "As an example, `num_reviews.mean()` will return the average number of reviews." 
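, "\n", "\n",
"A few other commonly used `Series` methods are sketched below. The `ratings` name and its values are made up for illustration.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical values, used only for this illustration\n",
"ratings = pd.Series([4, 5, 3, 4])\n",
"\n",
"print(ratings.sum())   # 16\n",
"print(ratings.max())   # 5\n",
"print(ratings.min())   # 3\n",
"print(ratings.mean())  # 4.0\n",
"```"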
] }, { "cell_type": "code", "execution_count": 12, "id": "1982fc6a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ellipsis" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reviews_count = [12715, 2274, 2771, 3952, 528, 2766, 724]\n", "num_reviews = pd.Series(reviews_count)\n", "\n", "# YOUR CODE HERE\n", "..." ] }, { "cell_type": "markdown", "id": "370ca8b3-368f-460a-80ff-f19a0a6be150", "metadata": { "id": "Dg-bBkF8fGaN" }, "source": [ "---\n", "\n", "### 🎯 Challenge 4: Create a Pandas DataFrame\n", "\n", "#### 👇 Tasks\n", "\n", "- ✔️ You are given two lists, `product_names` and `num_reviews`, which contain the names of make-up products and their number of reviews on Sephora.com.\n", "- ✔️ Using the two lists, create a new Pandas `DataFrame` named `df_top_products` that has the following two columns:\n", "    1. `product_name`: Names of the products\n", "    2. `num_review`: Number of reviews\n", "- ✔️ Note that the column names are singular.\n", "\n", "#### 🚀 Hint\n", "\n", "The code below creates a new Pandas `DataFrame` from two series.\n", "\n", "```python\n", "my_new_dataframe = pd.DataFrame({\n", "    \"column_one\": my_series1,\n", "    \"column_two\": my_series2\n", "})\n", "```" ] }, { "cell_type": "code", "execution_count": 13, "id": "9371b5a5-da1d-4ca5-9343-aa2040c996d6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ " product_name num_review\n", "0 Laneige Lip Sleeping Mask 12715\n", "1 The Ordinary Hyaluronic Acid 2% + B5 2274\n", "2 Laneige Lip Glowy Balm 2766\n", "3 Chanel COCO MADEMOISELLE Eau de Parfum 724" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "product_names = [\n", "    \"Laneige Lip Sleeping Mask\",\n", "    \"The Ordinary Hyaluronic Acid 2% + B5\",\n", "    \"Laneige Lip Glowy Balm\",\n", "    \"Chanel COCO MADEMOISELLE Eau de Parfum\"\n", "]\n", "\n", "num_reviews = [\n", "    12715,\n", "    2274,\n", "    2766,\n", "    724\n", "]\n", "\n", "### BEGIN SOLUTION\n", "df_top_products = pd.DataFrame({\n", "    \"product_name\": product_names,\n", "    \"num_review\": num_reviews\n", "})\n", "### END SOLUTION\n", "\n", "display(df_top_products)" ] }, { "cell_type": "markdown", "id": "e5c7b7b4-b116-4e62-bcd8-c7d8b7334e77", "metadata": { "id": "EdrK-mBsem7r" }, "source": [ "#### 🧭 Check Your Work\n", "\n", "- Once you're done, run the code cell below to test correctness.\n", "- ✔️ If the code cell runs without an error, you're good to move on.\n", "- ❌ If the code cell throws an error, go back and fix any incorrect parts." 
] }, { "cell_type": "code", "execution_count": 14, "id": "dba4d57c-2908-42d6-8f2e-6f4e3a88de0d", "metadata": { "id": "DXDG3nzpem7r", "nbgrader": { "grade": true, "grade_id": "challenge-02", "locked": true, "points": "1", "solution": false } }, "outputs": [], "source": [ "pd.testing.assert_frame_equal(\n", "    df_top_products.reset_index(drop=True),\n", "    pd.DataFrame({\"product_name\": {0: \"Laneige Lip Sleeping Mask\",\n", "                                   1: \"The Ordinary Hyaluronic Acid 2% + B5\",\n", "                                   2: \"Laneige Lip Glowy Balm\",\n", "                                   3: \"Chanel COCO MADEMOISELLE Eau de Parfum\"},\n", "                  \"num_review\": {0: 12715, 1: 2274, 2: 2766, 3: 724}})\n", ")" ] }, { "cell_type": "markdown", "id": "c59589d5-8fca-42ac-b6eb-992d27206421", "metadata": { "id": "126Uw95nem7k" }, "source": [ "---\n", "\n", "### 📌 Load data" ] }, { "cell_type": "markdown", "id": "1f4c723f-5eec-4c05-a937-b7cd601c97c1", "metadata": { "id": "kH43B35Mem7l" }, "source": [ "The second part of today's lecture is all about **you**. 👻 Literally.\n", "\n", "▶️ Run the code cell below to create a new `DataFrame` named `df_you`." ] }, { "cell_type": "code", "execution_count": 15, "id": "a2cdcd9a-b41e-42f8-ad46-94a48f3628d2", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "CXJECgQiem7l", "outputId": "59ba80a2-892b-4e64-bbe9-1b8fc3a2880c" }, "outputs": [ { "data": { "text/plain": [ " name major1 \\\n", "0 Ahana Chakraborty Statistics \n", "1 Andrew Rozmus Psychology \n", "2 Anusha Adira Computer Engineering \n", "3 Arthur Pyptyuk Economics \n", "4 Aryajit Das Economics \n", "\n", " major2 city country \\\n", "0 Business & Informatics Chicago USA \n", "1 NaN Elmhurst USA \n", "2 Business Cupertino USA \n", "3 Business Hoffman Estates USA \n", "4 Business & Global Markets plus Society Streamwood USA \n", "\n", " fav_restaurant fav_movie has_iphone \n", "0 Poke Lab Shrek 2 True \n", "1 NaN NaN True \n", "2 Bangkok Thai Three Idiots True \n", "3 Sakanaya Hereditary True \n", "4 Dubai Grill Transformers: Age of Extinction True " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_you = pd.read_csv(\"https://github.com/bdi475/datasets/raw/main/about-you.csv\")\n", "\n", "# Used to keep a clean copy\n", "df_you_backup = df_you.copy()\n", "\n", "# head() displays the first 5 rows of a DataFrame\n", "df_you.head()" ] }, { "cell_type": "markdown", "id": "93cb3aa1-6901-45ac-9355-734ace9759cd", "metadata": { "id": "t6vVg2dJem7m" }, "source": [ "☝️ **Hold on.** Didn't we always create `DataFrame`s using `pd.DataFrame()`?\n", "\n", "Yes, so far. But we can also *import* existing data as a Pandas `DataFrame` using `pd.read_csv()`. There are many other similar import functions. For now, we'll mostly use `pd.read_csv()`.\n", "\n", "The table below explains each column in `df_you`." 
] }, { "cell_type": "markdown", "id": "babad61d-efee-49d2-b6b4-a521bb4f9ba4", "metadata": { "id": "LI33A8-jem7m" }, "source": [ "| Column Name | Description |\n", "|-------------------------|-----------------------------------------------------------|\n", "| name | Name of the person |\n", "| major1 | Major |\n", "| major2 | Second major OR minor (blank if no second major or minor) |\n", "| city | City the person is from |\n", "| country | Country the person is from |\n", "| fav_restaurant | Favorite restaurant (blank if no restaurant was given) |\n", "| fav_movie | Favorite movie (blank if no movie was given) |\n", "| has_iphone | Whether the person uses an iPhone |" ] }, { "cell_type": "markdown", "id": "930451c6-a001-4d5f-8537-6d7ca5002235", "metadata": { "id": "d5hE9oVSem7n" }, "source": [ "---\n", "\n", "### 📌 Concise summary of a `DataFrame`" ] }, { "cell_type": "markdown", "id": "664d9b88-6b5e-4d97-8848-9ab423dec6fb", "metadata": { "id": "lg5s3hW2em7n" }, "source": [ "👉 A common first step in working with a `DataFrame` is to use the `info()` method. `info()` prints a concise summary of a `DataFrame`, which includes:\n", "\n", "- Index data type\n", "- Column information: for each column, the following information is displayed:\n", "    - Number of non-missing values\n", "    - Data type of the column\n", "- Memory usage" ] }, { "cell_type": "markdown", "id": "c17d9d82-2330-4966-81d4-2c50dc6980df", "metadata": { "id": "v4myKhF8em7o" }, "source": [ "▶️ Run `df_you.info()` below to see the `info()` method in action." 
] }, { "cell_type": "code", "execution_count": 16, "id": "3e566e47-1a11-4110-ac18-9ee1beab9590", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "UfyZRM7rem7o", "outputId": "891d9445-721f-4c60-a345-36b679815cf0", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 45 entries, 0 to 44\n", "Data columns (total 8 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 name 45 non-null object\n", " 1 major1 44 non-null object\n", " 2 major2 36 non-null object\n", " 3 city 35 non-null object\n", " 4 country 37 non-null object\n", " 5 fav_restaurant 33 non-null object\n", " 6 fav_movie 31 non-null object\n", " 7 has_iphone 45 non-null bool \n", "dtypes: bool(1), object(7)\n", "memory usage: 2.6+ KB\n" ] } ], "source": [ "### BEGIN SOLUTION\n", "df_you.info()\n", "### END SOLUTION" ] }, { "cell_type": "markdown", "id": "0e8abd55-1c5e-48e4-8029-c61805535a5f", "metadata": { "id": "prZjAYOUem7o" }, "source": [ "👉 From the result of `df_you.info()`, we can learn several things:\n", "\n", "- There are 8 columns.\n", "- 7 out of 8 columns have the `object` data type.\n", "    - In Pandas, a string data type is shown as `object`, not `str`.\n", "    - We will skip the technical discussion for now.\n", "- The second line of the output tells us the number of rows (i.e., observations).\n", "- Some columns contain one or more missing values.\n", "    - Missing values are displayed as `NaN`.\n", "    - To denote a missing value, use NumPy's `np.nan` (more on this later)." ] }, { "cell_type": "markdown", "id": "ec0a09ad-6180-4ea9-a8df-20c3aaf8e7a4", "metadata": { "id": "OJ4qqvKIem7q" }, "source": [ "---\n", "\n", "### 🎯 Challenge 5: Display first/last/random rows" ] }, { "cell_type": "markdown", "id": "caec58af-5dcc-42e5-932a-38703bc25a66", "metadata": { "id": "TwWyyaQPem7q" }, "source": [ "▶️ Run `df_you.head()` to print the first 5 rows of `df_you`." 
] }, { "cell_type": "code", "execution_count": 17, "id": "5f802a73-5323-4184-827c-0bc5c9aa1469", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "OoYuVAyhem7q", "outputId": "8eea66d1-ee93-464c-9889-346fd512d07a", "tags": [] }, "outputs": [ { "data": { "text/plain": [ " name major1 \\\n", "0 Ahana Chakraborty Statistics \n", "1 Andrew Rozmus Psychology \n", "2 Anusha Adira Computer Engineering \n", "3 Arthur Pyptyuk Economics \n", "4 Aryajit Das Economics \n", "\n", " major2 city country \\\n", "0 Business & Informatics Chicago USA \n", "1 NaN Elmhurst USA \n", "2 Business Cupertino USA \n", "3 Business Hoffman Estates USA \n", "4 Business & Global Markets plus Society Streamwood USA \n", "\n", " fav_restaurant fav_movie has_iphone \n", "0 Poke Lab Shrek 2 True \n", "1 NaN NaN True \n", "2 Bangkok Thai Three Idiots True \n", "3 Sakanaya Hereditary True \n", "4 Dubai Grill Transformers: Age of Extinction True " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### BEGIN SOLUTION\n", "df_you.head()\n", "### END SOLUTION" ] }, { "cell_type": "markdown", "id": "1f572157-3d4d-46c6-b46e-89349c8f464a", "metadata": { "id": "TwWyyaQPem7q" }, "source": [ "▶️ Run `df_you.tail(4)` to print the last 4 rows of `df_you`." ] }, { "cell_type": "code", "execution_count": 18, "id": "d7b529c3-0972-4fe8-9106-4e56cf6b8315", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "OoYuVAyhem7q", "outputId": "8eea66d1-ee93-464c-9889-346fd512d07a" }, "outputs": [ { "data": { "text/plain": [ " name major1 major2 city \\\n", "41 Twinkle Yeruva Computer Science Business Schaumburg \n", "42 Valentina Flores Economics Business & French Chicago \n", "43 Victoria Hernandez Industrial Design Spanish East Moline \n", "44 Vikas Chavda Economics Business Geneva \n", "\n", " country fav_restaurant fav_movie has_iphone \n", "41 USA Sticky Rice Maze Runner True \n", "42 USA NaN Book of Life True \n", "43 USA Bangkok Thai Mamma Mia or Shrek True \n", "44 USA Yogi Kingsman: Secret Service True " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### BEGIN SOLUTION\n", "df_you.tail(4)\n", "### END SOLUTION" ] }, { "cell_type": "markdown", "id": "5ae9815a-a5b7-4747-9175-9535b133e614", "metadata": { "id": "TwWyyaQPem7q" }, "source": [ "▶️ Run `df_you.sample(3)` to print 3 randomly sampled rows from `df_you`." ] }, { "cell_type": "code", "execution_count": 19, "id": "ed993bdb-cb01-4a0e-8071-861107d96fda", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "OoYuVAyhem7q", "outputId": "8eea66d1-ee93-464c-9889-346fd512d07a" }, "outputs": [ { "data": { "text/plain": [ " name major1 major2 city country \\\n", "8 Cole Jordan Computer Science Business NaN USA \n", "21 Julia Kevin Bioengineering Business Elmhurst USA \n", "38 Spencer Sadler Computer Science Business Chicago USA \n", "\n", " fav_restaurant fav_movie has_iphone \n", "8 Chipotle Ratatouille True \n", "21 KoFusion Set it Up True \n", "38 Bangkok Thai Ratatouille True " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### BEGIN SOLUTION\n", "df_you.sample(3)\n", "### END SOLUTION" ] }, { "cell_type": "code", "execution_count": 20, "id": "2ebbb254-bae3-462a-87a1-08b37c7e3441", "metadata": { "nbgrader": { "grade": true, "grade_id": "challenge-03", "locked": true, "points": "1", "solution": false } }, "outputs": [], "source": [ "# Autograder" ] }, { "cell_type": "markdown", "id": "5aec9daa-8aea-4981-a3ae-38733d6e3841", "metadata": { "id": "hgU6chDUem7o" }, "source": [ "---\n", "\n", "### 📌 Number of rows and columns in a `DataFrame`" ] }, { "cell_type": "markdown", "id": "ed9edce3-3563-47a9-9226-5d5a520b0e46", "metadata": { "id": "3w2j6dSFem7p" }, "source": [ "👉 How many rows and columns does `df_you` have?\n", "\n", "▶️ Run `df_you.shape` below to see the *shape* (number of rows and columns) of the `DataFrame`." 
] }, { "cell_type": "code", "execution_count": 21, "id": "19fc92e6-1b9e-41eb-beeb-60a0a83804a8", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "go7OUMtTem7p", "outputId": "cfb21812-0eef-4bc8-e2a0-fc11756ce7a0" }, "outputs": [ { "data": { "text/plain": [ "(45, 8)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### BEGIN SOLUTION\n", "df_you.shape\n", "### END SOLUTION" ] }, { "cell_type": "markdown", "id": "1fc55735-9a55-4ea2-92fd-1940f15c9a0a", "metadata": { "id": "7J--UYFvem7p" }, "source": [ "👉 Can you store the number of rows and columns in variables?\n", "\n", "---\n", "\n", "- `df_you.shape` returns a `tuple` in `(num_rows, num_cols)` format.\n", "- What is a `tuple`? 🙀\n", "- A `tuple` is like a `list` that cannot be modified once created.\n", "\n", "▶️ Run the code cell below to see how a `tuple` is nearly identical to a `list`." ] }, { "cell_type": "code", "execution_count": 22, "id": "59f5c6b5-3fb9-448b-9b96-1f6a0351ef89", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vAnEKnhUem7p", "outputId": "23c744fd-7028-496b-a243-a015ea486481" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "my_list[1]=20\n", "my_tuple[1]=20\n" ] } ], "source": [ "# These two are nearly identical.\n", "# The only difference is that my_tuple cannot be modified once created\n", "my_list = [10, 20]\n", "my_tuple = (10, 20)\n", "\n", "print(f\"my_list[1]={my_list[1]}\") # prints my_list[1]=20\n", "print(f\"my_tuple[1]={my_tuple[1]}\") # prints my_tuple[1]=20" ] }, { "cell_type": "markdown", "id": "68fa4f8f-b6a5-49ca-91a8-b08a35dff0af", "metadata": { "id": "OJ4qqvKIem7q" }, "source": [ "---\n", "\n", "### 🎯 Challenge 6: Find the number of rows and columns in a `DataFrame`" ] }, { "cell_type": "markdown", "id": "c5a24363-6f2c-4573-8be3-576fafb7441f", "metadata": { "id": "TwWyyaQPem7q" }, "source": [ "#### 👇 Tasks\n", "\n", "- ✔️ Store the number of rows in `df_you` in a new variable named `num_rows`.\n", "- ✔️ Store the number of columns in `df_you` in a new variable named `num_cols`.\n", "- ✔️ Use `.shape`, not `len()`." ] }, { "cell_type": "code", "execution_count": 23, "id": "43c16cb2-af4e-4c46-9c24-ba087434b1da", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "OoYuVAyhem7q", "outputId": "8eea66d1-ee93-464c-9889-346fd512d07a", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "45\n", "8\n" ] } ], "source": [ "### BEGIN SOLUTION\n", "num_rows = df_you.shape[0]\n", "num_cols = df_you.shape[1]\n", "### END SOLUTION\n", "\n", "print(num_rows)\n", "print(num_cols)" ] }, { "cell_type": "markdown", "id": "2e863033-5619-47a1-9221-d8728e798c9c", "metadata": { "id": "EdrK-mBsem7r" }, "source": [ "#### 🧭 Check Your Work\n", "\n", "- Once you're done, run the code cell below to test correctness.\n", "- ✔️ If the code cell runs without an error, you're good to move on.\n", "- ❌ If the code cell throws an error, go back and fix any incorrect parts." ] }, { "cell_type": "code", "execution_count": 24, "id": "42e11cbf-924e-4679-8c9b-239a3049f796", "metadata": { "id": "DXDG3nzpem7r", "nbgrader": { "grade": true, "grade_id": "challenge-04", "locked": true, "points": "1", "solution": false }, "tags": [] }, "outputs": [], "source": [ "tc.assertEqual(num_rows, len(df_you.index), f\"Number of rows should be {len(df_you.index)}\")\n", "tc.assertEqual(num_cols, len(df_you.columns), f\"Number of columns should be {len(df_you.columns)}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" } }, "nbformat": 4, "nbformat_minor": 5 }