# Introduction to Jupyter Notebooks and Pandas

## What is a Jupyter Notebook?

A [Jupyter](https://jupyter.org/) notebook is a document that can contain live code with results, visualizations, and rich text. It is widely used in data science and analytics. The cell below is a *code* cell. It contains a block of executable code.

Run the code below by clicking on the cell and then clicking the "Run" button at the top (▶).

In [1]:
print(10 + 20)

30


▶️ Run the code cell below to import `unittest`, a module used for **🧭 Check Your Work** sections and the autograder.

In [2]:
import unittest
tc = unittest.TestCase()

## Types of cells

There are three different types of cells.

1. Code cell
2. Markdown cell
3. Raw cell

We will most frequently use the first two types of cells.

---

### 🎯 Challenge 1: Find the sum of a list

#### 👇 Tasks

- ✔️ Complete the code cell below to find the sum of all values in `my_list`.
- ✔️ Store the result in a new variable named `result`.

In [3]:
my_list = [11, 20, 52, 91, 90, 75, 74, 20, 21, 10, 14]

### BEGIN SOLUTION
result = 0

for num in my_list:
    result = result + num
### END SOLUTION

print(result)

478


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [4]:
import unittest

tc = unittest.TestCase()

tc.assertEqual(result, 478)

---

## Introduction to Pandas

Pandas is a Python *library* for data manipulation and analysis. Although it is now widely used across data-related programming applications, it was initially developed for financial analysis at [AQR Capital Management](https://www.aqr.com/).

![Pandas logo](https://github.com/bdi475/notebooks/blob/main/images/pandas-logo.png?raw=true)

Note: A *library* in the context of programming is a collection of functions (and other data) that others have already written for you.

Pandas is popular for many reasons:

1. üèÉüèø‚Äç‚ôÄÔ∏è It's fast (for most cases where the dataset can be loaded to your memory).
2. ü™í It supports most of the features required for data manipulation.
3. üí° Write less code. Get more done.

---

### 🎯 Challenge 2: Import packages

#### 👇 Tasks

- ✔️ Import the following Python packages.
    1. `pandas`: Use alias `pd`.
    2. `numpy`: Use alias `np`.

In [5]:
### BEGIN SOLUTION
import pandas as pd
import numpy as np
### END SOLUTION

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [6]:
import sys
tc.assertTrue("pd" in globals(), "Check whether you have correctly imported Pandas with an alias.")
tc.assertTrue("np" in globals(), "Check whether you have correctly imported NumPy with an alias.")

---

### It all starts with a `Series`...

The basic building block of Pandas is a `Series`. A `Series` is like a list, but with many more features.

You can create a `Series` by passing a list of values to `pd.Series()`.

In [7]:
s = pd.Series([1, 2, 3, np.nan, 5, 6])

s

0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    6.0
dtype: float64

### A few things to note here

1. A `Series` looks similar to a Python `list`.
2. The last line of the printed output tells us the data type of the values in the `Series` (`dtype: float64`).
3. What the heck is `np.nan`?
    - It is used to indicate a "missing value".
    - `np.nan` is NOT the same as `0`.
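
To make the difference between a missing value and `0` concrete, here is a minimal sketch that reuses the same values as the `Series` above. The variable names (`missing_mask`, `mean_skipping_nan`, and so on) are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, 5, 6])

# np.nan is a float, which is why the integers above are stored as float64
print(type(np.nan))

# isna() flags missing entries element by element
missing_mask = s.isna()
num_missing = int(missing_mask.sum())
print(num_missing)

# By default, mean() skips missing values instead of treating them as 0
mean_skipping_nan = s.mean()                 # (1 + 2 + 3 + 5 + 6) / 5
mean_if_nan_were_zero = s.fillna(0).mean()   # (1 + 2 + 3 + 0 + 5 + 6) / 6
print(mean_skipping_nan, mean_if_nan_were_zero)
```

The two averages differ because replacing the missing value with `0` adds an extra observation to the calculation.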
    
### Differences between a list and a Series

In [8]:
my_list = [1, 2, 3, 4]

print(type(my_list))
display(my_list * 2)

<class 'list'>


[1, 2, 3, 4, 1, 2, 3, 4]

In [9]:
my_series = pd.Series([1, 2, 3, 4])

print(type(my_series))
display(my_series * 2)

<class 'pandas.core.series.Series'>


0    2
1    4
2    6
3    8
dtype: int64

What happens when you multiply a Python `list` by the number `2`? It repeats the elements.

How about a `Series`? It multiplies each element by `2`!
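
As a rough sketch of what this means in practice, the snippet below doubles every element both ways. The names `doubled_list`, `doubled_series`, and `shifted_series` are made up for this illustration.

```python
import pandas as pd

my_list = [1, 2, 3, 4]
my_series = pd.Series(my_list)

# With a plain list, doubling each element takes an explicit loop or comprehension
doubled_list = [value * 2 for value in my_list]

# With a Series, arithmetic is applied to every element at once
doubled_series = my_series * 2
shifted_series = my_series + 10

print(doubled_list)
print(doubled_series.tolist())
print(shifted_series.tolist())
```

Other arithmetic operators (`+`, `-`, `/`) behave the same way on a `Series`.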

---

### 🎯 Challenge 3: Create new `Series`

#### 👇 Tasks

- ✔️ Create a new Pandas `Series` named `my_series` with the following three values: `10`, `20`, `30`.

#### 🚀 Hint

The code below creates a new Pandas `Series` with the values `1` and `2`.

```python
my_new_series = pd.Series([1, 2])
```

In [10]:
### BEGIN SOLUTION
my_series = pd.Series([10, 20, 30])
### END SOLUTION

my_series

0    10
1    20
2    30
dtype: int64

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [11]:
pd.testing.assert_series_equal(my_series, pd.Series([1, 2, 3]) * 10)

---

### Using `Series` methods

A pandas `Series` is similar to a Python `list`. However, a `Series` provides many methods (equivalent to functions) for you to use.

As an example, `num_reviews.mean()` will return the average number of reviews.

In [12]:
reviews_count = [12715, 2274, 2771, 3952, 528, 2766, 724]
num_reviews = pd.Series(reviews_count)

# YOUR CODE HERE
...

Ellipsis
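
One possible way to fill in the cell above is sketched below, using the same `reviews_count` values. Besides `mean()`, it tries a few other common `Series` methods; the result variable names are only illustrative.

```python
import pandas as pd

reviews_count = [12715, 2274, 2771, 3952, 528, 2766, 724]
num_reviews = pd.Series(reviews_count)

# Each method summarizes the whole Series in a single value
average_reviews = num_reviews.mean()
total_reviews = num_reviews.sum()
most_reviews = num_reviews.max()
fewest_reviews = num_reviews.min()

print(f"average: {average_reviews:.1f}")
print(f"total: {total_reviews}, max: {most_reviews}, min: {fewest_reviews}")
```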

---

### 🎯 Challenge 4: Create a Pandas DataFrame

#### 👇 Tasks

- ✔️ You are given two lists, `product_names` and `num_reviews`, that contain the names of makeup products and their number of reviews on Sephora.com.
- ✔️ Using the two lists, create a new Pandas `DataFrame` named `df_top_products` that has the following two columns:
    1. `product_name`: Names of the products
    2. `num_review`: Number of reviews
- ✔️ Note that the column names are singular.

#### 🚀 Hint

The code below creates a new Pandas `DataFrame` from two series.

```python
my_new_dataframe = pd.DataFrame({
    "column_one": my_series1,
    "column_two": my_series2
})
```

In [13]:
product_names = [
    "Laneige Lip Sleeping Mask",
    "The Ordinary Hyaluronic Acid 2% + B5",
    "Laneige Lip Glowy Balm",
    "Chanel COCO MADEMOISELLE Eau de Parfum"
]

num_reviews = [
    12715,
    2274,
    2766,
    724
]

### BEGIN SOLUTION
df_top_products = pd.DataFrame({
    "product_name": product_names,
    "num_review": num_reviews
})
### END SOLUTION

display(df_top_products)

|   | product_name                           | num_review |
|---|----------------------------------------|------------|
| 0 | Laneige Lip Sleeping Mask              | 12715      |
| 1 | The Ordinary Hyaluronic Acid 2% + B5   | 2274       |
| 2 | Laneige Lip Glowy Balm                 | 2766       |
| 3 | Chanel COCO MADEMOISELLE Eau de Parfum | 724        |


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [14]:
pd.testing.assert_frame_equal(
    df_top_products.reset_index(drop=True),
    pd.DataFrame({"product_name": {0: "Laneige Lip Sleeping Mask",
        1: "The Ordinary Hyaluronic Acid 2% + B5",
        2: "Laneige Lip Glowy Balm",
        3: "Chanel COCO MADEMOISELLE Eau de Parfum"},
        "num_review": {0: 12715, 1: 2274, 2: 2766, 3: 724}})
)

---

### 📌 Load data

The second part of today's lecture is all about **you**. 👻 Literally.

▶️ Run the code cell below to create a new `DataFrame` named `df_you`.

In [15]:
df_you = pd.read_csv("https://github.com/bdi475/datasets/raw/main/about-you.csv")

# Used to keep a clean copy
df_you_backup = df_you.copy()

# head() displays the first 5 rows of a DataFrame
df_you.head()

|   | name | major1 | major2 | city | country | fav_restaurant | fav_movie | has_iphone |
|---|------|--------|--------|------|---------|----------------|-----------|------------|
| 0 | Ahana Chakraborty | Statistics | Business & Informatics | Chicago | USA | Poke Lab | Shrek 2 | True |
| 1 | Andrew Rozmus | Psychology | NaN | Elmhurst | USA | NaN | NaN | True |
| 2 | Anusha Adira | Computer Engineering | Business | Cupertino | USA | Bangkok Thai | Three Idiots | True |
| 3 | Arthur Pyptyuk | Economics | Business | Hoffman Estates | USA | Sakanaya | Hereditary | True |
| 4 | Aryajit Das | Economics | Business & Global Markets plus Society | Streamwood | USA | Dubai Grill | Transformers: Age of Extinction | True |


☝️ **Hold on.** Didn't we always create `DataFrame`s using `pd.DataFrame()`?

Yes. But we can *import* existing data as a Pandas `DataFrame` using `pd.read_csv()`. There are many other similar import functions, such as `pd.read_excel()` and `pd.read_json()`. For now, we'll mostly use `pd.read_csv()`.
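
To see what `pd.read_csv()` does without downloading anything, the sketch below reads CSV text from an in-memory string instead of a URL. The CSV content and the name `df_example` are made up for this illustration; they are not part of the `about-you.csv` dataset.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file; the first line holds the column names
csv_text = """name,city,has_iphone
Ada,Chicago,True
Grace,,False
"""

# read_csv() accepts a file path, a URL, or a file-like object such as StringIO
df_example = pd.read_csv(io.StringIO(csv_text))

print(df_example)
print(df_example.shape)
```

Note that the empty `city` field in the second row is read in as a missing value rather than an empty string.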

The table below explains each column in `df_you`.

| Column Name             | Description                                               |
|-------------------------|-----------------------------------------------------------|
| name                    | Name of the person                                        |
| major1                  | Major                                                     |
| major2                  | Second major OR minor (blank if no second major or minor) |
| city                    | City the person is from                                   |
| country                 | Country the person is from                                |
| fav_restaurant          | Favorite restaurant (blank if no restaurant was given)    |
| fav_movie               | Favorite movie (blank if no movie was given)              |
| has_iphone              | Whether the person uses an iPhone                         |

---

### 📌 Concise summary of a `DataFrame`

👉 A common first step in working with a `DataFrame` is to use the `info()` method. `info()` prints a concise summary of a `DataFrame`, including:

- Index data type
- Column information: for each column, the following information is displayed:
    - Number of non-missing values
    - Data type of the column
- Memory usage

▶️ Run `df_you.info()` below to see the `info()` method in action.

In [16]:
### BEGIN SOLUTION
df_you.info()
### END SOLUTION

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45 entries, 0 to 44
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   name            45 non-null     object
 1   major1          44 non-null     object
 2   major2          36 non-null     object
 3   city            35 non-null     object
 4   country         37 non-null     object
 5   fav_restaurant  33 non-null     object
 6   fav_movie       31 non-null     object
 7   has_iphone      45 non-null     bool  
dtypes: bool(1), object(7)
memory usage: 2.6+ KB


👉 From the result of `df_you.info()`, we can learn a few things:

- There are 8 columns.
- 7 out of 8 columns have the `object` data type.
    - In Pandas, a string data type is shown as `object`, not `str`.
        - We will skip the technical discussion for now.
- The second line of the output tells us the number of rows (i.e., observations).
- Some columns contain one or more missing values.
    - Missing values are displayed as `NaN`.
    - To denote a missing value, use NumPy's `np.nan` (more on this later).
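
The non-null counts reported by `info()` can also be computed directly. Below is a minimal sketch on a small, made-up `DataFrame` (`df_example` and its values are hypothetical, not taken from `df_you`).

```python
import numpy as np
import pandas as pd

# A tiny made-up DataFrame with one missing city
df_example = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],
    "city": ["Chicago", np.nan, "Geneva"],
    "has_iphone": [True, False, True],
})

# Prints the index range, per-column non-null counts, dtypes, and memory usage
df_example.info()

# count() returns the number of non-missing values in each column
non_missing_per_column = df_example.count()

# isna().sum() returns the number of missing values in each column
missing_per_column = df_example.isna().sum()

print(non_missing_per_column)
print(missing_per_column)
```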

---

### 🎯 Challenge 5: Display first/last/random rows

▶️ Run `df_you.head()` to print the first 5 rows of `df_you`.

In [17]:
### BEGIN SOLUTION
df_you.head()
### END SOLUTION

|   | name | major1 | major2 | city | country | fav_restaurant | fav_movie | has_iphone |
|---|------|--------|--------|------|---------|----------------|-----------|------------|
| 0 | Ahana Chakraborty | Statistics | Business & Informatics | Chicago | USA | Poke Lab | Shrek 2 | True |
| 1 | Andrew Rozmus | Psychology | NaN | Elmhurst | USA | NaN | NaN | True |
| 2 | Anusha Adira | Computer Engineering | Business | Cupertino | USA | Bangkok Thai | Three Idiots | True |
| 3 | Arthur Pyptyuk | Economics | Business | Hoffman Estates | USA | Sakanaya | Hereditary | True |
| 4 | Aryajit Das | Economics | Business & Global Markets plus Society | Streamwood | USA | Dubai Grill | Transformers: Age of Extinction | True |


▶️ Run `df_you.tail(4)` to print the last 4 rows of `df_you`.

In [18]:
### BEGIN SOLUTION
df_you.tail(4)
### END SOLUTION

|    | name | major1 | major2 | city | country | fav_restaurant | fav_movie | has_iphone |
|----|------|--------|--------|------|---------|----------------|-----------|------------|
| 41 | Twinkle Yeruva | Computer Science | Business | Schaumburg | USA | Sticky Rice | Maze Runner | True |
| 42 | Valentina Flores | Economics | Business & French | Chicago | USA | NaN | Book of Life | True |
| 43 | Victoria Hernandez | Industrial Design | Spanish | East Moline | USA | Bangkok Thai | Mamma Mia or Shrek | True |
| 44 | Vikas Chavda | Economics | Business | Geneva | USA | Yogi | Kingsman: Secret Service | True |


▶️ Run `df_you.sample(3)` to print 3 randomly sampled rows from `df_you`.

In [19]:
### BEGIN SOLUTION
df_you.sample(3)
### END SOLUTION

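|    | name | major1 | major2 | city | country | fav_restaurant | fav_movie | has_iphone |
|----|------|--------|--------|------|---------|----------------|-----------|------------|
| 8  | Cole Jordan | Computer Science | Business | NaN | USA | Chipotle | Ratatouille | True |
| 21 | Julia Kevin | Bioengineering | Business | Elmhurst | USA | KoFusion | Set it Up | True |
| 38 | Spencer Sadler | Computer Science | Business | Chicago | USA | Bangkok Thai | Ratatouille | True |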
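
The three methods above can be tried side by side on a small, made-up `DataFrame`, as in the sketch below. `df_example` is hypothetical, and passing `random_state` to `sample()` is an optional extra (not used elsewhere in this notebook) that makes the random selection repeatable.

```python
import pandas as pd

# A made-up DataFrame with 10 rows (index 0 through 9)
df_example = pd.DataFrame({"value": range(10, 20)})

first_rows = df_example.head(3)   # first 3 rows
last_rows = df_example.tail(2)    # last 2 rows

# Without random_state, sample() returns different rows on each run
random_rows = df_example.sample(3, random_state=42)

print(first_rows)
print(last_rows)
print(random_rows)
```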


In [20]:
# Autograder

---

### 📌 Number of rows and columns in a `DataFrame`

👉 How many rows and columns does `df_you` have?

▶️ Run `df_you.shape` below to see the *shape* (number of rows and columns) of the `DataFrame`.

In [21]:
### BEGIN SOLUTION
df_you.shape
### END SOLUTION

(45, 8)

👉 Can you store the number of rows and columns in variables?

---

- `df_you.shape` returns a `tuple` in `(num_rows, num_cols)` format.
- What is a `tuple`? 🙀
- A `tuple` is like a `list`, except that it cannot be modified once created.

▶️ Run the code cell below to see how a `tuple` is nearly identical to a `list`.

In [22]:
# These two are nearly identical,
# The only difference is that my_tuple cannot be modified once created
my_list = [10, 20]
my_tuple = (10, 20)

print(f"my_list[1]={my_list[1]}")    # prints 20
print(f"my_tuple[1]={my_tuple[1]}")  # also prints 20

my_list[1]=20
my_tuple[1]=20
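
To see the "cannot be modified" part in action, here is a small sketch. It also shows tuple unpacking, using the `(45, 8)` shape printed earlier; the names `tuple_was_modified`, `example_shape`, `n_rows`, and `n_cols` are only illustrative.

```python
my_list = [10, 20]
my_tuple = (10, 20)

# Lists can be changed in place
my_list[1] = 99

# Tuples cannot: assigning to an element raises a TypeError
try:
    my_tuple[1] = 99
    tuple_was_modified = True
except TypeError as error:
    tuple_was_modified = False
    print(f"TypeError: {error}")

# A tuple can be unpacked into separate variables in one line
example_shape = (45, 8)
n_rows, n_cols = example_shape

print(my_list, my_tuple)
print(n_rows, n_cols)
```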


---

### 🎯 Challenge 6: Find the number of rows and columns in a `DataFrame`

#### 👇 Tasks

- ✔️ Store the number of rows in `df_you` to a new variable named `num_rows`.
- ✔️ Store the number of columns in `df_you` to a new variable named `num_cols`.
- ✔️ Use `.shape`, not `len()`.

In [23]:
### BEGIN SOLUTION
num_rows = df_you.shape[0]
num_cols = df_you.shape[1]
### END SOLUTION

print(num_rows)
print(num_cols)

45
8


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [24]:
tc.assertEqual(num_rows, len(df_you.index), f"Number of rows should be {len(df_you.index)}")
tc.assertEqual(num_cols, len(df_you.columns), f"Number of columns should be {len(df_you.columns)}")