<center>
<img src="../../img/ods_stickers.jpg" />
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 

Author: [Yury Kashnitsky](https://yorko.github.io). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

# <center> Topic 1. Exploratory data analysis with Pandas
## <center>Practice. Analyzing "Titanic" passengers. Solution

**Fill in the missing code ("Your code here") and choose answers in a [web-form](https://docs.google.com/forms/d/16EfhpDGPrREry0gfDQdRPjoiQX9IumaL2mPR0rcj19k/edit).**

In [None]:
import numpy as np
import pandas as pd

pd.set_option("display.precision", 2)
from matplotlib import pyplot as plt

# Graphics in SVG format are sharper and more legible
%config InlineBackend.figure_format = 'svg'

**Read data into a Pandas DataFrame**

In [None]:
data = pd.read_csv("../../data/titanic_train.csv", index_col="PassengerId")

**First 5 rows**

In [None]:
data.head(5)

In [None]:
data.describe()

**Let's select the passengers who embarked in Cherbourg (`Embarked == "C"`) and paid more than 200 pounds for their ticket (`Fare > 200`).**

Make sure you understand how this construction actually works.

In [None]:
data[(data["Embarked"] == "C") & (data.Fare > 200)].head()
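To see how the construction works, here is a minimal sketch on a tiny made-up table (the `toy` frame and its values are hypothetical, not taken from the Titanic data). Each comparison produces a boolean Series, `&` combines the two element-wise, and indexing with the result keeps only the rows where the mask is `True`. The parentheses matter because `&` binds more tightly than `==` and `>`.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate boolean masks
toy = pd.DataFrame(
    {"Embarked": ["C", "S", "C", "Q"], "Fare": [250.0, 300.0, 50.0, 10.0]}
)

mask_port = toy["Embarked"] == "C"  # boolean Series: True where the port is C
mask_fare = toy["Fare"] > 200  # boolean Series: True where the fare exceeds 200
mask = mask_port & mask_fare  # element-wise logical AND of the two masks

selected = toy[mask]  # keeps only the rows where the mask is True
print(mask.tolist())
print(selected)
```

Only the first row satisfies both conditions, so `selected` contains that single row.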

**We can sort these people by Fare in descending order.**

In [None]:
data[(data["Embarked"] == "C") & (data["Fare"] > 200)].sort_values(
    by="Fare", ascending=False
).head()

**Let's create a new feature.**

In [None]:
def age_category(age):
    """
    < 30 -> 1
    >= 30, <55 -> 2
    >= 55 -> 3
    """
    if age < 30:
        return 1
    elif age < 55:
        return 2
    elif age >= 55:
        return 3

In [None]:
age_categories = [age_category(age) for age in data.Age]
data["Age_category"] = age_categories

**Another way is to do it with `apply`.**

In [None]:
data["Age_category"] = data["Age"].apply(age_category)
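A third option, sketched here as an alternative rather than as part of the original solution, is `pd.cut`, which bins a numeric column in one call. The `ages` Series below is hypothetical. Note one difference: `age_category` returns `None` for a missing age (every comparison with `NaN` is false), while `pd.cut` leaves missing ages as `NaN` and returns a categorical column.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([10, 30, 54.9, 55, np.nan])

# right=False makes the bins [-inf, 30), [30, 55), [55, inf),
# matching the thresholds used in age_category
age_cat = pd.cut(
    ages, bins=[-np.inf, 30, 55, np.inf], labels=[1, 2, 3], right=False
)
print(age_cat.tolist())
```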

**1. How many men/women were there on board?**
- 412 men and 479 women
- 314 men and 577 women
- 479 men and 412 women
- **<font color='green'>577 men and 314 women [+]</font>**

In [None]:
(data["Sex"] == "male").sum(), (data["Sex"] == "female").sum()

**Easier:**

In [None]:
data["Sex"].value_counts()

**2. Print the distribution of the `Pclass` feature. Then do the same for men and women separately. How many second-class men were there on board?**
- 104
- **<font color='green'>108 [+]</font>**
- 112
- 125

In [None]:
pd.crosstab(data["Pclass"], data["Sex"], margins=True)
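The same kind of count table can also be built with `groupby`, which may make it clearer what `crosstab` computes. This is a minimal sketch on hypothetical toy data (the `toy` frame is not the Titanic dataset): group by both columns, count rows per group, then move one grouping level into the columns.

```python
import pandas as pd

# Hypothetical toy data standing in for the Pclass and Sex columns
toy = pd.DataFrame(
    {
        "Pclass": [1, 2, 2, 3, 3, 3],
        "Sex": ["male", "male", "female", "male", "male", "female"],
    }
)

# Count rows per (Pclass, Sex) pair, then pivot Sex into columns;
# fill_value=0 covers combinations that never occur
counts = toy.groupby(["Pclass", "Sex"]).size().unstack(fill_value=0)
print(counts)

# A single cell is read with .loc[row_label, column_label]
print(counts.loc[3, "male"])
```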

We can also plot the distributions, though it is not necessary here.

In [None]:
data["Pclass"].hist(label="all")
data[data["Sex"] == "male"]["Pclass"].hist(color="green", label="male")
data[data["Sex"] == "female"]["Pclass"].hist(color="yellow", label="female")
plt.title("Distribution by class and gender.")
plt.xlabel("Pclass")
plt.ylabel("Frequency")
plt.legend(loc="upper left");

**3. What are the median and standard deviation of `Fare`? Round to two decimals.**
- **<font color='green'>median is  14.45, standard deviation is 49.69 [+]</font>**
- median is 15.1, standard deviation is 12.15
- median is 13.15, standard deviation is 35.3
- median is  17.43, standard deviation is 39.1

In [None]:
print("Median fare: ", round(data["Fare"].median(), 2))
print("Fare std: ", round(data["Fare"].std(), 2))

**4. Is it true that the mean age of survivors is higher than that of passengers who eventually died?**
- Yes
- **<font color='green'>No [+]</font>**

In [None]:
data[data["Survived"] == 1]["Age"].hist(
    color="green", label="Survived", alpha=0.5, density=True
)
data[data["Survived"] == 0]["Age"].hist(
    color="red", label="Died", alpha=0.5, density=True
)
plt.title("Age for survived and died")
plt.xlabel("Years")
plt.ylabel("Density")
plt.legend();

In [None]:
#!pip install seaborn
import seaborn as sns

sns.set()

In [None]:
# Pass x and y by keyword: recent seaborn versions no longer accept them positionally
sns.boxplot(x=data["Survived"], y=data["Age"]);

The difference is hard to see by eye alone, so let's calculate it.

In [None]:
data.groupby("Survived")["Age"].mean()

**5. Is it true that passengers younger than 30 y.o. survived more frequently than those older than 60 y.o.? What are the shares of survivors among young and old people?**
- 22.7% among young and 40.6% among old
- **<font color='green'>40.6% among young and 22.7% among old [+]</font>**
- 35.3% among young and 27.4% among old
- 27.4% among young and  35.3% among old

In [None]:
young_survived = data.loc[data["Age"] < 30, "Survived"]
old_survived = data.loc[data["Age"] > 60, "Survived"]

print(
    "Shares of survived people: \n\t  among young {}%, \n\t  among old {}%.".format(
        round(100 * young_survived.mean(), 1), round(100 * old_survived.mean(), 1)
    )
)

**6. Is it true that women survived more frequently than men? What are the shares of survivors among men and women?**
- 30.2% among men and 46.2% among women
- 35.7% among men and 74.2% among women
- 21.1% among men and 46.2% among women
- **<font color='green'>18.9% among men and 74.2% among women [+]</font>**

In [None]:
male_survived = data[data["Sex"] == "male"]["Survived"]
female_survived = data[data["Sex"] == "female"]["Survived"]


print(
    "Shares of survived people: \n\t among women {}%, \n\t among men {}%".format(
        round(100 * female_survived.mean(), 1), round(100 * male_survived.mean(), 1)
    )
)
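Both questions above rely on the fact that the mean of a 0/1 column equals the share of ones. Under that observation, the per-group shares can be obtained in a single `groupby` call. The sketch below uses hypothetical toy data rather than the Titanic table.

```python
import pandas as pd

# Hypothetical toy data: 4 men (1 survived) and 2 women (1 survived)
toy = pd.DataFrame(
    {
        "Sex": ["male", "male", "male", "male", "female", "female"],
        "Survived": [0, 0, 0, 1, 1, 0],
    }
)

# The mean of a binary indicator is the fraction of ones in each group
shares = (100 * toy.groupby("Sex")["Survived"].mean()).round(1)
print(shares)
```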

**7. What's the most popular first name among male passengers?**
- Charles
- Thomas
- **<font color='green'>William [+]</font>**
- John

In [None]:
data["Name"].head()

In [None]:
data.loc[1, "Name"].split(",")[1].split()[1]

In [None]:
first_names = data.loc[data["Sex"] == "male", "Name"].apply(
    lambda full_name: full_name.split(",")[1].split()[1]
)
first_names.value_counts().head()
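The same extraction can be written with the vectorized `.str` accessor instead of `apply` with a lambda. This is a sketch on hypothetical names that follow the "Surname, Title. First ..." pattern assumed by the code above. Names that do not fit the pattern would yield `NaN` here instead of raising an `IndexError`.

```python
import pandas as pd

# Hypothetical names in the "Surname, Title. First ..." format
names = pd.Series(
    ["Doe, Mr. John Adam", "Roe, Mr. William", "Poe, Mr. William Henry"]
)

# Split on the comma, take the part after it, split on whitespace,
# and take the second token (the first one is the title)
first_names = names.str.split(",").str[1].str.split().str[1]
print(first_names.value_counts())
```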

**8. How does the average age of men/women depend on `Pclass`? Choose all correct statements:**
- **<font color='green'>On average, first-class men are older than 40 [+]</font>**
- On average, first-class women are older than 40
- **<font color='green'>Men of all classes are on average older than women of the same class [+]</font>**
- **<font color='green'>On average, first-class passengers are older than second-class passengers, who are in turn older than third-class passengers [+]</font>**

In [None]:
for cl in data["Pclass"].unique():
    for sex in data["Sex"].unique():
        print(
            "Average age for {0} and class {1}: {2}".format(
                sex,
                cl,
                round(
                    data[(data["Sex"] == sex) & (data["Pclass"] == cl)]["Age"].mean(), 2
                ),
            )
        )

Nicer:

In [None]:
for (cl, sex), sub_df in data.groupby(["Pclass", "Sex"]):
    print(
        "Average age for {0} and class {1}: {2}".format(
            sex, cl, round(sub_df["Age"].mean(), 2)
        )
    )

And even nicer:

In [None]:
# aggfunc="mean" avoids the FutureWarning that recent pandas versions raise for np.mean
pd.crosstab(data["Pclass"], data["Sex"], values=data["Age"], aggfunc="mean")
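An equivalent table can be produced with `pivot_table`, which takes column names rather than Series. The sketch below runs on hypothetical toy data (the `toy` frame and its ages are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data standing in for Pclass, Sex and Age
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 1, 2, 2],
        "Sex": ["male", "male", "female", "male", "female"],
        "Age": [40.0, 50.0, 30.0, 35.0, 25.0],
    }
)

# Rows are classes, columns are sexes, cells hold the mean age
table = toy.pivot_table(values="Age", index="Pclass", columns="Sex", aggfunc="mean")
print(table)
```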

In [None]:
# Pass x and y by keyword: recent seaborn versions no longer accept them positionally
sns.boxplot(x=data["Pclass"], y=data["Age"]);

## Useful resources
* The same notebook as an interactive web-based [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-1-practice-solution)
* Topic 1 "Exploratory Data Analysis with Pandas" as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-1-exploratory-data-analysis-with-pandas)
* Main course [site](https://mlcourse.ai), [course repo](https://github.com/Yorko/mlcourse.ai), and YouTube [channel](https://www.youtube.com/watch?v=QKTuw4PNOsU&list=PLVlY_7IJCMJeRfZ68eVfEcu-UcN9BbwiX)