\n",
"\n",
" \n",
"## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n",
"\n",
"Author: [Yury Kashnitsky](https://yorko.github.io). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#
Topic 1. Exploratory data analysis with Pandas\n",
"##
Practice. Analyzing \"Titanic\" passengers. Solution\n",
"\n",
"**Fill in the missing code (\"Your code here\") and choose answers in a [web-form](https://docs.google.com/forms/d/16EfhpDGPrREry0gfDQdRPjoiQX9IumaL2mPR0rcj19k/edit).**"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"pd.set_option(\"display.precision\", 2)\n",
"from matplotlib import pyplot as plt\n",
"\n",
"# Graphics in SVG format are more sharp and legible\n",
"%config InlineBackend.figure_format = 'svg'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Read data into a Pandas DataFrame**"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv(\"../../data/titanic_train.csv\", index_col=\"PassengerId\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**First 5 rows**"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Survived
\n",
"
Pclass
\n",
"
Name
\n",
"
Sex
\n",
"
Age
\n",
"
SibSp
\n",
"
Parch
\n",
"
Ticket
\n",
"
Fare
\n",
"
Cabin
\n",
"
Embarked
\n",
"
\n",
"
\n",
"
PassengerId
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
"
\n",
"
1
\n",
"
0
\n",
"
3
\n",
"
Braund, Mr. Owen Harris
\n",
"
male
\n",
"
22.0
\n",
"
1
\n",
"
0
\n",
"
A/5 21171
\n",
"
7.25
\n",
"
NaN
\n",
"
S
\n",
"
\n",
"
\n",
"
2
\n",
"
1
\n",
"
1
\n",
"
Cumings, Mrs. John Bradley (Florence Briggs Th...
\n",
"
female
\n",
"
38.0
\n",
"
1
\n",
"
0
\n",
"
PC 17599
\n",
"
71.28
\n",
"
C85
\n",
"
C
\n",
"
\n",
"
\n",
"
3
\n",
"
1
\n",
"
3
\n",
"
Heikkinen, Miss. Laina
\n",
"
female
\n",
"
26.0
\n",
"
0
\n",
"
0
\n",
"
STON/O2. 3101282
\n",
"
7.92
\n",
"
NaN
\n",
"
S
\n",
"
\n",
"
\n",
"
4
\n",
"
1
\n",
"
1
\n",
"
Futrelle, Mrs. Jacques Heath (Lily May Peel)
\n",
"
female
\n",
"
35.0
\n",
"
1
\n",
"
0
\n",
"
113803
\n",
"
53.10
\n",
"
C123
\n",
"
S
\n",
"
\n",
"
\n",
"
5
\n",
"
0
\n",
"
3
\n",
"
Allen, Mr. William Henry
\n",
"
male
\n",
"
35.0
\n",
"
0
\n",
"
0
\n",
"
373450
\n",
"
8.05
\n",
"
NaN
\n",
"
S
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"1 0 3 \n",
"2 1 1 \n",
"3 1 3 \n",
"4 1 1 \n",
"5 0 3 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"1 Braund, Mr. Owen Harris male 22.0 \n",
"2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 \n",
"3 Heikkinen, Miss. Laina female 26.0 \n",
"4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 \n",
"5 Allen, Mr. William Henry male 35.0 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"1 1 0 A/5 21171 7.25 NaN S \n",
"2 1 0 PC 17599 71.28 C85 C \n",
"3 0 0 STON/O2. 3101282 7.92 NaN S \n",
"4 1 0 113803 53.10 C123 S \n",
"5 0 0 373450 8.05 NaN S "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head(5)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"data[\"Pclass\"].hist(label=\"all\")\n",
"data[data[\"Sex\"] == \"male\"][\"Pclass\"].hist(color=\"green\", label=\"male\")\n",
"data[data[\"Sex\"] == \"female\"][\"Pclass\"].hist(color=\"yellow\", label=\"female\")\n",
"plt.title(\"Distribution by class and gender.\")\n",
"plt.xlabel(\"Pclass\")\n",
"plt.ylabel(\"Frequency\")\n",
"plt.legend(loc=\"upper left\");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**3. What are median and standard deviation of `Fare`?. Round to two decimals.**\n",
"- **median is 14.45, standard deviation is 49.69 [+]**\n",
"- median is 15.1, standard deviation is 12.15\n",
"- median is 13.15, standard deviation is 35.3\n",
"- median is 17.43, standard deviation is 39.1"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Median fare: 14.45\n",
"Fare std: 49.69\n"
]
}
],
"source": [
"print(\"Median fare: \", round(data[\"Fare\"].median(), 2))\n",
"print(\"Fare std: \", round(data[\"Fare\"].std(), 2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**4. Is that true that the mean age of survived people is higher than that of passengers who eventually died?**\n",
"- Yes\n",
"- **No [+]**"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.boxplot(data[\"Survived\"], data[\"Age\"]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Can't see the difference through eye-balling only. Let's calculate."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Survived\n",
"0 30.63\n",
"1 28.34\n",
"Name: Age, dtype: float64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.groupby(\"Survived\")[\"Age\"].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**5. Is that true that passengers younger than 30 y.o. survived more frequently than those older than 60 y.o.? What are shares of survived people among young and old people?**\n",
"- 22.7% among young and 40.6% among old\n",
"- **40.6% among young and 22.7% among old [+]**\n",
"- 35.3% among young and 27.4% among old\n",
"- 27.4% among young and 35.3% among old"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shares of survived people: \n",
"\t among young 40.6%, \n",
"\t among old 22.7%.\n"
]
}
],
"source": [
"young_survived = data.loc[data[\"Age\"] < 30, \"Survived\"]\n",
"old_survived = data.loc[data[\"Age\"] > 60, \"Survived\"]\n",
"\n",
"print(\n",
" \"Shares of survived people: \\n\\t among young {}%, \\n\\t among old {}%.\".format(\n",
" round(100 * young_survived.mean(), 1), round(100 * old_survived.mean(), 1)\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**6. Is that true that women survived more frequently than men? What are shares of survived people among men and women?**\n",
"- 30.2% among men and 46.2% among women\n",
"- 35.7% among men and 74.2% among women\n",
"- 21.1% among men and 46.2% among women\n",
"- **18.9% among men and 74.2% among women [+]**"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shares of survived people: \n",
"\t among women 74.2%, \n",
"\t among men 18.9%\n"
]
}
],
"source": [
"male_survived = data[data[\"Sex\"] == \"male\"][\"Survived\"]\n",
"female_survived = data[data[\"Sex\"] == \"female\"][\"Survived\"]\n",
"\n",
"\n",
"print(\n",
" \"Shares of survived people: \\n\\t among women {}%, \\n\\t among men {}%\".format(\n",
" round(100 * female_survived.mean(), 1), round(100 * male_survived.mean(), 1)\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**7. What's the most popular first name among male passengers?**\n",
"- Charles\n",
"- Thomas\n",
"- **William [+]**\n",
"- John"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId\n",
"1 Braund, Mr. Owen Harris\n",
"2 Cumings, Mrs. John Bradley (Florence Briggs Th...\n",
"3 Heikkinen, Miss. Laina\n",
"4 Futrelle, Mrs. Jacques Heath (Lily May Peel)\n",
"5 Allen, Mr. William Henry\n",
"Name: Name, dtype: object"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[\"Name\"].head()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Owen'"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.loc[1, \"Name\"].split(\",\")[1].split()[1]"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"William 35\n",
"John 25\n",
"George 14\n",
"Thomas 13\n",
"Charles 13\n",
"Name: Name, dtype: int64"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"first_names = data.loc[data[\"Sex\"] == \"male\", \"Name\"].apply(\n",
" lambda full_name: full_name.split(\",\")[1].split()[1]\n",
")\n",
"first_names.value_counts().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**8. How is average age for men/women dependent on `Pclass`? Choose all correct statements:**\n",
"- **On average, men of 1 class are older than 40 [+]**\n",
"- On average, women of 1 class are older than 40\n",
"- **Men of all classes are on average older than women of the same class [+]**\n",
"- ** On average, passengers ofthe first class are older than those of the 2nd class who are older than passengers of the 3rd class [+]**"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average age for male and class 3: 26.51\n",
"Average age for female and class 3: 21.75\n",
"Average age for male and class 1: 41.28\n",
"Average age for female and class 1: 34.61\n",
"Average age for male and class 2: 30.74\n",
"Average age for female and class 2: 28.72\n"
]
}
],
"source": [
"for cl in data[\"Pclass\"].unique():\n",
" for sex in data[\"Sex\"].unique():\n",
" print(\n",
" \"Average age for {0} and class {1}: {2}\".format(\n",
" sex,\n",
" cl,\n",
" round(\n",
" data[(data[\"Sex\"] == sex) & (data[\"Pclass\"] == cl)][\"Age\"].mean(), 2\n",
" ),\n",
" )\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nicer:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average age for female and class 1: 34.61\n",
"Average age for male and class 1: 41.28\n",
"Average age for female and class 2: 28.72\n",
"Average age for male and class 2: 30.74\n",
"Average age for female and class 3: 21.75\n",
"Average age for male and class 3: 26.51\n"
]
}
],
"source": [
"for (cl, sex), sub_df in data.groupby([\"Pclass\", \"Sex\"]):\n",
" print(\n",
" \"Average age for {0} and class {1}: {2}\".format(\n",
" sex, cl, round(sub_df[\"Age\"].mean(), 2)\n",
" )\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And even nicer:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"