\n",
"\n",
" \n",
"## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n",
"\n",
"Author: [Yury Kashnitsky](https://yorko.github.io). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#
Topic 1. Exploratory data analysis with Pandas\n",
"##
Practice. Analyzing \"Titanic\" passengers\n",
"\n",
"**Fill in the missing code (\"You code here\") and choose answers in a [web-form](https://docs.google.com/forms/d/16EfhpDGPrREry0gfDQdRPjoiQX9IumaL2mPR0rcj19k/edit).**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from matplotlib import pyplot as plt\n",
"\n",
"# Graphics in SVG format are more sharp and legible\n",
"%config InlineBackend.figure_format = 'svg'\n",
"pd.set_option(\"display.precision\", 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Read data into a Pandas DataFrame**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv(\"../../data/titanic_train.csv\", index_col=\"PassengerId\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**First 5 rows**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's select those passengers who embarked in Cherbourg (Embarked=C) and paid > 200 pounds for their ticker (fare > 200).**\n",
"\n",
"Make sure you understand how actually this construction works."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data[(data[\"Embarked\"] == \"C\") & (data.Fare > 200)].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We can sort these people by Fare in descending order.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data[(data[\"Embarked\"] == \"C\") & (data[\"Fare\"] > 200)].sort_values(\n",
" by=\"Fare\", ascending=False\n",
").head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's create a new feature.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def age_category(age):\n",
" \"\"\"\n",
" < 30 -> 1\n",
" >= 30, <55 -> 2\n",
" >= 55 -> 3\n",
" \"\"\"\n",
" if age < 30:\n",
" return 1\n",
" elif age < 55:\n",
" return 2\n",
" elif age >= 55:\n",
" return 3"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"age_categories = [age_category(age) for age in data.Age]\n",
"data[\"Age_category\"] = age_categories"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Another way is to do it with `apply`.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data[\"Age_category\"] = data[\"Age\"].apply(age_category)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**1. How many men/women were there onboard?**\n",
"- 412 men and 479 women\n",
"- 314 men and 577 women\n",
"- 479 men and 412 women\n",
"- 577 men and 314 women"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**2. Print the distribution of the `Pclass` feature. Then the same, but for men and women separately. How many men from second class were there onboard?**\n",
"- 104\n",
"- 108\n",
"- 112\n",
"- 125"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**3. What are median and standard deviation of `Fare`?. Round to two decimals.**\n",
"- median is 14.45, standard deviation is 49.69\n",
"- median is 15.1, standard deviation is 12.15\n",
"- median is 13.15, standard deviation is 35.3\n",
"- median is 17.43, standard deviation is 39.1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**4. Is that true that the mean age of survived people is higher than that of passengers who eventually died?**\n",
"- Yes\n",
"- No\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**5. Is that true that passengers younger than 30 y.o. survived more frequently than those older than 60 y.o.? What are shares of survived people among young and old people?**\n",
"- 22.7% among young and 40.6% among old\n",
"- 40.6% among young and 22.7% among old\n",
"- 35.3% among young and 27.4% among old\n",
"- 27.4% among young and 35.3% among old"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**6. Is that true that women survived more frequently than men? What are shares of survived people among men and women?**\n",
"- 30.2% among men and 46.2% among women\n",
"- 35.7% among men and 74.2% among women\n",
"- 21.1% among men and 46.2% among women\n",
"- 18.9% among men and 74.2% among women"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**7. What's the most popular first name among male passengers?**\n",
"- Charles\n",
"- Thomas\n",
"- William\n",
"- John"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**8. How is average age for men/women dependent on `Pclass`? Choose all correct statements:**\n",
"- On average, men of 1 class are older than 40\n",
"- On average, women of 1 class are older than 40\n",
"- Men of all classes are on average older than women of the same class\n",
"- On average, passengers ofthe first class are older than those of the 2nd class who are older than passengers of the 3rd class"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Useful resources\n",
"* The same notebook as an interactive web-based [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-1-practice-analyzing-titanic-passengers) with a [solution](https://www.kaggle.com/kashnitsky/topic-1-practice-solution)\n",
"* Topic 1 \"Exploratory Data Analysis with Pandas\" as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-1-exploratory-data-analysis-with-pandas)\n",
"* Main course [site](https://mlcourse.ai), [course repo](https://github.com/Yorko/mlcourse.ai), and YouTube [channel](https://www.youtube.com/watch?v=QKTuw4PNOsU&list=PLVlY_7IJCMJeRfZ68eVfEcu-UcN9BbwiX)"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
},
"name": "seminar02_practice_pandas_titanic.ipynb"
},
"nbformat": 4,
"nbformat_minor": 1
}