{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n", "\n", "Author: [Yury Kashnitsky](https://yorko.github.io). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#
Topic 1. Exploratory data analysis with Pandas\n", "##
Practice. Analyzing \"Titanic\" passengers\n", "\n", "**Fill in the missing code (\"Your code here\") and choose answers in a [web form](https://docs.google.com/forms/d/16EfhpDGPrREry0gfDQdRPjoiQX9IumaL2mPR0rcj19k/edit).**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from matplotlib import pyplot as plt\n", "\n", "# SVG graphics are sharper and more legible\n", "%config InlineBackend.figure_format = 'svg'\n", "pd.set_option(\"display.precision\", 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Read data into a Pandas DataFrame**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv(\"../../data/titanic_train.csv\", index_col=\"PassengerId\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**First 5 rows**" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "3 1 3 \n", "4 1 1 \n", "5 0 3 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund, Mr. Owen Harris male 22.0 \n", "2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 \n", "3 Heikkinen, Miss. Laina female 26.0 \n", "4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "5 Allen, Mr. William Henry male 35.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 1 0 A/5 21171 7.25 NaN S \n", "2 1 0 PC 17599 71.28 C85 C \n", "3 0 0 STON/O2. 3101282 7.92 NaN S \n", "4 1 0 113803 53.10 C123 S \n", "5 0 0 373450 8.05 NaN S " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head(5)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Pclass Age SibSp Parch Fare\n", "count 891.00 891.00 714.00 891.00 891.00 891.00\n", "mean 0.38 2.31 29.70 0.52 0.38 32.20\n", "std 0.49 0.84 14.53 1.10 0.81 49.69\n", "min 0.00 1.00 0.42 0.00 0.00 0.00\n", "25% 0.00 2.00 20.12 0.00 0.00 7.91\n", "50% 0.00 3.00 28.00 0.00 0.00 14.45\n", "75% 1.00 3.00 38.00 1.00 0.00 31.00\n", "max 1.00 3.00 80.00 8.00 6.00 512.33" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Let's select the passengers who embarked at Cherbourg (Embarked = C) and paid more than 200 pounds for their ticket (Fare > 200).**\n", "\n", "Make sure you understand how this construction actually works: each comparison yields a boolean Series, and `&` combines the two element-wise into a mask that selects the matching rows. The parentheses are required because `&` binds more tightly than `==` and `>`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "119 0 1 \n", "259 1 1 \n", "300 1 1 \n", "312 1 1 \n", "378 0 1 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "119 Baxter, Mr. Quigg Edmond male 24.0 \n", "259 Ward, Miss. Anna female 35.0 \n", "300 Baxter, Mrs. James (Helene DeLaudeniere Chaput) female 50.0 \n", "312 Ryerson, Miss. Emily Borie female 18.0 \n", "378 Widener, Mr. Harry Elkins male 27.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "119 0 1 PC 17558 247.52 B58 B60 C \n", "259 0 0 PC 17755 512.33 NaN C \n", "300 0 1 PC 17558 247.52 B58 B60 C \n", "312 2 2 PC 17608 262.38 B57 B59 B63 B66 C \n", "378 0 2 113503 211.50 C82 C " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[(data[\"Embarked\"] == \"C\") & (data.Fare > 200)].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**We can sort these people by Fare in descending order.**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Pclass Name Sex \\\n", "PassengerId \n", "259 1 1 Ward, Miss. Anna female \n", "680 1 1 Cardeza, Mr. Thomas Drake Martinez male \n", "738 1 1 Lesurer, Mr. Gustave J male \n", "312 1 1 Ryerson, Miss. Emily Borie female \n", "743 1 1 Ryerson, Miss. Susan Parker \"Suzette\" female \n", "\n", " Age SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "259 35.0 0 0 PC 17755 512.33 NaN C \n", "680 36.0 0 1 PC 17755 512.33 B51 B53 B55 C \n", "738 35.0 0 0 PC 17755 512.33 B101 C \n", "312 18.0 2 2 PC 17608 262.38 B57 B59 B63 B66 C \n", "743 21.0 2 2 PC 17608 262.38 B57 B59 B63 B66 C " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[(data[\"Embarked\"] == \"C\") & (data[\"Fare\"] > 200)].sort_values(\n", " by=\"Fare\", ascending=False\n", ").head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Let's create a new feature.**" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def age_category(age):\n", " \"\"\"\n", " < 30 -> 1\n", " >= 30, <55 -> 2\n", " >= 55 -> 3\n", " \"\"\"\n", " if age < 30:\n", " return 1\n", " elif age < 55:\n", " return 2\n", " elif age >= 55:\n", " return 3" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "age_categories = [age_category(age) for age in data.Age]\n", "data[\"Age_category\"] = age_categories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Another way is to do it with `apply`.**" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "data[\"Age_category\"] = data[\"Age\"].apply(age_category)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**1. 
How many men and women were on board?**\n", "- 412 men and 479 women\n", "- 314 men and 577 women\n", "- 479 men and 412 women\n", "- 577 men and 314 women" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2. Print the distribution of the `Pclass` feature. Then do the same for men and women separately. How many second-class men were on board?**\n", "- 104\n", "- 108\n", "- 112\n", "- 125" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3. What are the median and standard deviation of `Fare`? Round to two decimal places.**\n", "- median is 14.45, standard deviation is 49.69\n", "- median is 15.1, standard deviation is 12.15\n", "- median is 13.15, standard deviation is 35.3\n", "- median is 17.43, standard deviation is 39.1" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**4. Is it true that the mean age of survivors is higher than that of passengers who died?**\n", "- Yes\n", "- No\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**5. Is it true that passengers younger than 30 survived more frequently than those older than 60? 
What are the shares of survivors among young and old passengers?**\n", "- 22.7% among young and 40.6% among old\n", "- 40.6% among young and 22.7% among old\n", "- 35.3% among young and 27.4% among old\n", "- 27.4% among young and 35.3% among old" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**6. Is it true that women survived more frequently than men? What are the shares of survivors among men and women?**\n", "- 30.2% among men and 46.2% among women\n", "- 35.7% among men and 74.2% among women\n", "- 21.1% among men and 46.2% among women\n", "- 18.9% among men and 74.2% among women" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**7. What is the most popular first name among male passengers?**\n", "- Charles\n", "- Thomas\n", "- William\n", "- John" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**8. How does the average age of men and women depend on `Pclass`? 
Choose all correct statements:**\n", "- On average, first-class men are older than 40\n", "- On average, first-class women are older than 40\n", "- Men of all classes are on average older than women of the same class\n", "- On average, first-class passengers are older than second-class passengers, who are older than third-class passengers" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Useful resources\n", "* The same notebook as an interactive web-based [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-1-practice-analyzing-titanic-passengers) with a [solution](https://www.kaggle.com/kashnitsky/topic-1-practice-solution)\n", "* Topic 1 \"Exploratory Data Analysis with Pandas\" as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-1-exploratory-data-analysis-with-pandas)\n", "* Main course [site](https://mlcourse.ai), [course repo](https://github.com/Yorko/mlcourse.ai), and YouTube [channel](https://www.youtube.com/watch?v=QKTuw4PNOsU&list=PLVlY_7IJCMJeRfZ68eVfEcu-UcN9BbwiX)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" }, "name": "seminar02_practice_pandas_titanic.ipynb" }, "nbformat": 4, "nbformat_minor": 1 }