{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n", "\n", "Author: [Yury Kashnitsky](https://yorko.github.io). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#
Topic 1. Exploratory data analysis with Pandas\n", "##
Practice. Analyzing \"Titanic\" passengers\n", "\n", "**Fill in the missing code (\"You code here\") and choose answers in a [web form](https://docs.google.com/forms/d/16EfhpDGPrREry0gfDQdRPjoiQX9IumaL2mPR0rcj19k/edit).**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from matplotlib import pyplot as plt\n", "\n", "# Graphics in SVG format are sharper and more legible\n", "%config InlineBackend.figure_format = 'svg'\n", "pd.set_option(\"display.precision\", 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Read data into a Pandas DataFrame**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv(\"../../data/titanic_train.csv\", index_col=\"PassengerId\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**First 5 rows**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Let's select those passengers who embarked in Cherbourg (Embarked=C) and paid more than 200 pounds for their ticket (Fare > 200).**\n", "\n", "Make sure you understand how this construction actually works: each comparison yields a boolean Series, `&` combines them element-wise, and indexing with the result keeps only the rows where it is `True`."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[(data[\"Embarked\"] == \"C\") & (data.Fare > 200)].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**We can sort these people by Fare in descending order.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[(data[\"Embarked\"] == \"C\") & (data[\"Fare\"] > 200)].sort_values(\n", "    by=\"Fare\", ascending=False\n", ").head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Let's create a new feature.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def age_category(age):\n", "    \"\"\"\n", "    < 30 -> 1\n", "    >= 30, <55 -> 2\n", "    >= 55 -> 3\n", "    \"\"\"\n", "    if age < 30:\n", "        return 1\n", "    elif age < 55:\n", "        return 2\n", "    elif age >= 55:\n", "        return 3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "age_categories = [age_category(age) for age in data.Age]\n", "data[\"Age_category\"] = age_categories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Another way is to do it with `apply`.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[\"Age_category\"] = data[\"Age\"].apply(age_category)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**1. How many men/women were on board?**\n", "- 412 men and 479 women\n", "- 314 men and 577 women\n", "- 479 men and 412 women\n", "- 577 men and 314 women" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2. Print the distribution of the `Pclass` feature. Then the same, but for men and women separately. 
How many men from second class were on board?**\n", "- 104\n", "- 108\n", "- 112\n", "- 125" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3. What are the median and standard deviation of `Fare`? Round to two decimals.**\n", "- median is 14.45, standard deviation is 49.69\n", "- median is 15.1, standard deviation is 12.15\n", "- median is 13.15, standard deviation is 35.3\n", "- median is 17.43, standard deviation is 39.1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**4. Is it true that the mean age of survivors is higher than that of passengers who eventually died?**\n", "- Yes\n", "- No\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**5. Is it true that passengers younger than 30 survived more frequently than those older than 60? What are the shares of survivors among young and old passengers?**\n", "- 22.7% among young and 40.6% among old\n", "- 40.6% among young and 22.7% among old\n", "- 35.3% among young and 27.4% among old\n", "- 27.4% among young and 35.3% among old" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**6. Is it true that women survived more frequently than men? 
What are the shares of survivors among men and women?**\n", "- 30.2% among men and 46.2% among women\n", "- 35.7% among men and 74.2% among women\n", "- 21.1% among men and 46.2% among women\n", "- 18.9% among men and 74.2% among women" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**7. What's the most popular first name among male passengers?**\n", "- Charles\n", "- Thomas\n", "- William\n", "- John" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**8. How does the average age of men/women depend on `Pclass`? Choose all correct statements:**\n", "- On average, men in the 1st class are older than 40\n", "- On average, women in the 1st class are older than 40\n", "- Men of all classes are on average older than women of the same class\n", "- On average, passengers of the 1st class are older than those of the 2nd class, who are older than passengers of the 3rd class" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Useful resources\n", "* The same notebook as an interactive web-based [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-1-practice-analyzing-titanic-passengers) with a [solution](https://www.kaggle.com/kashnitsky/topic-1-practice-solution)\n", "* Topic 1 \"Exploratory Data Analysis with Pandas\" as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-1-exploratory-data-analysis-with-pandas)\n", "* Main course [site](https://mlcourse.ai), [course repo](https://github.com/Yorko/mlcourse.ai), and YouTube [channel](https://www.youtube.com/watch?v=QKTuw4PNOsU&list=PLVlY_7IJCMJeRfZ68eVfEcu-UcN9BbwiX)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", 
"language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" }, "name": "seminar02_practice_pandas_titanic.ipynb" }, "nbformat": 4, "nbformat_minor": 1 }