{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n", "\n", "Author: [Yury Kashnitsky](https://yorko.github.io). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#
Topic 1. Exploratory data analysis with Pandas\n", "##
Practice. Analyzing \"Titanic\" passengers. Solution\n", "\n", "**Fill in the missing code (\"Your code here\") and choose answers in a [web-form](https://docs.google.com/forms/d/16EfhpDGPrREry0gfDQdRPjoiQX9IumaL2mPR0rcj19k/edit).**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "pd.set_option(\"display.precision\", 2)\n", "from matplotlib import pyplot as plt\n", "\n", "# Graphics in SVG format are sharper and more legible\n", "%config InlineBackend.figure_format = 'svg'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Read data into a Pandas DataFrame**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv(\"../../data/titanic_train.csv\", index_col=\"PassengerId\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**First 5 rows**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Let's select the passengers who embarked in Cherbourg (Embarked=C) and paid more than 200 pounds for their ticket (Fare > 200).**\n", "\n", "Make sure you understand how this construction actually works: each comparison in parentheses yields a boolean Series, `&` combines the two element-wise, and indexing `data` with the result keeps only the rows where it is `True`."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[(data[\"Embarked\"] == \"C\") & (data.Fare > 200)].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**We can sort these people by Fare in descending order.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[(data[\"Embarked\"] == \"C\") & (data[\"Fare\"] > 200)].sort_values(\n", "    by=\"Fare\", ascending=False\n", ").head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Let's create a new feature.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def age_category(age):\n", "    \"\"\"\n", "    < 30 -> 1\n", "    >= 30, < 55 -> 2\n", "    >= 55 -> 3\n", "    missing (NaN) -> None, because every comparison with NaN is False\n", "    \"\"\"\n", "    if age < 30:\n", "        return 1\n", "    elif age < 55:\n", "        return 2\n", "    elif age >= 55:\n", "        return 3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "age_categories = [age_category(age) for age in data.Age]\n", "data[\"Age_category\"] = age_categories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Another way is to do it with `apply`.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[\"Age_category\"] = data[\"Age\"].apply(age_category)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**1. 
How many men/women were on board?**\n", "- 412 men and 479 women\n", "- 314 men and 577 women\n", "- 479 men and 412 women\n", "- **577 men and 314 women [+]**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(data[\"Sex\"] == \"male\").sum(), (data[\"Sex\"] == \"female\").sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Easier:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[\"Sex\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2. Print the distribution of the `Pclass` feature. Then do the same for men and women separately. How many second-class men were on board?**\n", "- 104\n", "- **108 [+]**\n", "- 112\n", "- 125" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.crosstab(data[\"Pclass\"], data[\"Sex\"], margins=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can plot a picture as well, though it's not necessary here. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[\"Pclass\"].hist(label=\"all\")\n", "data[data[\"Sex\"] == \"male\"][\"Pclass\"].hist(color=\"green\", label=\"male\")\n", "data[data[\"Sex\"] == \"female\"][\"Pclass\"].hist(color=\"yellow\", label=\"female\")\n", "plt.title(\"Distribution by class and gender.\")\n", "plt.xlabel(\"Pclass\")\n", "plt.ylabel(\"Frequency\")\n", "plt.legend(loc=\"upper left\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3. What are the median and standard deviation of `Fare`? 
Round to two decimals.**\n", "- **median is 14.45, standard deviation is 49.69 [+]**\n", "- median is 15.1, standard deviation is 12.15\n", "- median is 13.15, standard deviation is 35.3\n", "- median is 17.43, standard deviation is 39.1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Median fare: \", round(data[\"Fare\"].median(), 2))\n", "print(\"Fare std: \", round(data[\"Fare\"].std(), 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**4. Is it true that the mean age of survivors is higher than that of passengers who died?**\n", "- Yes\n", "- **No [+]**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[data[\"Survived\"] == 1][\"Age\"].hist(\n", "    color=\"green\", label=\"Survived\", alpha=0.5, density=True\n", ")\n", "data[data[\"Survived\"] == 0][\"Age\"].hist(\n", "    color=\"red\", label=\"Died\", alpha=0.5, density=True\n", ")\n", "plt.title(\"Age for survived and died\")\n", "plt.xlabel(\"Years\")\n", "# density=True normalizes each histogram, so the y-axis shows density, not counts\n", "plt.ylabel(\"Density\")\n", "plt.legend();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#!pip install seaborn\n", "import seaborn as sns\n", "\n", "sns.set()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Pass x and y as keywords: recent seaborn versions reject positional x/y\n", "sns.boxplot(x=data[\"Survived\"], y=data[\"Age\"]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The difference is hard to see by eye alone, so let's calculate it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.groupby(\"Survived\")[\"Age\"].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**5. Is it true that passengers younger than 30 y.o. survived more frequently than those older than 60 y.o.? 
What are the shares of survivors among young and old passengers?**\n", "- 22.7% among young and 40.6% among old\n", "- **40.6% among young and 22.7% among old [+]**\n", "- 35.3% among young and 27.4% among old\n", "- 27.4% among young and 35.3% among old" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "young_survived = data.loc[data[\"Age\"] < 30, \"Survived\"]\n", "old_survived = data.loc[data[\"Age\"] > 60, \"Survived\"]\n", "\n", "print(\n", "    \"Shares of survived people: \\n\\t among young {}%, \\n\\t among old {}%.\".format(\n", "        round(100 * young_survived.mean(), 1), round(100 * old_survived.mean(), 1)\n", "    )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**6. Is it true that women survived more frequently than men? What are the shares of survivors among men and women?**\n", "- 30.2% among men and 46.2% among women\n", "- 35.7% among men and 74.2% among women\n", "- 21.1% among men and 46.2% among women\n", "- **18.9% among men and 74.2% among women [+]**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "male_survived = data[data[\"Sex\"] == \"male\"][\"Survived\"]\n", "female_survived = data[data[\"Sex\"] == \"female\"][\"Survived\"]\n", "\n", "\n", "print(\n", "    \"Shares of survived people: \\n\\t among women {}%, \\n\\t among men {}%\".format(\n", "        round(100 * female_survived.mean(), 1), round(100 * male_survived.mean(), 1)\n", "    )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**7. 
What's the most popular first name among male passengers?**\n", "- Charles\n", "- Thomas\n", "- **William [+]**\n", "- John" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[\"Name\"].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.loc[1, \"Name\"].split(\",\")[1].split()[1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first_names = data.loc[data[\"Sex\"] == \"male\", \"Name\"].apply(\n", "    lambda full_name: full_name.split(\",\")[1].split()[1]\n", ")\n", "first_names.value_counts().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**8. How does the average age of men/women depend on `Pclass`? Choose all correct statements:**\n", "- **On average, first-class men are older than 40 [+]**\n", "- On average, first-class women are older than 40\n", "- **Men of all classes are on average older than women of the same class [+]**\n", "- **On average, passengers of the first class are older than those of the 2nd class, who are older than passengers of the 3rd class [+]**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for cl in data[\"Pclass\"].unique():\n", "    for sex in data[\"Sex\"].unique():\n", "        print(\n", "            \"Average age for {0} and class {1}: {2}\".format(\n", "                sex,\n", "                cl,\n", "                round(\n", "                    data[(data[\"Sex\"] == sex) & (data[\"Pclass\"] == cl)][\"Age\"].mean(), 2\n", "                ),\n", "            )\n", "        )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nicer:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for (cl, sex), sub_df in data.groupby([\"Pclass\", \"Sex\"]):\n", "    print(\n", "        \"Average age for {0} and class {1}: {2}\".format(\n", "            sex, cl, round(sub_df[\"Age\"].mean(), 2)\n", "        )\n", "    )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And even nicer:" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The string \"mean\" avoids the FutureWarning that recent pandas raises for np.mean\n", "pd.crosstab(data[\"Pclass\"], data[\"Sex\"], values=data[\"Age\"], aggfunc=\"mean\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Pass x and y as keywords: recent seaborn versions reject positional x/y\n", "sns.boxplot(x=data[\"Pclass\"], y=data[\"Age\"]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Useful resources\n", "* The same notebook as an interactive web-based [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-1-practice-solution)\n", "* Topic 1 \"Exploratory Data Analysis with Pandas\" as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-1-exploratory-data-analysis-with-pandas)\n", "* Main course [site](https://mlcourse.ai), [course repo](https://github.com/Yorko/mlcourse.ai), and YouTube [channel](https://www.youtube.com/watch?v=QKTuw4PNOsU&list=PLVlY_7IJCMJeRfZ68eVfEcu-UcN9BbwiX)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" }, "name": "seminar02_practice_pandas_titanic.ipynb" }, "nbformat": 4, "nbformat_minor": 1 }