{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "<img src=\"../../img/ods_stickers.jpg\" />\n",
    "    \n",
    "## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n",
    "\n",
    "Author: [Yury Kashnitsky](https://yorko.github.io). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <center> Topic 1. Exploratory data analysis with Pandas\n",
    "## <center>Practice. Analyzing \"Titanic\" passengers. Solution\n",
    "\n",
    "**Fill in the missing code (\"Your code here\") and choose answers in a [web-form](https://docs.google.com/forms/d/16EfhpDGPrREry0gfDQdRPjoiQX9IumaL2mPR0rcj19k/edit).**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "pd.set_option(\"display.precision\", 2)\n",
    "from matplotlib import pyplot as plt\n",
    "\n",
    "# Graphics in SVG format are more sharp and legible\n",
    "%config InlineBackend.figure_format = 'svg'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Read data into a Pandas DataFrame**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv(\"../../data/titanic_train.csv\", index_col=\"PassengerId\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**First 5 rows**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Let's select those passengers who embarked in Cherbourg (Embarked=C) and paid > 200 pounds for their ticker (fare > 200).**\n",
    "\n",
    "Make sure you understand how actually this construction works."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data[(data[\"Embarked\"] == \"C\") & (data.Fare > 200)].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**We can sort these people by Fare in descending order.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data[(data[\"Embarked\"] == \"C\") & (data[\"Fare\"] > 200)].sort_values(\n",
    "    by=\"Fare\", ascending=False\n",
    ").head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Let's create a new feature.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def age_category(age):\n",
    "    \"\"\"\n",
    "    < 30 -> 1\n",
    "    >= 30, <55 -> 2\n",
    "    >= 55 -> 3\n",
    "    \"\"\"\n",
    "    if age < 30:\n",
    "        return 1\n",
    "    elif age < 55:\n",
    "        return 2\n",
    "    elif age >= 55:\n",
    "        return 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "age_categories = [age_category(age) for age in data.Age]\n",
    "data[\"Age_category\"] = age_categories"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Another way is to do it with `apply`.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data[\"Age_category\"] = data[\"Age\"].apply(age_category)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1. How many men/women were there onboard?**\n",
    "- 412 men and 479 women\n",
    "- 314 men и 577 women\n",
    "- 479 men и 412 women\n",
    "- **<font color='green'>577 men и 314 women [+]</font>**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(data[\"Sex\"] == \"male\").sum(), (data[\"Sex\"] == \"female\").sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Easier:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data[\"Sex\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**2. Print the distribution of the `Pclass` feature. Then the same, but for men and women separately. How many men from second class were there onboard?**\n",
    "- 104\n",
    "- **<font color='green'>108 [+]</font>**\n",
    "- 112\n",
    "- 125"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.crosstab(data[\"Pclass\"], data[\"Sex\"], margins=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can plot a picture as well, though it's not necessary here. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data[\"Pclass\"].hist(label=\"all\")\n",
    "data[data[\"Sex\"] == \"male\"][\"Pclass\"].hist(color=\"green\", label=\"male\")\n",
    "data[data[\"Sex\"] == \"female\"][\"Pclass\"].hist(color=\"yellow\", label=\"female\")\n",
    "plt.title(\"Distribution by class and gender.\")\n",
    "plt.xlabel(\"Pclass\")\n",
    "plt.ylabel(\"Frequency\")\n",
    "plt.legend(loc=\"upper left\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**3. What are median and standard deviation of `Fare`?. Round to two decimals.**\n",
    "- **<font color='green'>median is  14.45, standard deviation is 49.69 [+]</font>**\n",
    "- median is 15.1, standard deviation is 12.15\n",
    "- median is 13.15, standard deviation is 35.3\n",
    "- median is  17.43, standard deviation is 39.1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Median fare: \", round(data[\"Fare\"].median(), 2))\n",
    "print(\"Fare std: \", round(data[\"Fare\"].std(), 2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**4. Is that true that the mean age of survived people is higher than that of passengers who eventually died?**\n",
    "- Yes\n",
    "- **<font color='green'>No [+]</font>**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data[data[\"Survived\"] == 1][\"Age\"].hist(\n",
    "    color=\"green\", label=\"Survived\", alpha=0.5, density=True\n",
    ")\n",
    "data[data[\"Survived\"] == 0][\"Age\"].hist(\n",
    "    color=\"red\", label=\"Died\", alpha=0.5, density=True\n",
    ")\n",
    "plt.title(\"Age for survived and died\")\n",
    "plt.xlabel(\"Years\")\n",
    "plt.ylabel(\"Frequency\")\n",
    "plt.legend();"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!pip install seaborn\n",
    "import seaborn as sns\n",
    "\n",
    "sns.set()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.boxplot(data[\"Survived\"], data[\"Age\"]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Can't see the difference through eye-balling only. Let's calculate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.groupby(\"Survived\")[\"Age\"].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**5. Is that true that passengers younger than 30 y.o. survived more frequently than those older than 60 y.o.? What are shares of survived people among young and old people?**\n",
    "- 22.7% among young and 40.6% among old\n",
    "- **<font color='green'>40.6% among young and 22.7% among old [+]</font>**\n",
    "- 35.3% among young and 27.4% among old\n",
    "- 27.4% among young and  35.3% among old"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "young_survived = data.loc[data[\"Age\"] < 30, \"Survived\"]\n",
    "old_survived = data.loc[data[\"Age\"] > 60, \"Survived\"]\n",
    "\n",
    "print(\n",
    "    \"Shares of survived people: \\n\\t  among young {}%, \\n\\t  among old {}%.\".format(\n",
    "        round(100 * young_survived.mean(), 1), round(100 * old_survived.mean(), 1)\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**6. Is that true that women survived more frequently than men? What are shares of survived people among men and women?**\n",
    "- 30.2% among men and 46.2% among women\n",
    "- 35.7% among men and 74.2% among women\n",
    "- 21.1% among men and 46.2% among women\n",
    "- **<font color='green'>18.9% among men and 74.2% among women [+]</font>**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "male_survived = data[data[\"Sex\"] == \"male\"][\"Survived\"]\n",
    "female_survived = data[data[\"Sex\"] == \"female\"][\"Survived\"]\n",
    "\n",
    "\n",
    "print(\n",
    "    \"Shares of survived people: \\n\\t among women {}%, \\n\\t among men {}%\".format(\n",
    "        round(100 * female_survived.mean(), 1), round(100 * male_survived.mean(), 1)\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**7. What's the most popular first name among male passengers?**\n",
    "- Charles\n",
    "- Thomas\n",
    "- **<font color='green'>William [+]</font>**\n",
    "- John"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data[\"Name\"].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.loc[1, \"Name\"].split(\",\")[1].split()[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "first_names = data.loc[data[\"Sex\"] == \"male\", \"Name\"].apply(\n",
    "    lambda full_name: full_name.split(\",\")[1].split()[1]\n",
    ")\n",
    "first_names.value_counts().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**8. How is average age for men/women dependent on `Pclass`? Choose all correct statements:**\n",
    "- **<font color='green'>On average, men of 1 class are older than 40 [+]</font>**\n",
    "- On average, women of 1 class are older than 40\n",
    "- **<font color='green'>Men of all classes are on average older than women of the same class [+]</font>**\n",
    "- **<font color='green'> On average, passengers ofthe first class are older than those of the 2nd class who are older than passengers of the 3rd class [+]</font>**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for cl in data[\"Pclass\"].unique():\n",
    "    for sex in data[\"Sex\"].unique():\n",
    "        print(\n",
    "            \"Average age for {0} and class {1}: {2}\".format(\n",
    "                sex,\n",
    "                cl,\n",
    "                round(\n",
    "                    data[(data[\"Sex\"] == sex) & (data[\"Pclass\"] == cl)][\"Age\"].mean(), 2\n",
    "                ),\n",
    "            )\n",
    "        )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Nicer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for (cl, sex), sub_df in data.groupby([\"Pclass\", \"Sex\"]):\n",
    "    print(\n",
    "        \"Average age for {0} and class {1}: {2}\".format(\n",
    "            sex, cl, round(sub_df[\"Age\"].mean(), 2)\n",
    "        )\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And even nicer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.crosstab(data[\"Pclass\"], data[\"Sex\"], values=data[\"Age\"], aggfunc=np.mean)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.boxplot(data[\"Pclass\"], data[\"Age\"]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Useful resources\n",
    "* The same notebook as an interactive web-based [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-1-practice-solution)\n",
    "* Topic 1 \"Exploratory Data Analysis with Pandas\" as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-1-exploratory-data-analysis-with-pandas)\n",
    "* Main course [site](https://mlcourse.ai), [course repo](https://github.com/Yorko/mlcourse.ai), and YouTube [channel](https://www.youtube.com/watch?v=QKTuw4PNOsU&list=PLVlY_7IJCMJeRfZ68eVfEcu-UcN9BbwiX)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  },
  "name": "seminar02_practice_pandas_titanic.ipynb"
 },
 "nbformat": 4,
 "nbformat_minor": 1
}