\n",
"\n",
" \n",
"## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n",
"\n",
"Author: [Yury Kashnitsky](https://yorko.github.io). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#
Topic 2. Visual data analysis\n",
"##
Practice. Analyzing \"Titanic\" passengers\n",
"\n",
"**Fill in the missing code (\"You code here\"). No need to select answers in a webform.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Competition Kaggle \"Titanic: Machine Learning from Disaster\".**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"\n",
"sns.set()\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Read data**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df = pd.read_csv(\"../../data/titanic_train.csv\", index_col=\"PassengerId\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df.head(2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df.describe(include=\"all\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's drop`Cabin`, and then – all rows with missing values.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df = train_df.drop(\"Cabin\", axis=1).dropna()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**1. Build a picture to visualize all scatter plots for each pair of features `Age`, `Fare`, `SibSp`, `Parch` and `Survived`. ( `scatter_matrix ` from Pandas or `pairplot` from Seaborn)**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**2. How does ticket price (`Fare`) depend on `Pclass`? Build a boxplot.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**3. Let's build the same plot but restricting values of `Fare` to be less than 95% quantile of the initial vector (to drop outliers that make the plot less clear).**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**4. How is the percentage of surviving passengers dependent on passengers' gender? Depict it with `Seaborn.countplot` using the `hue` argument.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**5. How does the distribution of ticket prices differ for those who survived and those who didn't. Depict it with `Seaborn.boxplot`**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**6. How does survival depend on passengers' age? Verify (graphically) an assumption that youngsters (< 30 y.o.) survived more frequently than old people (> 55 y.o.).**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Useful resources\n",
"* The same notebook as an interactive web-based [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-2-practice-visualization) with a [solution](https://www.kaggle.com/kashnitsky/topic-2-practice-solution)\n",
"* Topic 2 \"Visual data analysis in Python\" as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-2-visual-data-analysis-in-python)\n",
"* Main course [site](https://mlcourse.ai), [course repo](https://github.com/Yorko/mlcourse.ai), and YouTube [channel](https://www.youtube.com/watch?v=QKTuw4PNOsU&list=PLVlY_7IJCMJeRfZ68eVfEcu-UcN9BbwiX)"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
},
"name": "seminar13_optional_practice_trees_titanic.ipynb"
},
"nbformat": 4,
"nbformat_minor": 1
}