{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Ch. 9 - Feature engineering & data preparation\n", "In this chapter we will prepare the data we got from the bank marketing campaign. We will examine it closely, clean it up and make it ready for our neural network." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing our tools\n", "Before starting out on the actual analysis, it makes sense to prepare the tools of the trade. We will use four libraries:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# As always, first some libraries\n", "# NumPy handles matrices\n", "import numpy as np\n", "# Pandas handles data\n", "import pandas as pd\n", "# Matplotlib is a plotting library\n", "import matplotlib.pyplot as plt\n", "# Set matplotlib to render plots immediately, inline in the notebook\n", "%matplotlib inline\n", "# Seaborn is a plotting library built on top of matplotlib that can handle some more advanced plotting\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make our charts look nice, we will use the FiveThirtyEight color scheme. [FiveThirtyEight](http://fivethirtyeight.com/) is a quantitative journalism website with a distinctive, easy-to-read chart style." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Define colors for seaborn\n", "five_thirty_eight = [\n", " \"#30a2da\",\n", " \"#fc4f30\",\n", " \"#e5ae38\",\n", " \"#6d904f\",\n", " \"#8b8b8b\",\n", "]\n", "# Tell seaborn to use the 538 colors\n", "sns.set(palette=five_thirty_eight)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The data\n", "The data is taken from [Moro et al., 2014](https://archive.ics.uci.edu/ml/datasets/bank+marketing) via the UCI Machine Learning Repository. The balanced version we are working with is included in the GitHub repository." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Load data with pandas, using the first CSV column as the row index\n", "df = pd.read_csv('balanced_bank.csv', index_col=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting an overview\n", "The first step in data preparation is to check what we are actually working with. After we have surveyed the dataset as a whole, we will look at the individual features in it. To start, we can use the pandas ```head()``` function to display the first few rows of the dataset for a manual overview." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34579 | 35 | admin. | single | university.degree | no | yes | no | cellular | may | thu | ... | 1 | 999 | 1 | failure | -1.8 | 92.893 | -46.2 | 1.266 | 5099.1 | no |
| 446 | 42 | technician | married | professional.course | no | no | no | telephone | may | tue | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | yes |
| 20173 | 36 | admin. | married | university.degree | no | no | no | cellular | aug | mon | ... | 2 | 999 | 0 | nonexistent | 1.4 | 93.444 | -36.1 | 4.965 | 5228.1 | yes |
| 18171 | 37 | admin. | married | high.school | no | yes | yes | telephone | jul | wed | ... | 2 | 999 | 0 | nonexistent | 1.4 | 93.918 | -42.7 | 4.963 | 5228.1 | yes |
| 30128 | 31 | management | single | university.degree | no | yes | no | cellular | apr | thu | ... | 1 | 999 | 0 | nonexistent | -1.8 | 93.075 | -47.1 | 1.365 | 5099.1 | no |

5 rows × 21 columns
\n", "