## Exploratory Data Analysis and Visualization

The goal of __exploratory data analysis (EDA)__ is to explore attributes across multiple entities to decide what statistical or machine learning techniques to apply to the data. Visualizations are used to assist in understanding the data.

In [15]:
# loads the pandas library 
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  # Ignore Pandas future warnings

# creates data frame named df by reading in the Baltimore csv
df = pd.read_csv("manipulated_baltimore_data.csv")
df.head(n=3)



| Unnamed: 0.2 | Unnamed: 0 | Unnamed: 0.1 | Form | State | Security_Grade | Area_Number | Terrain_Description | Favorable_Influences | Detrimental_Influences | INHABITANTS_Type | ... | max_annual_income | terrain_rolling | white_collar | mixture_or_jewish | professional | business_or_executive | laborer | clerks | mechanics | industrial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | NS FORM-8 6-1-37 | Maryland | A | 1 | undulating | Very nicely planned residential area of medium... | No | executives professional men | ... | 5000.0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | NS FORM-8 6-1-37 | Maryland | A | 2 | rolling | Fairly new suburban area of homogeneous charac... | No | substantial middle class | ... | 5000.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 2 | NS FORM-8 6-1-37 | Maryland | A | 3 | rolling | Good residential area. Well planned. | Distance to City | executives professional men | ... | 7000.0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |


The `.describe()` method summarizes a data frame column. If `max_building_age` is stored with dtype `object`, which pandas typically uses for strings, we first have to convert the attribute to a numeric type.

In [16]:
df['max_building_age'].describe()

count    46.000000
mean     30.086957
std      16.497577
min      10.000000
25%      20.000000
50%      25.000000
75%      40.000000
max      65.000000
Name: max_building_age, dtype: float64

Once `max_building_age` is a numeric type, `describe()` provides __summary statistics__ for this attribute.

In [17]:
# converts max_building_age to numeric type
df["max_building_age"] = pd.to_numeric(df["max_building_age"])
df['max_building_age'].describe()

count    46.000000
mean     30.086957
std      16.497577
min      10.000000
25%      20.000000
50%      25.000000
75%      40.000000
max      65.000000
Name: max_building_age, dtype: float64
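
To make the effect of the conversion easier to see, here is a minimal sketch on a small toy data frame. The values below are hypothetical and are not taken from the Baltimore csv; the `errors="coerce"` option is an assumption about how one might want to handle entries that cannot be parsed, not something the original notebook uses.

```python
import pandas as pd

# Hypothetical toy data stored as strings; NOT values from the Baltimore csv.
toy = pd.DataFrame({"max_building_age": ["10", "25", "40", "unknown"]})

# On a string column, describe() reports count, unique, top, and freq.
before = toy["max_building_age"].describe()
print(before)

# errors="coerce" turns entries that cannot be parsed into NaN instead of raising.
toy["max_building_age"] = pd.to_numeric(toy["max_building_age"], errors="coerce")

# On a numeric column, describe() reports mean, std, and quartiles; NaN is skipped.
after = toy["max_building_age"].describe()
print(after)
```

Without `errors="coerce"`, `pd.to_numeric` raises an error on the `"unknown"` entry, which can be a useful signal that the column still needs cleaning.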

We can apply the same operations to `max_annual_income`.

In [18]:
df['max_annual_income'].describe()

count       46.000000
mean      3139.130435
std       2009.806874
min       1000.000000
25%       1850.000000
50%       2750.000000
75%       4000.000000
max      10000.000000
Name: max_annual_income, dtype: float64

In [19]:
df['max_annual_income'] = pd.to_numeric(df['max_annual_income'])
df['max_annual_income'].describe()

count       46.000000
mean      3139.130435
std       2009.806874
min       1000.000000
25%       1850.000000
50%       2750.000000
75%       4000.000000
max      10000.000000
Name: max_annual_income, dtype: float64
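
When several columns need the same treatment, the conversion can be written once as a loop. This is a sketch under assumptions: the toy values are hypothetical, and the list name `numeric_cols` is introduced here only for illustration.

```python
import pandas as pd

# Hypothetical toy data stored as strings; NOT values from the Baltimore csv.
toy = pd.DataFrame({
    "max_building_age": ["10", "25", "40"],
    "max_annual_income": ["1000", "2750", "n/a"],
})

# Illustrative list of the columns we want as numbers.
numeric_cols = ["max_building_age", "max_annual_income"]

# Convert each column in turn; unparseable entries become NaN.
for col in numeric_cols:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
print(toy[numeric_cols].describe())
```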

Finally, we create some plots of our data. A __scatter plot__ and a __bar chart__ are shown below.

In [20]:
%%HTML
<script type='text/javascript' src='https://10ay.online.tableau.com/javascripts/api/viz_v1.js'></script><div class='tableauPlaceholder' style='width: 1920px; height: 895px;'><object class='tableauViz' width='1920' height='895' style='display:none;'><param name='host_url' value='https%3A%2F%2F10ay.online.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='&#47;t&#47;sadata' /><param name='name' value='mapping_inequality&#47;Sheet1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='showAppBanner' value='false' /><param name='filter' value='iframeSizedToWindow=true' /></object></div>

### Exercise 4

> 1. Hover over different points and explore their additional characteristics. __Note__: `INHABITANTS_F/N` should be multiplied by 100 to be read as a percent.
2. The different points are clustered by grades. Which clusters have the most variation?
3. How does `BUILDINGS_Construction` vary across the different points?
4. Can you identify a trend overall? 
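
The question about variation within each grade can also be checked numerically. The sketch below is one possible approach, not part of the original analysis: it groups a hypothetical toy frame by `Security_Grade` and compares the standard deviation of `max_annual_income`. The rows are made up for illustration; on the real data you would pass `df` instead of `toy`.

```python
import pandas as pd

# Hypothetical toy rows; the values are illustrative, NOT taken from the csv.
toy = pd.DataFrame({
    "Security_Grade": ["A", "A", "B", "B", "C", "C"],
    "max_annual_income": [5000.0, 7000.0, 3000.0, 3500.0, 1500.0, 4500.0],
})

# Per-grade count, mean, and sample standard deviation of income.
spread = toy.groupby("Security_Grade")["max_annual_income"].agg(["count", "mean", "std"])
print(spread)

# Grade whose incomes vary the most in this toy sample.
most_varied = spread["std"].idxmax()
print(most_varied)
```

Standard deviation is only one measure of spread; with as few rows per grade as the summaries above suggest, the range or interquartile range may be just as informative.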

In [21]:
%%HTML
<script type='text/javascript' src='https://10ay.online.tableau.com/javascripts/api/viz_v1.js'></script><div class='tableauPlaceholder' style='width: 1440px; height: 715px;'><object class='tableauViz' width='1440' height='715' style='display:none;'><param name='host_url' value='https%3A%2F%2F10ay.online.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='&#47;t&#47;sadata' /><param name='name' value='mapping_inequality&#47;Sheet3' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='showAppBanner' value='false' /><param name='filter' value='iframeSizedToWindow=true' /></object></div>

### Exercise 5

> 1. Recall the preparation applied to `INHABITANTS_Foreignborn`. How might this have influenced these outcomes?
2. What can you learn from this graph?
3. What do you learn about the different grades?

In [22]:
%%HTML
<script type='text/javascript' src='https://10ay.online.tableau.com/javascripts/api/viz_v1.js'></script><div class='tableauPlaceholder' style='width: 1440px; height: 715px;'><object class='tableauViz' width='1440' height='715' style='display:none;'><param name='host_url' value='https%3A%2F%2F10ay.online.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='&#47;t&#47;sadata' /><param name='name' value='mapping_inequality&#47;Sheet2' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='showAppBanner' value='false' /><param name='filter' value='iframeSizedToWindow=true' /></object></div>

### Exercise 6

> 1. What can you learn from this graph?
2. What are some explanations for the outcomes?
3. What can you learn about the different grades?
4. Compared with the previous graph, what are the similarities and differences?