# Intro to Data Science
## Part I. - What is Data Science?

## Table of contents

- ##### Administration
    - <a href="#Administration">Administration</a>

- ##### Data Science intro
    - <a href="#Intro">Intro</a>
    - <a href="#Basic-taxonomy-of-data-science-methods">Taxonomy</a>
    - <a href="#Basic-workflow-with-scikit-learn">Basic workflow</a>

- ##### Pipelines
    - <a href="#Introducing-pipelines">Pipelines</a>

---

## Administration

### Curriculum:
- Overview, technical basics, pipelines
- Data Discovery, Naive linear classifiers
- Data Transformation, Decision trees
- Dimensionality Reduction, SVMs
- Text mining, Neural networks
- Model Evaluation, Hyperparameter optimization, Clustering
- Regression and Embedding pipelines

### Requirements:

- Weekly Assignments
- A data science project

---

## Intro

### WTF is Data Science?

According to a random Venn diagram:

<img src="pics/data_science_venn_diagram.png" width=300 align="left">
<br style="clear:left;"/>
from <a href="https://www.kdnuggets.com/2016/10/battle-data-science-venn-diagrams.html/2" target="new">kdnuggets</a>

As a metro map: 

<img src="pics/RoadToDataScientist.png" width=500 align="left">
<br style="clear:left;"/>
from <a href="http://nirvacana.com/thoughts/2013/07/08/becoming-a-data-scientist/" target="new">pragmatic perspectives</a>

### At the end of the day:

It's just a fancier name for Data Mining. Maybe throw some more hacking skills into the mix.


### Who is a Data Scientist then?

- <a href="https://twitter.com/jeremyjarvis/status/428848527226437632">_"A data scientist is a statistician who lives in San Francisco"_</a>
- <a href="https://twitter.com/josh_wills/status/198093512149958656">_"A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician."_</a>


### Thanks, much clearer now. (NOT) Can you at least tell me what a data scientist does?
#### A.k.a: the typical workflow - The KDD (Knowledge Discovery in Databases) Process

<img src="pics/kdd.png" width=500 align="left">
<br style="clear:left;"/>
from <a href="https://data-flair.training/blogs/data-mining-and-knowledge-discovery/">data flair</a>


## Basic taxonomy of data science methods

There is a lot of "implicit" information in data that humans can't observe directly, but that can be extracted by statistical methods (a.k.a. _analytics_). Extracting it is exactly our goal. Basically, there are two main types of analytics:

#### Descriptive analytics
<img src="pics/descriptive_analytics.png" width=200 align=left> 
<br style="clear:left;"/>
from <a href="https://www.analyticbridge.datasciencecentral.com/profiles/blogs/descriptive-predictive-prescriptive-analytics-will-fail-to-help">data science central</a>

**Goal:** To extract valuable information from a given dataset. Answer the question: _"What has happened?"_  
**Example:** Describe the relation between students' high-school math grades and the points they achieved in the tests of the university statistics course.
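
A minimal sketch of what such a description could look like with pandas. The grades and points below are made up for illustration (the column names `math_grade` and `stats_points` are assumptions, not part of any course dataset):

```python
import pandas as pd

# Hypothetical toy data: high-school math grade (1-5) and
# points achieved in the university statistics tests.
students = pd.DataFrame({
    "math_grade":   [2, 3, 3, 4, 4, 5, 5, 5],
    "stats_points": [41, 48, 55, 60, 72, 78, 85, 90],
})

# Summary statistics describe "what has happened" in each column.
summary = students.describe()

# The Pearson correlation describes how strongly the two columns move together.
correlation = students["math_grade"].corr(students["stats_points"])

print(summary)
print(f"Pearson correlation: {correlation:.2f}")
```

Nothing is predicted here: we only summarize data we already have.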

#### Predictive analytics
<img src="pics/predictive_analytics.jpg" width=200 align=left> 
<br style="clear:left;"/>
from <a href="https://data-mining.philippe-fournier-viger.com/">Philippe Fournier-Viger</a>

**Goal:** To predict missing information based on previous knowledge. Answer the question: _"What could happen?"_  
**Example #1:** When you apply for a loan, the bank feeds your data into its model to predict the probability that you will repay the loan. Depending on this prediction, it can decide whether or not to grant you the loan you asked for.  
**Example #2:** A store has some information on its customers, and from that information it can determine what types of people visit its stores (like students, retirees, etc.). This way it can adjust the stores' opening hours to fit the needs of the different groups of customers it serves. (This is called clustering.)

---
  
There is another way of categorizing the statistical/machine learning/data mining methods: **supervised** and **unsupervised** learning.

#### Supervised learning
<img src="pics/kittens_puppies.jpg" width=200 align=left> <br style="clear:left;"/>
**Supervised learning** is based on data that is already 'labeled'. In other words, we have data for which we know what the correct output is. We train our model on this dataset, and afterwards the model can predict the output for any input we give it (e.g. whether a picture shows a cat or a dog). One of the simplest supervised learning methods is linear regression.
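
As a rough illustration, here is linear regression fitted on a tiny labeled dataset. The data is synthetic (generated from y = 2x + 1 without noise), so this is only a sketch of the train-then-predict idea:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Labeled toy data: for every input x we know the correct output y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2 * X.ravel() + 1

# "Training" means estimating the slope and intercept from the labeled examples.
model = LinearRegression().fit(X, y)

# After training, the model can predict the output for an unseen input.
prediction = model.predict(np.array([[10.0]]))
print(model.coef_[0], model.intercept_, prediction[0])
```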

#### Unsupervised learning
<img src="./pics/kacsa.png" width=400 align=left> <br style="clear:left;"/>
With **unsupervised learning** we don't know what the correct output should be - we try to detect a hidden structure in the data. The simplest example of this is the clustering example mentioned above.
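
The store example could be sketched with k-means clustering roughly like this. The customer data is invented, and the choice of k-means with two clusters is an assumption made for the sake of the example:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [age, usual visiting hour]. No labels are given.
customers = np.array([
    [19, 16], [21, 17], [22, 18], [20, 17],   # younger, afternoon visitors
    [68, 9],  [71, 8],  [66, 10], [70, 9],    # older, morning visitors
])

# We only tell the algorithm how many groups to look for.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print(labels)                   # cluster index assigned to each customer
print(kmeans.cluster_centers_)  # the "typical" customer of each cluster
```

The cluster indices themselves are arbitrary; it is up to us to interpret the groups (e.g. as "students" and "retirees").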

### Validation

How can we validate our model/output? In the case of unsupervised learning there are no known correct outputs to compare against, so we can't validate it this way. With supervised learning, however, the basic idea is pretty straightforward. We split our dataset into two parts: a training set and a test set. We train our model _only on the training set_, and then compare the model's output on the test set to the known correct output.
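
A minimal sketch of this idea using scikit-learn's `train_test_split`. The dataset is synthetic (`make_classification`), and the 75/25 split ratio is just an assumption for the example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic labeled data, only for demonstration.
X_all, y_all = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 25% of the rows; the model never sees them during training.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.25, random_state=0, stratify=y_all
)

# Train only on the training set...
model = LogisticRegression().fit(X_tr, y_tr)

# ...and compare predictions with the known labels of the test set.
test_accuracy = model.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.2f}")
```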

<br style="clear:left;"/>

---

## Basic workflow with scikit-learn

<img src="pics/titanic.gif" align=left>
<br style="clear:left;"/>

To introduce the basic workflow we'll try to answer a simple question: _"Will I survive the sinking of the Titanic?"_ 
This is a __classification problem__, which is a __prediction task__. We'll choose a familiar method to solve this problem: _logistic regression_. It is a __supervised method__ which we'll use to predict whether a passenger survives the Titanic catastrophe.

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns

import random

import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

from sklearn.metrics import confusion_matrix

### 1. Read and transform data

In [None]:
data = pd.read_csv('data/titanic_full.csv', index_col='PassengerId').dropna(subset=['Embarked'])
test_mask = pd.read_csv('data/titanic.csv', index_col='PassengerId')
test_mask = test_mask['Survived'].isnull()

data.head()

In [None]:
sex = LabelEncoder()
embark = LabelEncoder()

data['Sex'] = sex.fit_transform(data['Sex'])
data['Embarked'] = embark.fit_transform(data['Embarked'])
data.head()

In [None]:
data.shape

In [5]:
input_cols = [col for col in data.columns
              if col not in ('Name', 'Ticket', 'Cabin', 'Survived')]
target_col = 'Survived'

train = data.loc[~test_mask]
test = data.loc[test_mask]

X_train = train[input_cols].fillna(-1)
y_train = train[target_col]

X_test = test[input_cols].fillna(-1)
y_test = test[target_col]

### 2. Train models

### Introducing pipelines

Since we only want a logistic regression in our model, we could simply use the LogisticRegression class we imported from sklearn's linear_model module. However, there is a useful concept called **pipeline**, which really comes in handy when dealing with more complicated models.

When dealing with data, we may first want to transform it to make it more digestible to our estimators (e.g. getting rid of some attributes). There can be multiple transformation steps involved in our process, and each transformation may have multiple parameters that can be tweaked independently. Pipelines provide a wrapper for these steps which makes working with these transformations easier and more concise.
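
To give a feeling for what this looks like with more than one step, here is a sketch of a two-step pipeline. The `StandardScaler` step and the names `demo_pipe` and `'scale'` are assumptions made for illustration; they are not used in the rest of this notebook:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A transformation step followed by the final estimator.
demo_pipe = Pipeline(steps=[
    ('scale', StandardScaler()),         # transformation: standardize the features
    ('logistic', LogisticRegression()),  # estimator: the actual model
])

# Parameters of any step can be tweaked through the pipeline,
# using the '<step name>__<parameter name>' convention.
demo_pipe.set_params(logistic__C=0.5)

print(demo_pipe.named_steps['logistic'].C)
```

Calling `fit` on such a pipeline fits and applies each transformation in order and then fits the final estimator on the transformed data.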

- Create the pipeline

In [None]:
logistic_regression = LogisticRegression()
pipe = Pipeline(steps=[
    ('logistic', logistic_regression)
])
pipe

- Fit the pipeline

In [None]:
estimator = pipe.fit(X_train, y_train)
estimator

### 3. Validation
- Validation accuracy

In [None]:
y_pred = estimator.predict(X_test)
print("Prediction accuracy: {:.2f}%".format(np.sum(y_pred == y_test) / len(y_pred) * 100))
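
Accuracy is simply the fraction of predictions that match the known labels. As a small sanity check on made-up labels (the `*_demo` arrays are hypothetical), the manual computation should agree with scikit-learn's `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions, for illustration only.
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Fraction of matching positions, computed by hand...
manual_accuracy = np.sum(y_pred_demo == y_true_demo) / len(y_pred_demo)

# ...and with the library helper.
library_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy, library_accuracy)
```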

- Confusion matrix

In [None]:
cnf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cnf_matrix, annot=True, fmt="d", cmap=plt.cm.Blues);
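
To read the heatmap: in scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes. A small sketch on made-up labels (the `*_demo` arrays are hypothetical) shows how the four cells of a binary confusion matrix can be unpacked:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions, for illustration only.
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels 0/1 the matrix is laid out as [[tn, fp], [fn, tp]].
demo_matrix = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = demo_matrix.ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(demo_matrix)
print(precision, recall)
```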

### 4. Use the validated model

In [None]:
my_pclass = 2  # 1st, 2nd or 3rd class
my_sex = sex.transform(['male'])
my_age = 40
my_sibsp = 1   # Number of siblings/spouses aboard
my_parch = 1   # Number of parents/children aboard
my_fare = data.loc[data['Pclass'] == my_pclass, 'Fare'].mean()  # the average fare for my_pclass
my_embarked = embark.transform([random.choice('CQS')])   # Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

me = pd.DataFrame([{
    'Pclass': my_pclass,
    'Sex': my_sex[0],
    'Age': my_age,
    'SibSp': my_sibsp,
    'Parch': my_parch,
    'Fare': my_fare,
    'Embarked': my_embarked[0]
}])
me

##### Drumroll
<img src="pics/drumroll.gif" align=left>

In [None]:
estimator.predict(me)
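
`predict` only returns the most likely class. A fitted logistic regression can also report class probabilities through `predict_proba`. The sketch below uses a tiny made-up dataset and a separate `toy_model` (both hypothetical), so the numbers say nothing about the real Titanic data:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy passengers (Sex encoded as 0 = female, 1 = male).
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 1],
    'Sex':      [0, 1, 0, 1, 0, 1, 1, 0],
    'Survived': [1, 0, 1, 0, 1, 0, 0, 1],
})
toy_model = LogisticRegression().fit(toy[['Pclass', 'Sex']], toy['Survived'])

passenger = pd.DataFrame([{'Pclass': 2, 'Sex': 1}])

# One row per passenger, one column per class, in the order of toy_model.classes_.
proba = toy_model.predict_proba(passenger)
print(toy_model.classes_)
print(proba)
```

With the pipeline fitted above, the analogous call would be `estimator.predict_proba(me)`.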