# Intro to Data Science
## Part II. - Data discovery

### Table of contents

- ##### Data Discovery
    - <a href="#What-is-Data-Discovery?">Theory</a>
    - <a href="#Let's-do-it-then!">Examples</a>

- ##### Classification basics
    - <a href="#A-bit-more-on-classification">Theory</a>
    - <a href="#Now-look-at-the-iris-dataset">Examples</a>

---

## What is Data Discovery?
Data discovery is the process of looking into the data and trying to figure out:
- what is interesting in the data
- what one can do with it
- whether it needs extensive preprocessing

From <a href="https://en.wikipedia.org/wiki/Data_discovery#Definition">Wikipedia</a>:
> Data Discovery is a user-driven process of searching for patterns or specific items in a data set.
> Data Discovery applications use visual tools such as geographical maps, pivot-tables, and heat-maps
> to make the process of finding patterns or specific items rapid and intuitive. Data Discovery may 
> leverage statistical and data mining techniques to accomplish these goals.

### Why is it important?
It speeds up the whole process by giving you insights about:
- whether the data can be used at all
- the necessary preprocessing steps
- the possible algorithms
- the interesting data points
- which features to use

### Tools
Anything and everything. Two important factors:
- speed __->__ base statistics
- ease of understanding __->__ PLOTS-PLOTS-PLOTS!

#### Plots vs descriptive metrics:

<img src="pics/dino.gif" width=400 align="left">
<br style="clear:left;"/>
from <a href="https://www.autodesk.com/research/publications/same-stats-different-graphs">autodesk's blog</a>
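
The same point can be made without a picture. The following is a minimal sketch on made-up toy data (not taken from the source above): two series share exactly the same mean and standard deviation, yet relate to `x` in opposite ways. The helper `pearson` is written by hand only for illustration.

```python
import statistics

# Toy data (an assumption for illustration): y_b is just y_a in reverse order.
x = [1, 2, 3, 4, 5]
y_a = [1, 2, 3, 4, 5]
y_b = [5, 4, 3, 2, 1]

def pearson(u, v):
    """Pearson correlation coefficient, written out by hand."""
    mu_u, mu_v = statistics.mean(u), statistics.mean(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

# Per-column summaries are identical ...
print(statistics.mean(y_a), statistics.mean(y_b))
print(statistics.stdev(y_a), statistics.stdev(y_b))
# ... but the relationship with x is completely different.
print(pearson(x, y_a), pearson(x, y_b))
```

A `describe`-style summary of `y_a` and `y_b` would look the same, while a scatter plot against `x` shows the difference at a glance.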

### Let's do it then! 
Load the built-in iris dataset with sklearn's `load_iris` function and discover the dataset! (Hint: load the dataset with the `return_X_y=True` parameter and create a `pandas.DataFrame` from the data, then use the `plot` method of the `pandas.DataFrame` for plotting. You can try its `describe` method as well.)

#### Answer the following questions:
- What is the task to solve?
- Did anything interesting show up?
- What questions should we ask about the dataset?
- How should we solve the task?
- What should we do as the first step of preprocessing?

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
import pandas as pd

from sklearn.datasets import load_iris

pd.set_option('display.max_columns', 100)

- Load the dataset into a pandas DataFrame

In [None]:
iris = load_iris(as_frame=True)

X, y = iris['data'], iris['target'].to_frame()
df = pd.concat([X, y], axis='columns')

df.head()

In [None]:
df.target.unique()

In [None]:
df.target.nunique()

- Plot the data points

In [None]:
df.plot(1, 3, kind='scatter', c='target', colormap='Set1');

In [None]:
df.plot.scatter(1, 3, c='target', colormap='Set1');

- Generate basic statistics about the data

In [None]:
df.describe(percentiles=[.1, .25, .5, .75, .9])

In [None]:
sns.boxplot(data=X);

- Generate basic statistics by target labels

In [None]:
df.groupby('target').describe()

In [None]:
fig, axes = plt.subplots(ncols=3, sharey="row", figsize=(16,6))

for i in range(3):
    sns.boxplot(data=df.loc[df.target == i, X.columns], ax=axes[i])
    axes[i].tick_params(axis='x', labelrotation = -45)

- Plot every feature against every other one!

In [None]:
colormap = {0: 'tab:blue', 1: 'tab:red', 2: 'tab:green'}
point_colors = df['target'].map(colormap)  # one color name per data point
fig, axes = plt.subplots(nrows=4, ncols=4, 
                       sharex="col", sharey="row", 
                       figsize=(12,12))

for i, row in enumerate(axes):
    for j, col in enumerate(row):
        if i != j:
            # x is the column's feature, y is the row's feature,
            # so the axes shared per column/row hold the same feature
            col.scatter(df.iloc[:, j], df.iloc[:, i], c=point_colors)
        else:
            # diagonal: per-class histogram of feature i
            col.hist([df.loc[df['target'] == k].iloc[:, i] for k in range(3)], 
                     bins=20, histtype='stepfilled', color=list(colormap.values()))
        col.set_title('{} - {}'.format(i, j))

In [None]:
sns.pairplot(df, hue='target', vars=X.columns);

- Generate the correlation matrix

In [None]:
df.corr()

In [None]:
sns.heatmap(df.corr(), cmap='Blues');

- Dealing with missing values and outliers

In [None]:
filtered = df.copy()
outliers = []

for col in X.columns:
    upper_thres = df[col].mean() + 2 * df[col].std()
    lower_thres = df[col].mean() - 2 * df[col].std()
    
    filtered = filtered.loc[filtered[col].between(lower_thres, upper_thres)]
    
    outliers.append(df.loc[~df[col].between(lower_thres, upper_thres)])

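# A row can be an outlier in several columns, so keep each row only once
outliers = pd.concat(outliers)
outliers = outliers.loc[~outliers.index.duplicated()]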

In [None]:
filtered.corr()

In [None]:
sns.heatmap(filtered.corr());

In [None]:
outliers.plot(1, 3, kind='scatter', c='target', colormap='jet');

- Using the third-party library `ydata-profiling`  
  Install it with:  
  ```bash
  conda activate szisz_ds_23
  pip install ydata-profiling
  ```

In [None]:
import ydata_profiling

In [None]:
ydata_profiling.ProfileReport(df)

---

### Task
Build a dummy classifier based on your observations!

In [None]:
def predict_iris(sepal_l, sepal_w, petal_l, petal_w):
    return 0

---

## A bit more on classification

Last time we saw a special case of classification, where there are two mutually exclusive classes. This can be generalized in two ways: there are $\lvert C \rvert > 2$ classes, which are either mutually exclusive or not. The latter case is called *any-of*, *multilabel*, or *multivalue* classification. This problem can be broken down into $\lvert C \rvert$ *binary* classifications, each applied independently to the train and test sets (although the classes need not be independent in the statistical sense). The other, *one-of* or *multiclass* case is a bit more complicated... (<a href="http://nlp.stanford.edu/IR-book/html/htmledition/classification-with-more-than-two-classes-1.html">source</a>)

Imagine instances as points in a $d$-dimensional space, where every dimension corresponds to a feature. Linear classification works by dividing this *feature space* with a hyperplane (a hyperplane is the generalization of a line to higher dimensions). In other words, linear classifiers make the classification decision based on a linear combination of the features.
This is most easily understood by a simple example: a problem solvable by linear classification (image **A**) and not solvable by linear classification (image **B**):  
<img src="pics/linear_vs_nonlinear_problems.png" width="400px" align="left">
<br style="clear:left;"/>
from <a href="https://sebastianraschka.com/Articles/2014_naive_bayes_1.html">Sebastian Raschka's blog</a>

## Getting a bit technical...
Since Jupyter notebooks can handle $\LaTeX$, let's write some equations!  
Formally, the linearity means that the classification *can* be expressed like this:  

$$ c = f \left( \sum _{i=0} ^{N} w_i x_i \right) $$  
where $c \in C$ is the class we predict for a given instance $\mathbf{x}$, $w_i$ is the weight of attribute $x_i$, and $f$ is a function that maps its input to a class. Geometrically, $\mathbf{w}$ is the normal of the separator hyperplane. **The weights are basically what we create in the process of learning, and we use the learnt weights to predict the class of a given input instance.**  
Note that $x_0 = 1$ is a dummy feature, only to make the notation convenient, since the equation for a hyperplane is $w_1 x_1 + w_2 x_2 + \dots + w_N x_N + w_0 = 0 $  
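
As a rough illustration of the formula, here is a minimal sketch in plain Python. It assumes that $f$ is the sign function and uses hand-picked, hypothetical weights; neither choice comes from the text above.

```python
def predict_linear(weights, features):
    """Return +1 or -1 depending on the side of the hyperplane."""
    # Prepend the dummy feature x_0 = 1 so that weights[0] acts as w_0.
    x = [1.0] + list(features)
    score = sum(w_i * x_i for w_i, x_i in zip(weights, x))
    # f is the sign function here (an assumption; other choices of f exist).
    return 1 if score >= 0 else -1

# Hypothetical weights describing the line x_1 + x_2 - 1 = 0.
w = [-1.0, 1.0, 1.0]
print(predict_linear(w, [2.0, 2.0]))  # a point on the positive side
print(predict_linear(w, [0.0, 0.0]))  # a point on the negative side
```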

Back to *multiclass* classification: in the linear case, it is done by applying binary classifiers to each instance and making the decision based on the score/probability/etc. of each binary classifier.
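
One common way to do this is to keep one weight vector per class and pick the class with the highest score. The sketch below assumes exactly that scheme; the class names and weights are hypothetical and chosen by hand.

```python
def predict_multiclass(weight_vectors, features):
    """Score the instance with one linear classifier per class, pick the best."""
    x = [1.0] + list(features)  # dummy feature x_0 = 1
    scores = {
        label: sum(w_i * x_i for w_i, x_i in zip(w, x))
        for label, w in weight_vectors.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical, hand-picked weights for three classes on a single feature.
class_weights = {"small": [1.0, -1.0], "medium": [0.0, 0.0], "large": [-3.0, 1.0]}
label, scores = predict_multiclass(class_weights, [5.0])
print(label, scores)
```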

## Example 0: <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression">Logistic regression</a>

Yes, logistic regression is linear! Why? Because the predictions can be written in the form 
$$ \hat{p} = \frac{1}{1+e^{-\mathbf{w} \cdot \mathbf{x}}} $$
so, more precisely, the *log-odds* $\ln \frac{\hat{p}}{1-\hat{p}} = \mathbf{w} \cdot \mathbf{x}$ are a linear function of $\mathbf{x}$.
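
A quick numeric check of this claim, as a sketch with hypothetical weights (a bias of 0.5 and a slope of 2.0): the predicted probability is squashed into (0, 1), but the log-odds grow by the same amount for every unit step in the feature.

```python
import math

def predict_proba(weights, features):
    """Logistic model p = 1 / (1 + exp(-w.x)), with x_0 = 1 as dummy feature."""
    x = [1.0] + list(features)
    z = sum(w_i * x_i for w_i, x_i in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

w = [0.5, 2.0]  # hypothetical weights: bias 0.5, slope 2.0
for x1 in (0.0, 1.0, 2.0):
    p = predict_proba(w, [x1])
    # the log-odds column increases by the slope at every step
    print(x1, round(p, 4), round(log_odds(p), 4))
```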

## Example 1: <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html">Perceptron</a>

This is a very simple algorithm for binary classification, where $c$ can be $-1$ or $1$. Initialize the weights with some values (0s or arbitrary small random numbers), then go through the training set in some order (the resulting separator is not unique, it depends on the order!). Training rule: if the sign of the product $\mathbf{w} \cdot \mathbf{x}$ is equal to the known output, don't change the weights. If the sign is different, modify the weights by adding $c \cdot \mathbf{x}$ to them. The generalization of this algorithm to the multiclass case, along with good pictures and examples, can be read on the <a href="https://en.wikipedia.org/wiki/Perceptron" >Perceptron's Wikipedia page</a>.
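
The training rule can be sketched in a few lines of plain Python. This is a simplified illustration, not sklearn's implementation: it assumes zero-initialized weights, a fixed number of passes over the data, and it treats a score of exactly 0 as class $-1$. The toy data is made up.

```python
def train_perceptron(samples, labels, epochs=10):
    """Perceptron rule: on a mistake, add c * x to the weights."""
    n_features = len(samples[0])
    w = [0.0] * (n_features + 1)  # w[0] belongs to the dummy feature x_0 = 1
    for _ in range(epochs):
        mistakes = 0
        for features, c in zip(samples, labels):
            x = [1.0] + list(features)
            score = sum(w_i * x_i for w_i, x_i in zip(w, x))
            predicted = 1 if score > 0 else -1
            if predicted != c:  # wrong sign: move w by c * x
                w = [w_i + c * x_i for w_i, x_i in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # a full pass without mistakes: stop
            break
    return w

def predict_perceptron(w, features):
    x = [1.0] + list(features)
    return 1 if sum(w_i * x_i for w_i, x_i in zip(w, x)) > 0 else -1

# Toy, linearly separable data (an assumption for illustration).
X_toy = [[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]]
y_toy = [1, 1, -1, -1]
w_toy = train_perceptron(X_toy, y_toy)
print(w_toy, [predict_perceptron(w_toy, x) for x in X_toy])
```

Reordering `X_toy` and `y_toy` can lead to different final weights, which is the order dependence mentioned above.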

## Example 2: <a href="http://scikit-learn.org/stable/modules/naive_bayes.html">Naive Bayes</a>

For this, we'll look at the classification problem from a different angle. We have a given input vector $X$, and we need the probability of it belonging to a class $y$.
Nomenclature: the input features $x_i$ are in $X$, and the class variable is $y$. $P(y)$ is called the **a priori**, and $P(y\lvert X)$ the **a posteriori** probability of $y$. This a posteriori probability is what we need.  
At first glance, the Bayes classifier works like this:  

$$\hat{y} = \underset{y}{\mathrm{arg}\,\mathrm{max}}\, P(y\lvert X),$$  
which means "choosing the class with the maximum probability given an input $X$".  
Of course, we would need an enormous training set to be able to estimate $P(y\lvert X)$ directly for every possible input $X$. This is where Bayes' theorem comes in:

$$P(y\lvert X) = \frac{P(X\lvert y) P(y)}{P(X)}$$  
Great, we now have to calculate $P(X\lvert y)$ (but at least we don't have to worry about the divisor, as it is the same for all possible $y$s). If we suppose that the attributes are independent of each other given the class (**this assumption makes it naive!**), then the probability $P(X\lvert y)$ can be written in a product form: $ \prod _i P(x_i \lvert y) $! (The proof of this is given as an exercise to the reader.) We now simply have to calculate the $P(x_i \lvert y)$ probabilities, which requires a much smaller training set. The calculation is simple if the $x_i$-s are categorical variables: just use relative frequencies from the training set. In other cases (e.g. continuous features), we assume that these probabilities come from a distribution (Gaussian, binomial, multinomial, etc.), and we use the training set to estimate the parameters of this distribution. (Note: naive Bayes classification is linear only if this distribution comes from an exponential family.) Finally, the a priori probabilities $P(y)$ are simply calculated as relative frequencies from the training set. 
So the training consists of calculating the $P(x_i \lvert y)$-s (or the distribution parameters) and $P(y)$ from the training set, and the prediction is just calculating the value $ P(y) \prod _i P(x_i \lvert y)$ for each possible $y$ and choosing the $y$ that maximises it.
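
For the categorical case, the whole procedure fits in a short sketch. The data below is made up, the function names are hypothetical, and no smoothing is applied, so a feature value never seen with a class gets probability 0 (library implementations usually smooth these counts).

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(y) and P(x_i | y) as relative frequencies."""
    class_counts = Counter(labels)
    priors = {y: n / len(labels) for y, n in class_counts.items()}
    # value_counts[(i, y)][v] = how often feature i took value v in class y
    value_counts = defaultdict(Counter)
    for row, y in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, y)][value] += 1
    likelihoods = {
        key: {v: n / class_counts[key[1]] for v, n in counts.items()}
        for key, counts in value_counts.items()
    }
    return priors, likelihoods

def predict_naive_bayes(priors, likelihoods, row):
    """Return the class maximising P(y) * prod_i P(x_i | y), plus all scores."""
    scores = {}
    for y, prior in priors.items():
        score = prior
        for i, value in enumerate(row):
            # Unseen (feature value, class) pairs get probability 0 here.
            score *= likelihoods.get((i, y), {}).get(value, 0.0)
        scores[y] = score
    return max(scores, key=scores.get), scores

# Made-up toy data: (petal size, colour) -> class label.
rows = [("short", "white"), ("short", "pink"), ("long", "pink"),
        ("long", "blue"), ("long", "blue"), ("short", "blue")]
labels = ["a", "a", "b", "b", "b", "a"]
priors, likelihoods = train_naive_bayes(rows, labels)
print(predict_naive_bayes(priors, likelihoods, ("long", "blue")))
```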

---

## Now look at the iris dataset

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

In [None]:
def validate(model, X_test, y_test):
    y_pred = model.predict(X_test)
    print("Prediction accuracy: {:.2f}%"
          .format(np.sum(y_pred == y_test) / float(len(y_pred)) * 100))

Split the dataset into a train and a test part.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y['target'],
                                                    test_size=1/3, 
                                                    random_state=42)

- Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression

logistic_pipe = Pipeline(steps=[('logistic', LogisticRegression())])
logistic_pipe.fit(X_train, y_train)

In [None]:
validate(logistic_pipe, X_test, y_test)

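In older scikit-learn versions, the default parameters set the logistic regression to do binary classification for each label (one-vs-rest), then choose the best; since version 0.22, the default `multi_class='auto'` already selects the multinomial formulation for a multiclass problem like this one.  
Either way, with pipelines we can easily set parameters for our predictors explicitly, like this: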

In [None]:
logistic_pipe.set_params(logistic__multi_class='multinomial', 
                         logistic__solver='sag')
logistic_pipe.fit(X_train, y_train)

In [None]:
validate(logistic_pipe, X_test, y_test)

- Perceptron

In [None]:
from sklearn.linear_model import Perceptron

perceptron_pipe =  # TODO
perceptron_pipe.fit(X_train, y_train)

In [None]:
validate(perceptron_pipe, X_test, y_test)

- Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB

nb_pipe =  # TODO
nb_pipe.fit(X_train, y_train)

In [None]:
validate(nb_pipe, X_test, y_test)