# Classification of Italian Wines
![alt text](https://viaverdimiami.com/wp-content/uploads/2017/07/Italian-Wine.jpg)

In this notebook we will use supervised learning to classify Italian wines. 
The question is: Can we teach a machine to figure out which type of wine an observation belongs to?

We will work with a famous but small dataset that can be found [here](https://archive.ics.uci.edu/ml/datasets/wine) (along with more information).
The data is clean: it contains only numerical features and no missing values. We will not do any EDA but focus only on prediction. The only preprocessing step will be standardization of the physicochemical variables.

We will be using Pandas and Scikit-Learn, both of which are part of the Anaconda distribution.

In [42]:
# Download the dataset using wget.
# If this is not possible, just paste the URL into your browser and download
# the file, or, if you use GitHub Desktop, it should be in the folder
# after a pull.


!wget https://cdn.rawgit.com/SDS-AAU/M1-2018/182abaa2/data/wine.csv


Redirecting output to ‘wget-log.1’.


In [0]:
# Importing the libraries

import numpy as np # for working with arrays
np.set_printoptions(suppress=True) # not a must but nice to avoid scientific notation


import pandas as pd # as usual for handling dataframes
pd.options.display.float_format = '{:.4f}'.format #same for pandas to turn off scientific notation

In [0]:
# Importing the dataset
dataset = pd.read_csv('wine.csv')

In [45]:
# Quick check of the dataframe proportions
dataset.shape

(178, 15)

In [46]:
# Checking the first 5 rows to get familiar with the data
dataset.head()

| | class_label | class_name | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280 | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Barolo | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | Barolo | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | Barolo | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | Barolo | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | Barolo | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


In [47]:
# Getting basic descriptives for all numerical variables
dataset.describe()

| | class_label | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280 | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 |
| mean | 1.9382 | 13.0006 | 2.3363 | 2.3665 | 19.4949 | 99.7416 | 2.2951 | 2.0293 | 0.3619 | 1.5909 | 5.0581 | 0.9574 | 2.6117 | 746.8933 |
| std | 0.775 | 0.8118 | 1.1171 | 0.2743 | 3.3396 | 14.2825 | 0.6259 | 0.9989 | 0.1245 | 0.5724 | 2.3183 | 0.2286 | 0.71 | 314.9075 |
| min | 1.0 | 11.03 | 0.74 | 1.36 | 10.6 | 70.0 | 0.98 | 0.34 | 0.13 | 0.41 | 1.28 | 0.48 | 1.27 | 278.0 |
| 25% | 1.0 | 12.3625 | 1.6025 | 2.21 | 17.2 | 88.0 | 1.7425 | 1.205 | 0.27 | 1.25 | 3.22 | 0.7825 | 1.9375 | 500.5 |
| 50% | 2.0 | 13.05 | 1.865 | 2.36 | 19.5 | 98.0 | 2.355 | 2.135 | 0.34 | 1.555 | 4.69 | 0.965 | 2.78 | 673.5 |
| 75% | 3.0 | 13.6775 | 3.0825 | 2.5575 | 21.5 | 107.0 | 2.8 | 2.875 | 0.4375 | 1.95 | 6.2 | 1.12 | 3.17 | 985.0 |
| max | 3.0 | 14.83 | 5.8 | 3.23 | 30.0 | 162.0 | 3.88 | 5.08 | 0.66 | 3.58 | 13.0 | 1.71 | 4.0 | 1680.0 |


We can see here that the means and spreads (standard deviations) of the features are very different, so we will need to standardize the dataset. 


> "As a rule of thumb I’d say: When in doubt, just standardize the data, it shouldn’t hurt." [Sebastian Raschka](https://sebastianraschka.com/Articles/2014_about_feature_scaling.html)

In [0]:
# Selecting the relevant data
# Using the iloc selector lets us grab columns 2 to 14 (the end of the slice, 15, is exclusive)
# without having to call their names. That's practical.
# Also, we ask for values only, as we are going to pass the data into
# the ML algorithms in the form of arrays rather than pandas DFs

X = dataset.iloc[:, 2:15].values
y = dataset.iloc[:, 1].values
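As a small aside, the position-based selection above can be compared with a name-based one. The sketch below uses a hypothetical toy frame (`toy`, with made-up values and only two feature columns) that mimics the column layout of `wine.csv`; it is only meant to show that both ways of selecting give the same arrays.

```python
import pandas as pd

# Hypothetical toy frame that mimics the column layout of wine.csv (made-up values)
toy = pd.DataFrame({
    "class_label": [1, 2, 3],
    "class_name": ["Barolo", "Grignolino", "Barbera"],
    "alcohol": [14.2, 12.4, 12.9],
    "malic_acid": [1.7, 0.9, 1.4],
})

# Position-based selection, as in the cell above (the slice end is exclusive)
X_by_position = toy.iloc[:, 2:4].values
y_by_position = toy.iloc[:, 1].values

# Name-based selection: drop the two label columns, keep everything else
X_by_name = toy.drop(columns=["class_label", "class_name"]).values
y_by_name = toy["class_name"].values

print(X_by_position.shape)
print((X_by_position == X_by_name).all())
```

Selecting by name is a bit more verbose but keeps working if the column order of the file ever changes.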

Yes, there is a ```class_label``` in the dataset, but for the sake of learning, and because it is very simple, we are going to construct our class labels on our own. For this we will use the ```LabelEncoder``` from Scikit-Learn. Note that, in contrast to Pandas, Scikit-Learn is more of a (HUGE!!!) library where you have to import different functionalities separately. You can find an index of all classes [here](http://scikit-learn.org/stable/modules/classes.html).

In [0]:
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder

Classes such as the ```LabelEncoder```, or any model type that you import, have several parameters that can (but don't have to) be specified. Also, you usually fit them to some data first before performing transformations. Thus, they are *custom-made* for each use case, and therefore you need to define an encoder object from the imported class. This is the general philosophy behind all Scikit-Learn classes. The good news: The syntax is the same across all classes.

Below we first define a ```labelencoder_y``` and then use the ```fit_transform``` method (we could also first use ```fit``` and then ```transform```) to turn our wine-type names into numbers.

In [0]:
# From labels to numbers
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
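To make the two-step variant concrete, here is a minimal sketch that calls `fit` and `transform` separately on a short, made-up list of wine names (`names` and `encoder` are illustrative names, not part of the notebook). It also shows where the encoder stores what it has learned.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels in a made-up order, using the three wine names from the dataset
names = ["Barolo", "Grignolino", "Barbera", "Barolo"]

encoder = LabelEncoder()
encoder.fit(names)                 # learn the set of classes
codes = encoder.transform(names)   # map names to integers

print(encoder.classes_)            # classes are stored in sorted order
print(codes)
print(encoder.inverse_transform(codes))
```

Because the classes are sorted alphabetically, the codes need not match the ```class_label``` column of the file (there Barolo is 1, as the `head()` output shows, whereas the encoder assigns it 1 only because it sorts second).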

As you have seen from the descriptives above, our variables lie on very different scales. Therefore, we will standardize them before going further. The procedure using the ```StandardScaler``` is exactly the same as before with the label encoder.

For each value, this scaling will subtract the mean of the column and divide the result by the column's standard deviation, thus bringing all features onto the same scale with a mean of 0 and a standard deviation of 1.

In [0]:
# Feature scaling
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X = scaler.fit_transform(X)
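The scaling step can be checked by hand. The following sketch applies the "subtract the mean, divide by the standard deviation" rule with plain NumPy to a small made-up matrix (`X_toy` is an assumption for illustration) and compares the result with what `StandardScaler` returns.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up matrix: 4 observations, 2 features on very different scales
X_toy = np.array([[14.0, 1000.0],
                  [13.0,  800.0],
                  [12.0,  600.0],
                  [13.0,  400.0]])

# Manual z-score: subtract the column mean, divide by the column standard deviation
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

X_scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(X_manual, X_scaled))
print(X_scaled.mean(axis=0).round(4), X_scaled.std(axis=0).round(4))
```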

In [56]:
# We can check our transformed data using pandas describe
pd.DataFrame(X).describe()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 |
| mean | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | 0.0 | -0.0 | 0.0 | -0.0 | 0.0 | 0.0 | 0.0 | -0.0 |
| std | 1.0028 | 1.0028 | 1.0028 | 1.0028 | 1.0028 | 1.0028 | 1.0028 | 1.0028 | 1.0028 | 1.0028 | 1.0028 | 1.0028 | 1.0028 |
| min | -2.4342 | -1.433 | -3.6792 | -2.671 | -2.0883 | -2.1072 | -1.696 | -1.8682 | -2.069 | -1.6343 | -2.0947 | -1.8951 | -1.4932 |
| 25% | -0.7882 | -0.6587 | -0.5721 | -0.6891 | -0.8244 | -0.8855 | -0.8275 | -0.7401 | -0.5973 | -0.7951 | -0.7676 | -0.9522 | -0.7846 |
| 50% | 0.061 | -0.4231 | -0.0238 | 0.0015 | -0.1223 | 0.096 | 0.1061 | -0.1761 | -0.0629 | -0.1592 | 0.0331 | 0.2377 | -0.2337 |
| 75% | 0.8361 | 0.6698 | 0.6981 | 0.6021 | 0.5096 | 0.809 | 0.8491 | 0.6095 | 0.6292 | 0.494 | 0.7132 | 0.7886 | 0.7582 |
| max | 2.2598 | 3.1092 | 3.1563 | 3.1545 | 4.3714 | 2.5395 | 3.0628 | 2.4024 | 3.4851 | 3.4354 | 3.3017 | 1.9609 | 2.9715 |
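The standard deviation shows up as 1.0028 rather than exactly 1 because the two libraries use different formulas: `StandardScaler` divides by n, while pandas' `describe()` reports the sample standard deviation, which divides by n - 1. With 178 rows the ratio is sqrt(178/177). The sketch below reproduces this on random stand-in data (`X_toy` is an assumption, not the wine data).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Random stand-in data with 178 rows, like the wine set
X_toy = rng.normal(loc=10.0, scale=3.0, size=(178, 2))

X_scaled = StandardScaler().fit_transform(X_toy)

population_std = X_scaled.std(axis=0, ddof=0)       # the quantity StandardScaler sets to 1
sample_std = pd.DataFrame(X_scaled).std().values    # pandas default: ddof=1

print(population_std.round(4))
print(sample_std.round(4))
print(round(float(np.sqrt(178 / 177)), 4))
```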


In the next step we split the data into a training set and a test set. Very often you will see an 80/20 split.


![alt text](https://cdn-images-1.medium.com/max/1000/1*4G__SV580CxFj78o9yUXuQ.png)

80% of the data will be used to fit a model, while we will keep 20% of the data for testing the model's performance.

The train_test_split function is called here with 4 arguments: (X, y, test_size = 0.2, random_state = 21)


1. Input matrix: X
2. Output vector: y
3. The test size: We take 20%
4. A random state (optional): Some number for the random generator that will shuffle the values*

*The whole random state thing is mostly for easier reproducibility and can also be left out. 





In [0]:
# Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 21)
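To see what `random_state` buys us, the sketch below splits a small made-up dataset twice with the same seed and checks that both splits are identical. It also shows the optional `stratify` argument, which is not used in this notebook but keeps the class proportions equal in both parts; `X_toy` and `y_toy` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 observations, 2 features, two classes
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Same random_state -> identical split on every run
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=21)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=21)
same_split = all((a == b).all() for a, b in zip(split_a, split_b))
print(same_split)

# stratify keeps the class proportions the same in the training and test part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=21, stratify=y_toy
)
print(X_tr.shape, X_te.shape)
print(np.bincount(y_tr), np.bincount(y_te))
```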

![alt text](https://uproxx.files.wordpress.com/2015/12/bender-pointless-day.jpg?quality=95)

Now it's time for the model to meet the wine data.

We will be using 3 different models. The reason is that it is nice to see how easy it is to switch them around and experiment with what works best. Since we can calculate a (kind of) objective quality measure, it is easy to compare and evaluate them against each other. 

* Logistic Regression
* Support Vector Classifier
* Random Forest Classifier

Remember that this is a classification problem rather than a regression. Each model scores how likely an observation is to belong to one class rather than the others (logistic regression and random forests express this as probabilities) and predicts the best-scoring class.
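What "scoring one class against the others" looks like in practice can be sketched with `predict_proba`. This example assumes scikit-learn's bundled copy of the wine data (`load_wine`) as a stand-in for `wine.csv`, and the names `X_demo`, `y_demo` and `clf` are illustrative. Note that an `SVC` created with its default `probability=False` does not offer `predict_proba`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the wine data stands in for wine.csv
X_demo, y_demo = load_wine(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)

clf = LogisticRegression(max_iter=1000, random_state=22).fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo[:3])   # one row per wine, one column per class
predicted = clf.predict(X_demo[:3])     # the class with the highest probability

print(proba.round(3))
print(proba.sum(axis=1))                # each row sums to 1
print(predicted)
```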

In [57]:
# We first import and train a Logistic Regression

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state = 22)

classifier.fit(X_train, y_train)


# After training the model we should jump further down (past the next 2 models)
# to evaluate the results

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=22, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [59]:
# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators = 50, criterion = 'entropy', random_state = 22)

classifier.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=22, verbose=0, warm_start=False)

In [61]:
# Finally we train a Support Vector Classifier
from sklearn.svm import SVC

classifier = SVC(kernel = 'linear', random_state = 21)

classifier.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=21, shrinking=True,
  tol=0.001, verbose=False)

Perhaps this time the algorithm was just lucky because of a random allocation of the data in the train-test split. To get a more reliable picture of which model is the most accurate, we can run a k-Fold Cross Validation, dividing X_train into (here) 5 parts, training on 4 and testing on 1. This is done 5 times, each time measuring the accuracy; finally we report the average accuracy and its standard deviation.

![alt text](https://www.researchgate.net/profile/Kiret_Dhindsa/publication/323969239/figure/fig10/AS:607404244873216@1521827865007/The-K-fold-cross-validation-scheme-133-Each-of-the-K-partitions-is-used-as-a-test.png)

In [62]:
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 5)

print(accuracies.mean())
print(accuracies.std())

0.9856960408684546
0.017537311768860152
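Since switching models is just a matter of swapping the classifier object, all three candidates can be cross-validated in one loop. The sketch below is one possible way to do that, not the notebook's own procedure: it assumes scikit-learn's bundled `load_wine` data as a stand-in for `wine.csv` and wraps each model in a pipeline, so the scaler is refit on the training folds only. (In the notebook the scaler was fit on all rows before the split, so the held-out rows had a small influence on the means and standard deviations.)

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled copy of the wine data stands in for wine.csv
X_raw, y_raw = load_wine(return_X_y=True)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=22),
    "Random Forest": RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=22),
    "Support Vector Classifier": SVC(kernel="linear", random_state=21),
}

results = {}
for name, model in models.items():
    # Scaling happens inside the pipeline, separately for each cross-validation round
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X_raw, y_raw, cv=5)
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The exact scores depend on the scikit-learn version and on how the folds are drawn, so they are not expected to match the numbers above.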


Now that we have fitted (trained) a model, we need to figure out how well it performs. This approach to evaluation is very different from what many of you are used to from econometrics. 

Here we are not interested in a model summary table; rather, we will be exploring predictive performance.
In the next cell we ask the classifier object (our trained model) to give us predictions for data it has never seen before.

Then we will compare the predictions against the real-world values that we actually know.

In [0]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)

In [0]:
# Making a classification report
from sklearn.metrics import classification_report

cm = classification_report(y_test, y_pred)

In [67]:
print(cm)

             precision    recall  f1-score   support

          0       0.92      1.00      0.96        11
          1       1.00      1.00      1.00        15
          2       1.00      0.90      0.95        10

avg / total       0.97      0.97      0.97        36



There is also a slightly more intuitive way to evaluate our predictions in a multiclass classification. A confusion matrix works here too, but it only shows numeric class codes. What we can do instead is use pandas to cross-tabulate our real wines against our predicted wines, with the wine names as labels.

To get the wine names, we will use the ```inverse_transform``` method of our ```labelencoder_y```.

In [68]:
# Transforming numerical labels to wine types

true_wines = labelencoder_y.inverse_transform(y_test)

predicted_wines = labelencoder_y.inverse_transform(y_pred)



In [71]:
# Creating a pandas DataFrame and cross-tabulation

df = pd.DataFrame({'true_wines': true_wines, 'predicted_wines': predicted_wines}) 

pd.crosstab(df.true_wines, df.predicted_wines)

| true_wines \ predicted_wines | Barbera | Barolo | Grignolino |
|---|---|---|---|
| Barbera | 11 | 0 | 0 |
| Barolo | 0 | 15 | 0 |
| Grignolino | 1 | 0 | 9 |
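The numbers in the classification report can be recovered from this cross-tabulation. As a small worked example, the sketch below copies the table into a NumPy array (`table` and `labels` are illustrative names) and computes precision, recall, F1 and accuracy by hand.

```python
import numpy as np

labels = ["Barbera", "Barolo", "Grignolino"]

# Rows: true wines, columns: predicted wines (copied from the cross-tabulation above)
table = np.array([[11,  0, 0],
                  [ 0, 15, 0],
                  [ 1,  0, 9]])

correct = np.diag(table)
precision = correct / table.sum(axis=0)   # correct / everything predicted as that wine
recall = correct / table.sum(axis=1)      # correct / everything that truly is that wine
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / table.sum()

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:<11} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The single misclassified Grignolino lowers the recall of Grignolino (9 of 10) and the precision of Barbera (11 of 12), which is exactly what the report shows for classes 2 and 0.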


**But is that not the same as clustering or some other kind of unsupervised learning such as PCA?**

Well, let's try to use unsupervised learning on the same dataset. We will be using KMeans (because it is simple and nice for illustration).

Just as before, we import a model class, define a model object and fit it. Same 3 steps as before.

In [0]:
# We import KMeans and create a model object (we know that there are 3 wines...kind of cheating)
from sklearn.cluster import KMeans

model = KMeans(n_clusters = 3)

In [83]:
# Fitting the model is super easy, just one line
model.fit(X_train)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [0]:
# Prediction is easy, too

predicted_wine_clusters = model.predict(X_train)

predicted_new_wine_clusters = model.predict(X_test)

Note that the clustering model never met any y-values, only X values.

In [85]:
# Quick print out of the labels

predicted_wine_clusters

array([0, 2, 0, 1, 2, 2, 1, 2, 1, 1, 0, 0, 2, 2, 1, 0, 0, 0, 2, 2, 2, 2,
       1, 0, 0, 1, 0, 2, 2, 0, 1, 0, 0, 0, 0, 1, 2, 2, 1, 1, 1, 1, 1, 1,
       1, 2, 0, 1, 2, 1, 0, 2, 2, 0, 2, 2, 2, 2, 2, 0, 1, 1, 2, 1, 2, 1,
       1, 2, 1, 1, 0, 2, 1, 0, 2, 2, 0, 1, 0, 1, 1, 1, 2, 0, 2, 2, 1, 1,
       2, 2, 0, 1, 1, 1, 2, 2, 2, 2, 0, 0, 0, 1, 2, 0, 0, 1, 0, 2, 2, 0,
       1, 2, 0, 1, 2, 1, 0, 1, 2, 1, 1, 0, 1, 1, 0, 2, 0, 2, 2, 2, 0, 2,
       0, 2, 2, 2, 0, 2, 2, 1, 1, 1], dtype=int32)
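One thing to keep in mind when reading these labels: the cluster numbers are arbitrary names for groups, not encoded wine types. The sketch below takes the first ten labels from the output above, renames the clusters, and uses the adjusted Rand index to show that the grouping itself is unchanged (`clusters` and `renamed` are illustrative names).

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# First ten cluster labels from the output above
clusters = np.array([0, 2, 0, 1, 2, 2, 1, 2, 1, 1])

# Rename the clusters: 0 -> 2, 1 -> 0, 2 -> 1 (same grouping, different numbers)
renamed = np.array([2, 0, 1])[clusters]

agreement = (clusters == renamed).mean()     # share of positions where the numbers agree
ari = adjusted_rand_score(clusters, renamed) # compares groupings, ignores the numbers

print(renamed)
print(agreement)
print(ari)
```

This is why comparing cluster numbers directly with the encoded `y` values would be misleading, and why a cross-tabulation is the safer way to look at the result.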

In [86]:
# Transforming numerical labels to wine types

true_wines = labelencoder_y.inverse_transform(y_train)

df = pd.DataFrame({'true_wines': true_wines, 'predicted_wines': predicted_wine_clusters}) 
pd.crosstab(df.true_wines, df.predicted_wines)



| true_wines \ predicted_wines | 0 | 1 | 2 |
|---|---|---|---|
| Barbera | 37 | 0 | 0 |
| Barolo | 0 | 44 | 0 |
| Grignolino | 3 | 4 | 54 |
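To put a number on how well the clusters line up with the wine types, one simple option is cluster purity: name each cluster after the wine that dominates it and count how many wines end up in "their" cluster. The sketch below does this with the table above copied into a NumPy array (`table` and `wines` are illustrative names); purity is just one of several possible measures.

```python
import numpy as np

wines = ["Barbera", "Barolo", "Grignolino"]

# Rows: true wines, columns: KMeans clusters 0, 1, 2 (copied from the table above)
table = np.array([[37,  0,  0],
                  [ 0, 44,  0],
                  [ 3,  4, 54]])

# Give each cluster the name of the wine that occurs most often in it
majority = table.argmax(axis=0)
for cluster, wine_idx in enumerate(majority):
    print(f"cluster {cluster} -> {wines[wine_idx]}")

# Purity: share of training wines that sit in a cluster dominated by their own type
purity = table.max(axis=0).sum() / table.sum()
print(f"purity = {purity:.3f}")
```

Here 135 of the 142 training wines fall into the cluster dominated by their own type, even though KMeans never saw the labels; only seven Grignolino wines land in the other two clusters.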
