# In-Class Basics


In [None]:
import numpy as np
import matplotlib.pyplot as plt

**Linear Regression**

The goal of this week's exercise is to explore a simple linear regression problem based on Portuguese white wine.

The dataset is based on
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. **Modeling wine preferences by data mining from physicochemical properties**. Published in Decision Support Systems, Elsevier, 47(4):547-553, 2009.



In [None]:
# The command below downloads the dataset into the Google Colab runtime.
# If you work with a local Anaconda setup, you can instead download
# the file directly using the link.

# Temporarily replaced link as the ML dataset archive seems to be down
#!wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
!wget https://raw.githubusercontent.com/zygmuntz/wine-quality/master/winequality/winequality-white.csv

--2021-05-10 08:16:34--  https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... failed: Connection timed out.
Retrying.


**Before we start**

The downloaded file contains data on 4898 wines. For each wine, 11 features are recorded (columns 0 to 10). The final column (column 11) contains the quality of the wine. This is what we want to predict.

List of columns/features: 
0. fixed acidity
1. volatile acidity
2. citric acid
3. residual sugar
4. chlorides
5. free sulfur dioxide
6. total sulfur dioxide
7. density
8. pH
9. sulphates
10. alcohol
11. quality
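The column order above can be kept in a small lookup list so that plots and slices refer to features by name rather than by number. This is only a sketch: the names `COLUMNS`, `FEATURES` and `TARGET` are hypothetical helpers and are not used elsewhere in this notebook.

```python
# Column names in file order, copied from the list above.
COLUMNS = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol", "quality",
]
FEATURES = COLUMNS[:11]  # columns 0 to 10
TARGET = COLUMNS[11]     # column 11, the value to predict

# Look up a column index by name instead of hard-coding the number.
alcohol_idx = COLUMNS.index("alcohol")
print(alcohol_idx, TARGET)
```

Such a list is also handy for labelling axes, for example `plt.xlabel(FEATURES[i])` inside a loop over the features.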



[file]: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv

In [None]:
# load all examples from the file
data = np.genfromtxt('winequality-white.csv',delimiter=";",skip_header=1)

print("data:", data.shape)

# Prepare for proper training
np.random.shuffle(data) # randomly sort examples

# take the first 3000 examples for training
# (remember array slicing from last week)
X_train = data[:3000,:11] # all features except last column
y_train = data[:3000,11]  # quality column

# and the remaining examples for testing
X_test = data[3000:,:11] # all features except last column
y_test = data[3000:,11] # quality column

print("First example:")
print("Features:", X_train[0])
print("Quality:", y_train[0])

('data:', (4898, 12))
First example:
('Features:', array([7.600e+00, 3.800e-01, 2.800e-01, 4.200e+00, 2.900e-02, 7.000e+00,
       1.120e+02, 9.906e-01, 3.000e+00, 4.100e-01, 1.260e+01]))
('Quality:', 6.0)
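Because `np.random.shuffle` draws a new permutation on every run, the training and test sets differ between runs. If reproducible splits are wanted, one possible approach is a seeded generator. The sketch below uses a small stand-in array instead of the wine data; the seed value and the toy split size are arbitrary assumptions.

```python
import numpy as np

# Stand-in for the wine array: 10 rows, 12 columns (11 features + target).
toy = np.arange(120, dtype=float).reshape(10, 12)

rng = np.random.default_rng(seed=0)  # seed value is an arbitrary choice
shuffled = rng.permutation(toy)      # shuffles rows and returns a copy

n_train = 7                          # toy split size, not the 3000 used above
X_tr, y_tr = shuffled[:n_train, :11], shuffled[:n_train, 11]
X_te, y_te = shuffled[n_train:, :11], shuffled[n_train:, 11]

print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)
```

Unlike `np.random.shuffle`, `rng.permutation` leaves the input array untouched, so the original row order stays available for inspection.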


# Homework

1. First we want to understand the data better. Plot (`plt.hist`) the distribution of each of the features for the training data as well as the 2D distribution (either `plt.scatter` or `plt.hist2d`) of each feature versus quality. Also calculate the correlation coefficient (`np.corrcoef`) for each feature with quality. Which feature by itself seems most predictive for the quality?

2. Calculate the linear regression weights as derived in the lecture. Numpy provides functions for matrix multiplication (`np.matmul`), matrix transposition (`.T`) and matrix inversion (`np.linalg.inv`).

3. Use the weights to predict the quality for the test dataset. How does your predicted quality compare with the true quality of the test data? Calculate the correlation coefficient between predicted and true quality and draw the scatter plot. 
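As a warm-up for tasks 2 and 3, the NumPy calls named above can be tried on synthetic data where the correct weights are known. This is a rough sketch, not the lecture's derivation: it assumes the closed-form solution $w = (X^T X)^{-1} X^T y$ and appends a column of ones for the intercept, which may differ from how the bias was handled in the lecture. All data and names here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 examples, 3 features, known weights plus noise.
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 4.0 + 0.1 * rng.normal(size=200)

# Append a column of ones so the model can learn an intercept (an assumption;
# the lecture's derivation may treat the bias differently).
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equations: w = (X^T X)^{-1} X^T y
w = np.matmul(np.linalg.inv(np.matmul(Xb.T, Xb)), np.matmul(Xb.T, y))

# Predictions and their correlation with the true targets.
y_pred = np.matmul(Xb, w)
r = np.corrcoef(y_pred, y)[0, 1]
print("weights:", w)
print("correlation:", r)
```

The recovered weights should land close to `true_w` and the intercept close to 4. Explicit inversion follows the wording of the task; `np.linalg.solve` or `np.linalg.lstsq` are commonly preferred when numerical stability matters.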

In [None]:
x = np.random.uniform(size=(3,4))

In [None]:
x

array([[0.27061972, 0.85093187, 0.06038869, 0.6430975 ],
       [0.05802941, 0.1492127 , 0.93073299, 0.70555297],
       [0.4806267 , 0.27201085, 0.75607278, 0.88637951]])

In [None]:
x[1,1]

0.14921269768865764

In [None]:
f = x[1:,2:]
print(f)

[[0.93073299 0.70555297]
 [0.75607278 0.88637951]]


In [None]:
f[0,0] = 999
print(f)

[[9.99000000e+02 7.05552973e-01]
 [7.56072781e-01 8.86379512e-01]]


In [None]:
x

array([[2.70619720e-01, 8.50931871e-01, 6.03886907e-02, 6.43097505e-01],
       [5.80294054e-02, 1.49212698e-01, 9.99000000e+02, 7.05552973e-01],
       [4.80626701e-01, 2.72010854e-01, 7.56072781e-01, 8.86379512e-01]])
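The cells above show that writing to the slice `f` also changed `x`: basic slicing returns a view on the same memory, not a copy. When an independent array is needed, `.copy()` can be used. A minimal sketch with a deterministic array (the variable names `view` and `copy` are illustrative only):

```python
import numpy as np

x = np.arange(12, dtype=float).reshape(3, 4)

view = x[1:, 2:]         # basic slicing returns a view on x's memory
copy = x[1:, 2:].copy()  # .copy() allocates independent memory

print(np.shares_memory(x, view))  # True
print(np.shares_memory(x, copy))  # False

copy[0, 0] = 999         # leaves x untouched
view[0, 0] = -1          # writes through to x[1, 2]
print(x[1, 2])
```

This matters for the train/test split earlier: `X_train` and `X_test` are views into `data`, so modifying them in place also modifies `data`.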