# Intro to Data Science
## Part VIII. - Deep Learning and its applications

### Table of contents

- #### Deep learning basics
    - <a href="#What-is-Deep-Learning?">Theory</a>
    - <a href="#1.-Architectures">Layer Architecture types</a>
        - Dense Neural Networks
            - Activation and Loss Functions
        - Convolutional Neural Networks
        - Recurrent Neural Networks
        - Word Embeddings
        - Regularization
    
---

# I. Deep learning basics

## What is Deep Learning?

> _Deep learning consists of neural networks with multiple hidden layers that learn increasingly abstract representations of input data._ [source](https://elitedatascience.com/keras-tutorial-deep-learning-in-python)

> _Deep learning is a class of neural network algorithms that:_
> - _Use a cascade of __multiple layers__ of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output from the previous layer as input._
> - _Learn in supervised (e.g., classification) and/or unsupervised (e.g., pattern analysis) settings._
> - _Learn __multiple levels of representations__ that correspond to __different levels of abstraction__; the levels form a hierarchy of concepts._  
[source](https://en.wikipedia.org/wiki/Deep_learning#Definition)

## Why is it important?

Deep learning is widely used in our daily lives. It powers web search engines, recommender systems, image recognition systems, and self-driving cars. It enables the generation of realistic sound, images, and text, as well as the development of advanced AI agents.  

It represents the current state-of-the-art in machine learning for many tasks, including image recognition, text mining, and classification.

## Tools
- TensorFlow
- PyTorch
- Keras
- Gensim (for word embeddings)
- Hugging Face `transformers` (for pre-trained deep learning models)
- (Optional: Scikit-Learn, primarily for simpler neural networks)

# II. Deep Neural Network Architectures

## [Dense Feedforward Network](https://keras.io/api/layers/core_layers/dense/)

A **dense layer** is a fully connected neural network layer where each neuron receives input from all the neurons in the previous layer. This makes it **densely connected**. The layer is parameterized by a **weight matrix (W)** and a **bias vector (b)**, which it applies to the activations of the previous layer (a).  

The following is the definition from the Keras documentation:

> `output = activation(dot(input, kernel) + bias)`

where:
- **activation** is the non-linear activation function passed as an argument.
- **kernel** is the weight matrix learned by the layer.
- **bias** is the bias vector.
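
To make the formula concrete, here is a minimal NumPy sketch of the same computation. It is not how Keras implements the layer internally; the function name `dense_forward` and the toy values are made up for illustration.

```python
import numpy as np

def relu(x):
    """ReLU activation: negative values become 0."""
    return np.maximum(0.0, x)

def dense_forward(inputs, kernel, bias, activation):
    """Compute activation(dot(inputs, kernel) + bias) for a batch of inputs."""
    return activation(np.dot(inputs, kernel) + bias)

# Hypothetical toy values: 2 samples with 3 features, mapped to 2 units.
inputs = np.array([[1.0, 2.0, 3.0],
                   [0.0, -1.0, 1.0]])
kernel = np.array([[0.5, -1.0],
                   [0.0, 1.0],
                   [1.0, 0.0]])      # shape: (n_features, n_units)
bias = np.array([0.0, -2.0])         # one bias value per unit

dense_out = dense_forward(inputs, kernel, bias, relu)
print(dense_out)  # shape (2, 2): one row per sample, one column per unit
```

Note that the kernel has one row per input feature and one column per unit, so every unit sees every input feature.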

---

### [Activation Functions](https://keras.io/api/layers/activations/)

Activation functions introduce **non-linearity** into neural networks, allowing them to learn complex patterns. Here are some commonly used activation functions:

- **Sigmoid**: $\frac{{\rm e}^x}{{\rm e}^x + 1}$  
  - Maps input values to the range (0,1). Useful for binary classification.
- **Tanh**: $\tanh(x)$  
  - Maps input values to the range (-1,1), making it zero-centered.
- **ReLU (Rectified Linear Unit)**: $\max(0, x)$  
  - Sets negative values to 0 while keeping positive values unchanged. Helps mitigate vanishing gradient problems.
- **Softmax**:  $\frac{{\rm e}^{x_i}}{\sum{{\rm e}^{x_j}}}$  
  - Converts raw scores into probabilities for multi-class classification.
- **Hierarchical Softmax**  
  - Used for large output spaces, speeding up computations.
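
The formulas above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the Keras implementation; the sigmoid is written as $1 / (1 + e^{-x})$, which is algebraically the same as the form given above, and the softmax subtracts the maximum score first for numerical stability.

```python
import numpy as np

def sigmoid(x):
    """Squash values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Keep positive values, set negative values to 0."""
    return np.maximum(0.0, x)

def softmax(x):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)  # avoids overflow in exp
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

scores = np.array([-2.0, 0.0, 2.0])  # hypothetical raw scores

print("sigmoid:", sigmoid(scores))
print("tanh:   ", np.tanh(scores))
print("relu:   ", relu(scores))
print("softmax:", softmax(scores))
```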

#### Further Reading:
- [Deep Learning: Neurons and Activation Functions](https://medium.com/@srnghn/deep-learning-overview-of-neurons-and-activation-functions-1d98286cf1e4)
- [Choosing the Right Activation Function](https://www.analyticsvidhya.com/blog/2017/10/fundamentals-deep-learning-activation-functions-when-to-use-them/)
- [Stanford CS231n: Activation Functions](http://cs231n.github.io/neural-networks-1/#actfun)

---

### [Loss Functions](https://keras.io/api/losses/)

Loss functions measure how well a model's predictions match the actual values. Some commonly used loss functions include:

- **Mean Squared Error (MSE)**:  
  $$ MSE = \frac{1}{n} \sum (y_{\text{true}} - y_{\text{pred}})^2 $$  
  - Penalizes large errors more than small ones. Used for regression tasks.
  
- **Mean Absolute Error (MAE)**:  
  $$ MAE = \frac{1}{n} \sum \left| y_{\text{true}} - y_{\text{pred}} \right| $$  
  - Measures absolute differences. More robust to outliers than MSE.

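- **Categorical Hinge Loss**:  
  $$ \max(0, 1 - t \cdot y) $$  
  - A margin-based loss. The formula shows the binary hinge loss, where $t \in \{-1, +1\}$ is the true label and $y$ is the raw model score; the categorical variant applies the same margin idea to multi-class classification.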

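- **Cross-Entropy Loss**:  
  $$ V(f(x), t) = -t \ln(f(x)) - (1 - t) \ln(1 - f(x)) $$  
  - Used for classification tasks. The formula shows the binary case, where $t \in \{0, 1\}$ is the true label and $f(x)$ is the predicted probability; the multi-class (categorical) version sums $-t_i \ln(f_i(x))$ over all classes.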

- **Cosine Proximity**  
  - Measures similarity between predicted and true vectors.
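
As a rough illustration, the losses above can be written directly in NumPy. The function names and toy data are chosen here for the example and are not the Keras API; the cross-entropy is the binary form shown above, with predictions clipped to avoid `log(0)`.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def hinge(t, y):
    """Binary hinge loss; t must be -1 or +1, y is the raw score."""
    return np.mean(np.maximum(0.0, 1.0 - t * y))

def binary_cross_entropy(t, p, eps=1e-12):
    """Binary cross-entropy; t is 0 or 1, p is the predicted probability."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.mean(-t * np.log(p) - (1.0 - t) * np.log(1.0 - p))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical regression data with one large error (an outlier).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 8.0])

mse_value = mse(y_true, y_pred)  # 4.0: the squared error dominates
mae_value = mae(y_true, y_pred)  # 1.0: less sensitive to the outlier
print(mse_value, mae_value)
```

The single outlier contributes 16 to the squared-error sum but only 4 to the absolute-error sum, which is why MAE is described as more robust to outliers.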

#### Further Reading:
- [Stanford CS231n: Loss Functions](http://cs231n.github.io/neural-networks-2/#loss-functions)
- [Choosing Loss and Activation Functions](https://towardsdatascience.com/deep-learning-which-loss-and-activation-functions-should-i-use-ac02f1c56aa8)

---

### [Regularization](https://chatbotslife.com/regularization-in-deep-learning-f649a45d6e0)

Regularization techniques help **prevent overfitting** by constraining the model’s complexity.

#### **Early Stopping**
Early stopping monitors a validation loss criterion during training. If the loss stops improving for a specified number of iterations (patience parameter), training is halted to prevent overfitting.
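
The patience logic can be sketched in plain Python. This is a simplified illustration of the idea, not the actual implementation of the Keras `EarlyStopping` callback; the function name and the validation losses are invented for the example.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the epoch index at which training would stop, or None.

    Training stops once the validation loss has failed to improve
    on its best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss   # new best value: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)  # 5: epochs 3, 4 and 5 did not improve on 0.60
```

In this toy run the loss would have improved again at epoch 6, which shows the trade-off: a small patience saves time but can stop too early.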

#### **Dropout**
<img src="pics/dl_dense_dropout_network.png" width=500>  

*By [Srivastava et al. (2014)](http://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf)*  

Dropout is a **regularization technique** that randomly drops a fraction of neurons during training. This prevents co-adaptation of neurons and reduces overfitting.  

The Dropout layer in Keras (`keras.layers.Dropout`) takes a float between 0 and 1, representing the fraction of input units to drop.  

From the Keras documentation:

> Dropout consists in randomly setting a fraction rate of input units to 0 at each update during training time, which helps prevent overfitting.
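
A minimal NumPy sketch of dropout at training time follows. It uses the common "inverted dropout" formulation, in which the surviving activations are scaled by `1 / (1 - rate)` so that the expected sum stays unchanged; the function name and the toy input are assumptions made for illustration.

```python
import numpy as np

def dropout(x, rate, rng):
    """Zero out roughly `rate` of the entries and rescale the rest."""
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob  # True where the unit is kept
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 10))  # hypothetical layer activations

dropped = dropout(activations, rate=0.2, rng=rng)

print("fraction set to zero:", np.mean(dropped == 0))
print("mean after dropout:  ", dropped.mean())
```

At prediction time no units are dropped, so the layer simply passes its input through.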

#### **Weight Penalty (L1 / L2 Regularization)**
- **L1 Regularization (Lasso)**: Adds an absolute value penalty, promoting sparsity in weights.
- **L2 Regularization (Ridge)**: Adds a squared value penalty, reducing large weights but keeping all features.
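
Both penalties are simply extra terms added to the loss. The sketch below computes them for a hypothetical weight vector; `lam` (the regularization strength) and the weight values are assumptions for the example.

```python
import numpy as np

def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weights."""
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weights."""
    return lam * np.sum(weights ** 2)

weights = np.array([0.5, -2.0, 0.0, 1.5])  # hypothetical layer weights
lam = 0.01                                 # hypothetical strength

l1_value = l1_penalty(weights, lam)  # 0.01 * 4.0
l2_value = l2_penalty(weights, lam)  # 0.01 * 6.5
print(l1_value, l2_value)
```

The weight of -2.0 contributes 2.0 to the L1 sum but 4.0 to the L2 sum, which is why L2 pushes hardest against large weights.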

#### Further Reading:
- [Stanford CS231n: Regularization](http://cs231n.github.io/neural-networks-2/#reg)

---

### In Practice
#### Building a simple dense network to classify hand-written digits

#### 1. Loading data

In [None]:
import numpy as np

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [None]:
X, y = load_digits(return_X_y=True)
yt = OneHotEncoder(categories='auto', sparse_output=False).fit_transform(y.reshape(-1, 1))

X_train, X_test, y_train, y_test = train_test_split(X, yt, random_state=42)

#### 2. Model Construction

We'll use the [`TensorFlow`](https://www.tensorflow.org/) library to define neural networks.  

To install TensorFlow, activate your environment and run:

```bash
conda activate szisz_ds_2025
conda install tensorflow
```

In [None]:
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Input, Dense
from keras.callbacks import EarlyStopping, TensorBoard

In [None]:
model = Sequential([
    Input((64,)),
    Dense(8, activation='relu'),
    Dense(10, activation='softmax'),
])

#### 3. Assembly

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

##### 3.a Model validation

In [None]:
model.summary()

#### 4. Creating callback functions

In [None]:
# stopping early to prevent overfitting
earlystopping = EarlyStopping(patience=3)

# monitor the training process through a UI
tensorboard = TensorBoard(
    log_dir='tensor', 
    histogram_freq=0, 
    write_graph=True, 
    write_images=True, 
    update_freq='epoch'
)

#### 5. Model training

In [None]:
model.fit(
    X_train, y_train,                        # training data
    batch_size=16,                           # number of samples per gradient update
    epochs=100,                              # maximum number of passes over the training data
    validation_data=(X_test, y_test),        # validation dataset
    callbacks=[earlystopping, tensorboard]   # callbacks invoked during training
)

##### Follow the training process through the [TensorBoard UI](https://www.tensorflow.org/tensorboard/get_started)

Run at your terminal:
```bash
tensorboard --logdir tensor
```

Then open in your browser (6006 is TensorBoard's default port; the exact URL is printed when it starts):
```bash
http://localhost:6006
```

#### 6. Model evaluation

In [None]:
loss, acc = model.evaluate(X_test, y_test)
print(f'test loss: {loss}, test acc: {acc}')

#### 7. Prediction

In [None]:
def predict_classes(model, X):
    predictions = model.predict(X)
    predicted_classes = np.argmax(predictions, axis=-1)
    return predicted_classes

In [None]:
predict_classes(model, X)

#### Exercise: Build a classification model for the iris dataset

---

### [Convolutional Neural Network (CNN)](https://keras.io/layers/convolutional/)

<img src="pics/dl_cnn.png" width=600 alt="Typical CNN architecture"><br>By <a href="//commons.wikimedia.org/w/index.php?title=User:Aphex34&amp;action=edit&amp;redlink=1" class="new" title="User:Aphex34 (page does not exist)">Aphex34</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=45679374">Link</a>

> _Convolutional Neural Networks are very similar to ordinary Neural Networks: they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other. And they still have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer and all the tips/tricks we developed for learning regular Neural Networks still apply._  
> _So what changes? ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the amount of parameters in the network._ - [source](http://cs231n.github.io/convolutional-networks/)

> _Convolutional Neural Networks have a different architecture than regular Neural Networks. Regular Neural Networks transform an input by putting it through a series of hidden layers. Every layer is made up of a set of neurons, where each layer is fully connected to all neurons in the layer before. Finally, there is a last fully-connected layer — the output layer — that represent the predictions._  
> _Convolutional Neural Networks are a bit different. First of all, the layers are organised in 3 dimensions: width, height and depth. Further, the neurons in one layer do not connect to all the neurons in the next layer but only to a small region of it. Lastly, the final output will be reduced to a single vector of probability scores, organized along the depth dimension._ - [source](https://medium.freecodecamp.org/an-intuitive-guide-to-convolutional-neural-networks-260c2de0a050)

### **Key Components of a CNN**
A Convolutional Neural Network consists of several building blocks:
- **Convolutional layers**: Feature extraction
- **Pooling layers**: Feature selection and dimensionality reduction
- **Fully connected layers**: Classification

#### **Convolutional Layer**

A convolutional layer applies a set of learnable filters (kernels) to detect patterns in the input. In image processing, these patterns range from simple edges to complex textures and objects.

> _In mathematics convolution is a mathematical operation on two functions (f and g) to produce a third function that expresses how the shape of one is modified by the other._ - [source](https://en.wikipedia.org/wiki/Convolution)

<img src="pics/dl_convolution.gif" alt="Convolution operation"><br>By Brian Amberg, derivative work: <a href="//commons.wikimedia.org/wiki/User:Tinos" title="User:Tinos">Tinos</a>, <a href="https://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=11003835">Link</a>

The key parameters of a convolutional layer are:
- **Depth**: Number of filters used in the layer
- **Stride**: Step size when moving the filter over the input
- **Padding**: Adding zero-padding to retain spatial dimensions
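
Stride and padding determine the spatial size of the output. A small helper (the name `conv_output_size` is made up here) applies the standard formula `(input + 2 * padding - kernel) // stride + 1` to the 8×8 digit images used in the practice section of this notebook:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution or pooling layer."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# 8x8 input, 3x3 filter, stride 1, no padding
after_conv = conv_output_size(8, 3, stride=1, padding=0)  # 6

# 2x2 max pooling with stride 2 on the 6x6 feature maps
after_pool = conv_output_size(after_conv, 2, stride=2)    # 3

# flattening 32 feature maps of size 3x3
flattened = after_pool * after_pool * 32                  # 288

print(after_conv, after_pool, flattened)
```

With a padding of 1, the same 3×3 filter would keep the 8×8 size unchanged.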

During convolution, the filter slides over the input; at each position it is multiplied element-wise with the underlying input patch and the products are summed, producing one entry of the feature map.
> _We execute a convolution by sliding the filter over the input. At every location, a matrix multiplication is performed and sums the result onto the feature map._  
> _In the animation below, you can see the convolution operation. You can see the filter (the green square) is sliding over our input (the blue square) and the sum of the convolution goes into the feature map (the red square)._ - [source](https://medium.freecodecamp.org/an-intuitive-guide-to-convolutional-neural-networks-260c2de0a050)

<div style="display: inline-block;">
<img src="pics/dl_sliding_window.gif" width=400 align='left'>
<img src="pics/dl_filter.png" width=400 align='left'>
</div>

<div style='align: clear'>
<br>
Animation by <a href="https://towardsdatascience.com/@ardendertat">Arden Dertat</a>, <a href="https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2">Link</a> 
Image by <a href="https://towardsdatascience.com/@ardendertat">Arden Dertat</a>, <a href="https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2">Link</a>
</div>
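
The sliding-window operation can be written as a naive NumPy loop. This sketch handles a single channel and a single filter with no padding, and, like deep learning libraries generally do, it does not flip the kernel (strictly speaking, this is cross-correlation). The function name and the toy image and filter are chosen for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + kh, c:c + kw]        # region under the filter
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]], dtype=float)

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]], dtype=float)

feature_map = conv2d_valid(image, kernel)
print(feature_map)
# [[4. 3. 4.]
#  [2. 4. 3.]
#  [2. 3. 4.]]
```

A real convolutional layer repeats this for every filter and sums over all input channels, then adds a bias and applies the activation function.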

#### **Pooling Layer**

<img src="pics/dl_pooling.png" width=400><br>By <a href="https://cs.stanford.edu/people/karpathy/">Andrej Karpathy</a>, <a href="http://cs231n.github.io/convolutional-networks/">Link</a>

Pooling reduces the spatial dimensions of the input, making the network computationally efficient while retaining important features. It helps prevent overfitting and reduces the number of parameters.

The most common pooling operation is **max pooling**, which selects the highest value in a given region. 
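
Max pooling can be sketched with the same kind of loop as the convolution. The function name and the 4×4 toy feature map are assumptions for the example; with a 2×2 window and stride 2, each output value is the maximum of one non-overlapping block.

```python
import numpy as np

def max_pool2d(feature_map, pool_size=2, stride=2):
    """Take the maximum of each pool_size x pool_size window."""
    out_h = (feature_map.shape[0] - pool_size) // stride + 1
    out_w = (feature_map.shape[1] - pool_size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = feature_map[r:r + pool_size, c:c + pool_size]
            pooled[i, j] = np.max(window)
    return pooled

feature_map = np.array([[1, 1, 2, 4],
                        [5, 6, 7, 8],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4]], dtype=float)

pooled = max_pool2d(feature_map)
print(pooled)
# [[6. 8.]
#  [3. 4.]]
```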

#### **Fully Connected Layer**
A standard fully connected (dense) layer follows the convolutional and pooling layers. It maps the extracted features to class scores; the network is then trained with an appropriate loss function, such as cross-entropy loss for multi-class problems.

---

### **Further Reading**
- [An Intuitive Guide to CNNs](https://medium.freecodecamp.org/an-intuitive-guide-to-convolutional-neural-networks-260c2de0a050)
- [CS231n: Convolutional Networks](http://cs231n.github.io/convolutional-networks/)
- [Applied Deep Learning: CNNs](https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2)
- [Beginner’s Guide to CNNs](https://www.analyticsvidhya.com/blog/2018/12/guide-convolutional-neural-network-cnn/)
- [Keras CNN Tutorial](https://github.com/ardendertat/Applied-Deep-Learning-with-Keras/blob/master/notebooks/Part%204%20%28GPU%29%20-%20Convolutional%20Neural%20Networks.ipynb)

---

### In Practice

#### Build a CNN classifier for the hand-written digits dataset

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

from keras.layers import Dense, Flatten
from keras.layers import Conv2D, MaxPooling2D

In [None]:
X, y = load_digits(return_X_y=True)

In [None]:
# number of samples, height, width, channels (1 channel: grayscale)
Xt = X.reshape((X.shape[0], 8, 8, 1))
yt = OneHotEncoder(categories='auto', sparse_output=False).fit_transform(y.reshape(-1, 1))

X_train, X_test, y_train, y_test = train_test_split(Xt, yt, random_state=42)

In [None]:
sns.heatmap(Xt[1, :, :, 0], cmap="gray")

In [None]:
model = Sequential([
    Input((8, 8, 1)),
    Conv2D(32, kernel_size=(3, 3), strides=(1, 1), activation='relu'),
    MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    Flatten(),
    Dense(10, activation='softmax')
])

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
model.fit(
    X_train, y_train,                  # training data
    batch_size=16,                     # number of samples per gradient update
    epochs=100,                        # maximum number of passes over the training data
    validation_data=(X_test, y_test),  # validation dataset
    callbacks=[earlystopping],         # callbacks invoked during training
)

In [None]:
loss, acc = model.evaluate(X_test, y_test)
print(f'test loss: {loss}, test acc: {acc}')

#### Exercise: Build a CNN for the MNIST classification problem

If you get stuck, use [this code](https://github.com/adventuresinML/adventures-in-ml-code/blob/master/keras_cnn.py) and the accompanying [tutorial](https://adventuresinmachinelearning.com/keras-tutorial-cnn-11-lines/).

In [None]:
from keras.datasets import mnist
from keras.utils import to_categorical

num_classes = 10

# input image dimensions
img_x, img_y = 28, 28

# load the MNIST data set, which already splits into train and test sets for us
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# because the MNIST is greyscale, we only have a single channel
X_train = X_train.reshape()  # TODO: fill in the required shape 
X_test = X_test.reshape()    # TODO: fill in the required shape 
input_shape = ()           # TODO: fill in the required shape 

# keras built-in OneHotEncoder solution
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)

In [None]:
# plot the first image in X_train with sns.heatmap


In [None]:
# define model here
model = Sequential([
    
])

In [None]:
# compile model here


In [None]:
model.summary()

In [None]:
# fit model


In [None]:
# evaluate model


---

### [Recurrent Neural Networks (RNN)](https://keras.io/layers/recurrent/)

<img src="pics/dl_rnn.svg" alt="Recurrent neural network unfold.svg" height="213" width="640"><br>By <a href="//commons.wikimedia.org/wiki/User:Ixnay" title="User:Ixnay">François Deloche</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=60109157">Link</a>


> _A recurrent neural network (RNN) is a class of artificial neural network where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition._ - [source](https://en.wikipedia.org/wiki/Recurrent_neural_network)

RNNs are also commonly applied to time-series forecasting.

RNNs differ from traditional neural networks because they have memory, allowing them to retain and utilize information from previous time steps. This is achieved by passing outputs from one step as inputs to the next, effectively creating a chain-like structure.

> _A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop: as you can see above, this chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists._ - [source](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
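
The "multiple copies of the same network" idea can be sketched as a loop that reuses one set of weights at every time step. This is a minimal simple-RNN forward pass in NumPy, assuming a `tanh` activation and randomly initialized weights; the names `W_x`, `W_h` and `b` are chosen here for illustration.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a simple RNN over a sequence and return the hidden state at each step."""
    h = np.zeros(W_h.shape[0])  # initial hidden state (the "memory")
    states = []
    for x_t in inputs:
        # the new state depends on the current input AND the previous state
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, input_size, hidden_size = 5, 3, 4  # hypothetical sizes

inputs = rng.normal(size=(seq_len, input_size))
W_x = rng.normal(size=(input_size, hidden_size)) * 0.1
W_h = rng.normal(size=(hidden_size, hidden_size)) * 0.1
b = np.zeros(hidden_size)

states = rnn_forward(inputs, W_x, W_h, b)
print(states.shape)  # (5, 4): one hidden state per time step
```

Because the same `W_h` is applied at every step, gradients are multiplied by it repeatedly during training, which is where the difficulty with long sequences comes from.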

#### The Challenge: Long-Term Dependencies

A major issue with standard RNNs is their difficulty in capturing long-term dependencies in sequences. For example, in the sentence: _"I grew up in **France**... that's why I speak fluent **French**."_, the model must remember "France" to correctly infer "French." However, as the gap between relevant information increases, RNNs struggle to retain context due to the vanishing gradient problem.

To address this, more advanced architectures such as Long Short-Term Memory (LSTM) networks were developed.

---

### [Long Short-Term Memory (LSTM) Networks](https://keras.io/layers/recurrent/#lstm)

LSTMs are designed to handle long-term dependencies more effectively. While they follow the same chain-like structure as RNNs, their internal mechanisms are different, allowing them to selectively retain or discard information.

<img src="pics/dl_lstm.png" width="600"><br>By <a href="https://colah.github.io/about.html">Christopher Olah</a>, <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Link</a>

> _LSTMs are a specialized form of recurrent neural networks with feedback connections that enable them to process sequences of data. Unlike standard RNNs, LSTMs include mechanisms that prevent the vanishing gradient problem, allowing them to learn long-term dependencies. LSTMs are widely used in applications like speech recognition, language modeling, and time-series prediction._ - [source](https://en.wikipedia.org/wiki/Long_short-term_memory)

#### Key Components of LSTMs:

- **Cell state** – Remembers values over arbitrary time intervals; the gates below regulate the flow of information into and out of the cell.
- **Forget gates** – Decide what information to discard from the previous state, by mapping the previous state and the current input to a value between 0 and 1. A (rounded) value of 1 signifies retention of the information, and a value of 0 represents discarding.
- **Input gates** – Decide which pieces of new information to store in the current cell state, using the same system as forget gates.
- **Output gates** – Control which pieces of information in the current cell state to output, by assigning a value from 0 to 1 to the information, considering the previous and current states.

By using these gates, LSTMs can learn which information is important and maintain it across long sequences, making them superior to standard RNNs in many tasks.
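
A single LSTM time step can be sketched in NumPy as follows. This follows the commonly used formulation with forget, input and output gates; the parameter names and the random toy weights are assumptions made for illustration, and Keras organizes its weights differently. To show the "selective retention" effect, the forget-gate bias is set high and the input-gate bias low, so the cell state is carried over almost unchanged.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: returns the new hidden state and the new cell state."""
    z = np.concatenate([h_prev, x_t])             # previous state + current input
    f = sigmoid(p["W_f"] @ z + p["b_f"])          # forget gate: what to discard
    i = sigmoid(p["W_i"] @ z + p["b_i"])          # input gate: what to store
    o = sigmoid(p["W_o"] @ z + p["b_o"])          # output gate: what to expose
    c_candidate = np.tanh(p["W_c"] @ z + p["b_c"])
    c_t = f * c_prev + i * c_candidate            # updated cell state
    h_t = o * np.tanh(c_t)                        # updated hidden state
    return h_t, c_t

hidden_size, input_size = 4, 3  # hypothetical sizes
rng = np.random.default_rng(0)
params = {}
for gate in ("f", "i", "o", "c"):
    params["W_" + gate] = rng.normal(size=(hidden_size, hidden_size + input_size)) * 0.1
    params["b_" + gate] = np.zeros(hidden_size)

params["b_f"] += 10.0  # forget gate close to 1: keep the old cell state
params["b_i"] -= 10.0  # input gate close to 0: ignore the new candidate

x_t = rng.normal(size=input_size)
h_prev = np.zeros(hidden_size)
c_prev = np.ones(hidden_size)

h_t, c_t = lstm_step(x_t, h_prev, c_prev, params)
print(c_t)  # stays very close to c_prev
```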

---

### Further Reading:

- [Understanding LSTMs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [A High-Level Introduction to LSTMs](https://medium.com/datadriveninvestor/a-high-level-introduction-to-lstms-34f81bfa262d)
- [LSTM Overview by Skymind](https://skymind.ai/wiki/lstm)
- [Keras: Understanding `return_state` and `return_sequences`](https://www.dlology.com/blog/how-to-use-return_state-or-return_sequences-in-keras/)

---

### In Practice

#### Build a sentiment predictor on movie reviews

Based on [this](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/) tutorial.

In [None]:
from keras.layers import Embedding
from keras.layers import LSTM

from keras.datasets import imdb

from keras.utils import pad_sequences

In [None]:
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

In [None]:
max_review_length = 500

# pad_sequences pads (or truncates) every document in the corpus to the same length
X_train = pad_sequences(X_train, maxlen=max_review_length)
X_test = pad_sequences(X_test, maxlen=max_review_length)

In [None]:
embedding_vector_length = 32

model = Sequential([
    Input((max_review_length,)),
    Embedding(input_dim=top_words,                 # number of words in the vocabulary
              output_dim=embedding_vector_length),  # size of each embedding vector
    LSTM(units=100),
    Dense(1, activation='sigmoid')
])

In [None]:
model.compile(
    loss='binary_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy']
)
model.summary()

In [None]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)

In [None]:
score = model.evaluate(X_test, y_test, batch_size=16)
print('test loss: {}, test accuracy: {}'.format(*score))

#### Exercise: Predict simulated stock prices

Follow this [tutorial](https://stackabuse.com/time-series-analysis-with-lstm-using-pythons-keras-library/).

---

### [Word](https://keras.io/layers/embeddings/) [Embeddings](https://radimrehurek.com/gensim/models/word2vec.html)

<div style="display: flex; gap: 20px;">
  <img src="pics/dl_king_queen_embedding.png" width=400>
  <img src="pics/dl_king_queen_composition.png" width=400>
</div>

<div style='align: clear'/>
<br>Images from <a href="https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/">The Morning Paper</a>

> _Word embeddings refer to a set of techniques in natural language processing (NLP) that map words or phrases from a vocabulary into continuous vector spaces with much lower dimensions. These embeddings capture semantic relationships between words, enabling models to understand context and meaning._  
> _Methods for generating word embeddings include neural networks, dimensionality reduction techniques applied to word co-occurrence matrices, probabilistic models, and explicit representations based on word contexts._ - [source](https://en.wikipedia.org/wiki/Word_embedding)

The key intuition behind word embeddings is that words appearing in similar contexts tend to have similar meanings.

### Training Word Embeddings

<div style="display: flex; gap: 20px;">
  <img src="pics/dl_w2v_training_data.png" width=300>
  <img src="pics/dl_w2v_skip_grams.png" width=300>
  <img src="pics/dl_w2v_weight_matrix.png" width=300>
</div>

<div style='align: clear'/>
<br>Images from <a href="http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/">Word2Vec Tutorial - The Skip-Gram Model</a>, by <a href="http://mccormickml.com/">Chris McCormick</a>

There are two primary architectures for learning word embeddings:

- **Continuous Bag-of-Words (CBOW)**: Predicts a target word based on surrounding context words in a given window. The model is order-invariant, meaning word order in the context does not affect predictions.
- **Skip-Gram Model**: Predicts context words based on a given target word. Context words closer to the target word are weighted more heavily than those further away.
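
The difference between the two architectures is easiest to see in the training examples they generate from the same sentence. The plain-Python sketch below builds both kinds of examples for a fixed window size; the function names and the sentence are invented for illustration, and distance-based weighting of context words is left out.

```python
def skip_gram_pairs(tokens, window):
    """(target, context) pairs: the target word is used to predict each context word."""
    pairs = []
    for idx, target in enumerate(tokens):
        start = max(0, idx - window)
        end = min(len(tokens), idx + window + 1)
        for ctx_idx in range(start, end):
            if ctx_idx != idx:
                pairs.append((target, tokens[ctx_idx]))
    return pairs

def cbow_examples(tokens, window):
    """(context words, target) examples: the context is used to predict the target."""
    examples = []
    for idx, target in enumerate(tokens):
        context = tokens[max(0, idx - window):idx] + tokens[idx + 1:idx + window + 1]
        examples.append((context, target))
    return examples

tokens = "the quick brown fox jumps".split()

pairs = skip_gram_pairs(tokens, window=1)
examples = cbow_examples(tokens, window=1)

print(pairs[:3])    # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
print(examples[2])  # (['quick', 'fox'], 'brown')
```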

#### How Neural Network Weights Become Embedding Vectors

The Word2Vec model is trained as a shallow neural network with one hidden layer. It learns to predict words based on their surrounding context (CBOW) or vice versa (Skip-Gram). The key insight is that after training, we **discard the output layer** and use the **weights of the hidden layer** as the word embeddings. 

1. **Input Representation**: Each word in the vocabulary is assigned a unique one-hot vector (a sparse vector where only one element is 1, and the rest are 0).
2. **Projection Layer (Hidden Layer Weights)**: The one-hot input is multiplied by a weight matrix $W$, mapping it into a dense vector space of lower dimensionality.
3. **Output Layer (Discarded After Training)**: The model is trained to predict either a target word (CBOW) or context words (Skip-Gram) using another weight matrix $W'$. However, once training is complete, we do not need this layer for embeddings.
4. **Final Embedding Extraction**: The trained weights of the **first weight matrix $W$ (input-to-hidden layer)** become the word embeddings. Each row in this matrix corresponds to a word's dense vector representation.

This process allows similar words (based on their contexts) to have similar vector representations, capturing semantic relationships like **"King - Man + Woman = Queen"**.
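
Why the rows of $W$ are the embeddings becomes clear from a small NumPy check: multiplying a one-hot vector by $W$ just selects one row. The vocabulary and the random (untrained) weights below are made up for illustration.

```python
import numpy as np

vocab = ["king", "queen", "man", "woman"]  # hypothetical tiny vocabulary
embedding_dim = 3

rng = np.random.default_rng(0)
W = rng.normal(size=(len(vocab), embedding_dim))  # input-to-hidden weights

word_index = vocab.index("queen")

# Step 1: one-hot input vector for the word
one_hot_vec = np.zeros(len(vocab))
one_hot_vec[word_index] = 1.0

# Step 2: projection through the hidden layer
projected = one_hot_vec @ W

# The same result, obtained by simply reading row `word_index` of W
looked_up = W[word_index]

print(projected)
print(looked_up)
```

This is also why embedding layers are implemented as a table lookup rather than an actual matrix multiplication.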

### Further Reading

- [Vector Representations of Words (TensorFlow)](https://www.tensorflow.org/tutorials/representation/word2vec#vector-representations-of-words)
- [How Does Word2Vec Work? (Quora)](https://www.quora.com/How-does-word2vec-work-Can-someone-walk-through-a-specific-example)
- [Word2Vec Tutorial - Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
- [Introduction to Word Embeddings and Word2Vec](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
- [Neural Network Embeddings Explained](https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526)
- [Word Embeddings in NLP and Their Applications](https://hackernoon.com/word-embeddings-in-nlp-and-its-applications-fab15eaf7430)
- [Build Your Own Embedding and Use It in a Neural Network](https://blog.cambridgespark.com/tutorial-build-your-own-embedding-and-use-it-in-a-neural-network-e9cde4a81296)
- [Word2Vec Wiki - Skymind AI](https://skymind.ai/wiki/word2vec)
- [Word2Vec Graph Visualization](https://github.com/anvaka/word2vec-graph)
- [Using Word Embedding Layers in Deep Learning with Keras](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/)
- [Handling Text Data Using Keras Embedding Layer](https://heartbeat.fritz.ai/using-a-keras-embedding-layer-to-handle-text-data-2c88dc019600)
- [Google Word2Vec Archive](https://code.google.com/archive/p/word2vec/)

In [None]:
import numpy as np

from tensorflow.keras.preprocessing.text import one_hot

In [None]:
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']

labels = np.array([1, 1, 1, 1, 1,
                   0, 0, 0, 0, 0])

In [None]:
# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)

In [None]:
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

In [None]:
# define the model
model = Sequential([
    Input((max_length,)),
    Embedding(vocab_size, 8),
    Flatten(),
    Dense(1, activation='sigmoid')
])

In [None]:
# compile the model
model.compile(
    optimizer='adam', 
    loss='binary_crossentropy', 
    metrics=['accuracy']
)

In [None]:
# summarize the model
model.summary()  # prints the summary itself; wrapping it in print() would also print "None"

In [None]:
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)

In [None]:
# evaluate the model
score = model.evaluate(padded_docs, labels, verbose=0)
print('loss: {}, accuracy: {}'.format(*score))

Test model with an example:

In [None]:
text = "good effort"
enc_text = [one_hot(text, vocab_size)]
pad_text = pad_sequences(enc_text, maxlen=max_length, padding='post')
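# the model has a single sigmoid output, so threshold the probability at 0.5
# (predict_classes uses argmax, which always returns 0 for one output unit)
pred_text = (model.predict(pad_text) > 0.5).astype(int)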

text, enc_text, pad_text, pred_text

#### Exercise: News classification

Classify documents from the 20 Newsgroups dataset while learning an embedding. As a first step, try to separate the atheism documents (`alt.atheism`) from the Christian documents (`soc.religion.christian`).

---

### Further tutorials:
- https://www.pyimagesearch.com/2018/09/10/keras-tutorial-how-to-get-started-with-keras-deep-learning-and-python/
- https://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/
- https://www.datacamp.com/community/tutorials/deep-learning-python
- https://elitedatascience.com/keras-tutorial-deep-learning-in-python
- https://www.guru99.com/keras-tutorial.html
- https://github.com/adventuresinML/adventures-in-ml-code