<div>
<img src="https://discuss.pytorch.org/uploads/default/original/2X/3/35226d9fbc661ced1c5d17e374638389178c3176.png" width="400" style="margin: 50px auto; display: block; position: relative; left: -30px;" />
</div>

<!--NAVIGATION-->
# < [Transfer Learning](8-Transfer-Learning.ipynb) | Pretrained models for NLP |

### When What you Want is Already Out There

Sometimes, the task you want to solve aligns well with pre-existing tasks. If that is the case, you're in luck, as you can leverage work done by others. In this notebook, we'll see how to use some useful libraries to fetch pretrained models and obtain some predictions.

### Table of Contents

#### 1. [NLP = Transformers](#NLP-%3d-Transformers)
#### 2. [Hugging Face 🤗](#Hugging-Face-🤗)
#### 3. [Task #1: Sentence similarity](#Task-#1:-Sentence-similarity)
#### 4. [Task #2: Named Entity Recognition (NER)](#Task-#2:-Named-Entity-Recognition-(NER))
#### 5. [Task #3: Text generation with GPT-2](#Task-#3:-Text-generation-with-GPT-2)

---

# NLP = Transformers

Transformers are a class of models, just like CNNs. Two features set them apart:

1. They make no assumption about the structure of the input representation. Whereas CNNs assume a 1D or 2D grid with spatial coherence, transformers work on sets.

2. They use self-attention to relate information from different elements of the set.

A more detailed description of transformers is available [here](https://jalammar.github.io/illustrated-transformer/).
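
To make the two points above concrete, here is a minimal sketch of self-attention in plain Python. It is a simplification, not the full architecture: we assume queries, keys and values are the inputs themselves (real transformers apply learned projections, use several attention heads, and add positional encodings to inject order information). The names `softmax` and `self_attention` are ours, chosen for illustration.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(elements):
    """Scaled dot-product self-attention with Q = K = V = the inputs (no learned weights)."""
    dim = len(elements[0])
    outputs = []
    for query in elements:
        # Compare this element with every element of the set (including itself)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in elements]
        weights = softmax(scores)
        # The output is a weighted average of all elements
        outputs.append([sum(w * value[j] for w, value in zip(weights, elements)) for j in range(dim)])
    return outputs


elements = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = self_attention(elements)

# Shuffling the inputs only shuffles the outputs: no order is assumed
order = [2, 0, 1]
shuffled_outputs = self_attention([elements[i] for i in order])
for i, row in zip(order, shuffled_outputs):
    assert all(math.isclose(a, b) for a, b in zip(row, outputs[i]))
```

The final loop checks that reordering the inputs merely reorders the outputs, which is what "working on sets" means in practice.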

<br>
<img src="figures/transformer.png" alt="" width="600px"/>
<br>

Today, the field of Natural Language Processing is heavily driven by transformer architectures. The following picture shows the leaderboard of the [GLUE benchmark](https://gluebenchmark.com/leaderboard), a well-known benchmark for analysing the natural language understanding capabilities of ML models.

<img src="figures/Glue-benchmark-top.png" alt="" width="600px"/>

The top of the ranking consists mainly of transformer architectures. The number of such models has exploded in the last few years:

<br>
<img src="https://www.researchgate.net/publication/342684048/figure/fig4/AS:909563557580800@1593868259667/The-Pre-trained-language-model-family.png" alt="Too many transformers" width="600px"/>
<br>

---

# Hugging Face 🤗

<img src="https://uptime-storage.s3.amazonaws.com/logos/d32f5c39b694f3e64d29fc2c9b988cdd.png" width="200px"/>

[Hugging Face](https://huggingface.co/) is an open-source provider of natural language processing (NLP) technologies. They created the `transformers` library, which contains powerful abstractions to train, fine-tune and test transformer models. This is useful considering the explosion of transformer architectures available.


### Setup

We need to install the `transformers` library:

In [None]:
!pip install transformers

The documentation for the `transformers` library is available [here](https://huggingface.co/transformers/). 

### Finding a pre-trained model 

Hugging Face provides a nice UI to search for models: https://huggingface.co/models

---
# Task #1: Sentence similarity

The model description is available [here](https://huggingface.co/sentence-transformers/paraphrase-xlm-r-multilingual-v1).

## Load the model & tokenizer

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-xlm-r-multilingual-v1')
model = AutoModel.from_pretrained('sentence-transformers/paraphrase-xlm-r-multilingual-v1')

In [None]:
sentences = ['I love transformers', 'The cat sat on the hat']
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
encoded_input

For more info on the tokenizer, see the [doc](https://huggingface.co/transformers/internal/tokenization_utils.html?highlight=tokenizer#transformers.tokenization_utils_base.PreTrainedTokenizerBase.__call__).

In [None]:
tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0])

In [None]:
model(**encoded_input)

## Get embeddings for sentences

In [None]:
# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    b, s, e = token_embeddings.shape  # (batch size, sequence length, embedding size)
    sentence_lengths = attention_mask.sum(dim=1).view(-1, 1)  # number of real (non-padding) tokens per sentence
    token_embeddings = token_embeddings * attention_mask.view(-1, s, 1)  # zero out padded tokens
    return token_embeddings.sum(dim=1) / (sentence_lengths.float() + 1e-9)


def get_embeddings(sentences, tokenizer, model):
    # Tokenize sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings (no gradients are needed for inference)
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform pooling. In this case, mean pooling.
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    return sentence_embeddings

In [None]:
sentences = ['This is an example sentence', 'This is a longer example sentence']
get_embeddings(sentences, tokenizer, model)
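
To see why `mean_pooling` needs the attention mask, here is a small framework-free sketch of the same computation on nested lists. The helper `masked_mean` and the toy numbers are ours, purely for illustration. The second "sentence" has two real tokens and one padded position, whose vector `[9.0, 9.0]` is an arbitrary stand-in for whatever the model outputs at a padded position.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring positions where the mask is 0."""
    pooled = []
    for tokens, mask in zip(token_embeddings, attention_mask):
        n_real = sum(mask)  # number of non-padding tokens
        dim = len(tokens[0])
        pooled.append([sum(t[j] * m for t, m in zip(tokens, mask)) / n_real for j in range(dim)])
    return pooled


# Toy batch: 2 sentences, 3 positions each, embedding size 2 (made-up values)
toy_embeddings = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],  # 3 real tokens
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # 2 real tokens + 1 padded position
]
toy_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

pooled = masked_mean(toy_embeddings, toy_mask)
print(pooled)  # [[3.0, 4.0], [2.0, 3.0]]

# Without the mask, the padded position would leak into the average
naive = [sum(t[0] for t in toy_embeddings[1]) / 3, sum(t[1] for t in toy_embeddings[1]) / 3]
print(naive)
```

With the mask, the second sentence is averaged over its two real tokens only; the naive average is pulled towards the padding vector.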

## Computing cosine-similarity

$$\cos(v_1, v_2) = \frac{\langle v_1, v_2 \rangle}{\|v_1\| \, \|v_2\|} \in [-1,1]$$

If $v_1$ and $v_2$ are very similar (the angle between them is small), then their cosine similarity is close to $1$.
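
As a quick sanity check of the formula, here is a small sketch on hand-picked 2D vectors, using only the standard library. The function name `cosine` is ours; it assumes both vectors are non-zero, so no epsilon is added to the denominator.

```python
import math


def cosine(v1, v2):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


print(cosine([1.0, 2.0], [2.0, 4.0]))    # same direction: close to 1
print(cosine([1.0, 0.0], [0.0, 1.0]))    # orthogonal: 0
print(cosine([1.0, 2.0], [-1.0, -2.0]))  # opposite direction: close to -1
```

Note that the first pair differs in length but not in direction: cosine similarity ignores the norm of the vectors.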

In [None]:
def cosine_sim(v1, v2):
    sim = (v1 * v2).sum() / (v1.norm() * v2.norm() + 1e-8)
    return sim

sentences = ['The cat ate the fish',
             'No feline would say no to tuna',
             'Fish populations are threatened by the fishing industry',
             'John bought an electric bike']
embs = get_embeddings(sentences, tokenizer, model)

print(f"Cosine-similarity between '{sentences[0]}' and '{sentences[1]}': {cosine_sim(embs[0], embs[1]):.2f}")
print(f"Cosine-similarity between '{sentences[0]}' and '{sentences[2]}': {cosine_sim(embs[0], embs[2]):.2f}")
print(f"Cosine-similarity between '{sentences[0]}' and '{sentences[3]}': {cosine_sim(embs[0], embs[3]):.2f}\n")

# Task #2: Named Entity Recognition (NER)

The model description is available [here](https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London).

## Load the model & tokenizer

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")

## Get annotations for a sentence

More information on pipelines is available [here](https://huggingface.co/transformers/main_classes/pipelines.html).

In [None]:
from transformers import pipeline

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "John loves Lausanne for its proximity to the lake and the great workshops at EPFL."

ner_results = nlp(example)
ner_results
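
The pipeline labels individual tokens, so a word split into several sub-word pieces shows up as several entries. Below is a minimal sketch of how such entries could be merged back into entity spans using character offsets. It makes several assumptions: each entry is a dict with `entity`, `start` and `end` keys, labels look like `I-PER` or `B-LOC`, and the `toy_results` list is hand-written for illustration (including the way "Lausanne" is split), not actual model output. The helper `group_entities` is ours; the library may offer a built-in way to do this, so check the pipeline documentation.

```python
def group_entities(token_results, text):
    """Merge adjacent tokens that share an entity type into a single span."""
    spans = []
    for tok in token_results:
        label = tok["entity"].split("-")[-1]       # "I-PER" -> "PER"
        starts_new = tok["entity"].startswith("B-")
        # Adjacent means contiguous (sub-word pieces) or separated by one character (a space)
        adjacent = bool(spans) and tok["start"] <= spans[-1]["end"] + 1
        if adjacent and not starts_new and spans[-1]["label"] == label:
            spans[-1]["end"] = tok["end"]           # extend the current span
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


example = "John loves Lausanne for its proximity to the lake and the great workshops at EPFL."
laus = example.index("Lausanne")
epfl = example.index("EPFL")

# Hand-written, hypothetical token-level results (not real model output)
toy_results = [
    {"entity": "I-PER", "start": 0, "end": 4},
    {"entity": "I-LOC", "start": laus, "end": laus + 2},      # hypothetical piece "La"
    {"entity": "I-LOC", "start": laus + 2, "end": laus + 8},  # hypothetical piece "usanne"
    {"entity": "I-ORG", "start": epfl, "end": epfl + 4},
]

grouped = group_entities(toy_results, example)
print([(span["label"], span["text"]) for span in grouped])
```

Using character offsets rather than the token strings avoids having to undo the tokenizer's sub-word markers by hand.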

# Task #3: Text generation with GPT-2

[Here](https://huggingface.co/gpt2) is the link to the model. 

In [None]:
# TODO find the code to load the model and generate some text

You can modify the parameters controlling text generation; check them out [here](https://huggingface.co/transformers/main_classes/model.html#transformers.generation_utils.GenerationMixin.generate).
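
To build some intuition for what generation parameters such as `temperature` and `top_k` do, here is a minimal standard-library sketch of one sampling step over toy scores. It is an illustration of the idea, not the library's implementation: the functions `softmax` and `sample_top_k` and the `toy_logits` values are ours, and a real model produces one score per vocabulary entry at every step.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert scores to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Sample the index of the next token among the k highest-scoring candidates."""
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in candidates], temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary

print(softmax(toy_logits, temperature=1.0))
print(softmax(toy_logits, temperature=0.5))  # more peaked on the best token

rng = random.Random(0)  # fixed seed so the run is repeatable
print([sample_top_k(toy_logits, k=2, temperature=1.0, rng=rng) for _ in range(10)])
```

With `k=1` the sampler always returns the highest-scoring token (greedy decoding); larger `k` and higher `temperature` make the output more varied.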