In [1]:
# Reveal.js
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
 'theme': 'white',
 'transition': 'none',
 'controls': 'false',
 'progress': 'true',
})

{'controls': 'false',
 'progress': 'true',
 'start_slideshow_at': 'selected',
 'theme': 'white',
 'transition': 'none'}

# Representing Words as Vectors

# Outline

* Representations of Words
 * Motivation
 * Sparse Binary Representations
 * Dense Continuous Representations
* Unsupervised Learning of Word Representations
 * Motivation
 * Sparse Co-occurrence Representations
 * Neural Word Representations
* Additional Reading

![Word representations visualised in two dimensions](../img/word_representations.svg)

## Why talk about representations? ##

* In machine learning, features are representations
* Better representations lead to better performance
* Representation learning ("deep learning") is trendy

## What makes a good representation? ##

1. Representations are **distinct**
2. **Similar** words have **similar** representations

## Formal Task ##

* Words: $w$
* Vocabulary: $\mathbb{V} (\forall_{i} w_{i} \in \mathbb{V})$
* Find representation function: $f(w_{i}) = r_{i}$

## Sparse Binary Representations ##


* Map words to unique positive integers
* $f_{id}(w) \mapsto \mathbb{N}^{*}$
* $g(w, i) = {\left\lbrace
 \begin{array}{ll}
 1 & \textrm{if }~i = f_{id}(w) \\
 0 & \textrm{otherwise} \\
 \end{array}\right.}$
* "One-hot" vector
* $f_{sb}(w) = (g(w, 1), \ldots, g(w, |\mathbb{V}|))$
* $f_{sb}(w) \mapsto \{0,1\}^{|\mathbb{V}|}$

## Sparse Binary Example ##

* $\mathbb{V} = \{\textrm{apple}, \textrm{orange}, \textrm{rabbit}\}$
* $f_{id}(\textrm{apple}) = 1, \ldots$, $f_{id}(\textrm{rabbit}) = 3$
* $f_{sb}(\textrm{apple}) = (1, 0, 0)$
* $f_{sb}(\textrm{orange}) = (0, 1, 0)$
* $f_{sb}(\textrm{rabbit}) = (0, 0, 1)$
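
The definitions of $f_{id}$, $g$ and $f_{sb}$ can be sketched in a few lines of Python. This is a minimal illustration for the three-word vocabulary only; the names `VOCABULARY`, `f_id`, `g` and `f_sb` are chosen here to mirror the notation, and the identifier of a word is assumed to be its position in the list.

```python
# Toy vocabulary from the example; the list order fixes the identifiers (an assumption).
VOCABULARY = ["apple", "orange", "rabbit"]


def f_id(word):
    """Map a word to its unique positive integer identifier (1-based)."""
    return VOCABULARY.index(word) + 1


def g(word, i):
    """Indicator: 1 if i is the identifier of word, 0 otherwise."""
    return 1 if i == f_id(word) else 0


def f_sb(word):
    """Sparse binary ("one-hot") representation of length |V|."""
    return tuple(g(word, i) for i in range(1, len(VOCABULARY) + 1))


for word in VOCABULARY:
    print(word, f_id(word), f_sb(word))
```

Each vector has exactly one non-zero entry, so its length grows with the vocabulary while the information it carries stays the same.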

## Sparse Binary Visualised ##

![Sparse binary representations visualised](../img/sparse_binary.svg)


## Cosine Similarity ##

* $cos(u, v) = \frac{u \cdot v}{||u|| ||v||}$
* $cos(u, v) \mapsto [-1, 1]$
* $cos(u, v) = 1$; same direction
* $cos(u, v) = -1$; opposite directions
* $cos(u, v) = 0$; orthogonal
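
As a rough sketch, the cosine formula translates directly into plain Python. The function name `cosine` and the test vectors are illustrative choices, and the function assumes neither vector is all zeros (the quotient is undefined in that case).

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)  # raises ZeroDivisionError for a zero vector


print(cosine((1.0, 2.0), (2.0, 4.0)))    # same direction, close to 1
print(cosine((1.0, 2.0), (-1.0, -2.0)))  # opposite direction, close to -1
print(cosine((1.0, 0.0), (0.0, 1.0)))    # orthogonal, 0
```

Note that the first pair differs in length but not in direction: cosine similarity ignores the magnitude of the vectors.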

## Cosine Similarity Visualised ##

![Cosine similarity visualisation](http://blog.christianperone.com/wp-content/uploads/2013/09/cosinesimilarityfq1.png)

Source: http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/

## Sparse Binary Similarities ##

* $cos(f_{sb}(\textrm{apple}), f_{sb}(\textrm{rabbit})) = 0$
* $cos(f_{sb}(\textrm{apple}), f_{sb}(\textrm{orange})) = 0$
* $cos(f_{sb}(\textrm{orange}), f_{sb}(\textrm{rabbit})) = 0$
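
These zeros are not specific to the example. For any two distinct words $w \neq w'$ the single $1$ sits at a different position in each vector, so every term of the dot product vanishes, while each vector has norm $1$:

$$cos(f_{sb}(w), f_{sb}(w')) = \frac{\sum_{i=1}^{|\mathbb{V}|} g(w, i)\, g(w', i)}{1 \cdot 1} = 0$$

Sparse binary representations are therefore distinct, but every pair of words is equally dissimilar.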

## Dense Continuous Representations ##


* $f_{id}(w) \mapsto \mathbb{N}^{*}$
* "Embed" words as matrix rows
* Dimensionality: $d$ (hyperparameter)
* $W \in \mathbb{R}^{|\mathbb{V}| \times d}$
* $f_{dc}(w) = W_{f_{id}(w), :}$
* $f_{dc}(w) \mapsto \mathbb{R}^{d}$

## Dense Continuous Example ##

* $\mathbb{V} = \{\textrm{apple}, \textrm{orange}, \textrm{rabbit}\}$
* $d = 2$
* $W \in \mathbb{R}^{3 \times 2}$
* $f_{id}(\textrm{apple}) = 1, \ldots, f_{id}(\textrm{rabbit}) = 3$
* $f_{dc}(\textrm{apple}) = (1.0, 1.0)$
* $f_{dc}(\textrm{orange}) = (0.9, 1.0)$
* $f_{dc}(\textrm{rabbit}) = (0.1, 0.5)$
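
The lookup $f_{dc}(w) = W_{f_{id}(w), :}$ is just row indexing. The sketch below uses the example values for $W$ and also checks that multiplying the one-hot vector by $W$ selects the same row; the helper name `one_hot_times_W` is made up for this illustration.

```python
VOCABULARY = ["apple", "orange", "rabbit"]

# Embedding matrix W with |V| = 3 rows and d = 2 columns (values from the example).
W = [
    [1.0, 1.0],  # apple
    [0.9, 1.0],  # orange
    [0.1, 0.5],  # rabbit
]


def f_id(word):
    """1-based identifier, assumed to be the position in VOCABULARY."""
    return VOCABULARY.index(word) + 1


def f_dc(word):
    """Dense representation: the row of W selected by the identifier."""
    return tuple(W[f_id(word) - 1])  # -1 because Python lists are 0-based


def one_hot_times_W(word):
    """The same lookup written as the vector-matrix product f_sb(w) W."""
    one_hot = [1 if i == f_id(word) else 0 for i in range(1, len(VOCABULARY) + 1)]
    return tuple(
        sum(one_hot[r] * W[r][c] for r in range(len(W))) for c in range(len(W[0]))
    )


for word in VOCABULARY:
    print(word, f_dc(word), one_hot_times_W(word))
```

In practice the row is indexed directly rather than multiplied out, since the product spends almost all of its work on zeros.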

## Dense Continuous Visualised ##

![Visualisation of dense continuous word representations](../img/dense_continuous.svg)

## Dense Continuous Similarities ##

* $cos(f_{dc}(\textrm{apple}),f_{dc}(\textrm{rabbit})) \approx 0.83$
* $cos(f_{dc}(\textrm{apple}),f_{dc}(\textrm{orange})) \approx 1.0$
* $cos(f_{dc}(\textrm{orange}),f_{dc}(\textrm{rabbit})) \approx 0.86$
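
As a worked check of the first value, inserting the example vectors into the cosine formula gives:

$$cos(f_{dc}(\textrm{apple}), f_{dc}(\textrm{rabbit})) = \frac{1.0 \cdot 0.1 + 1.0 \cdot 0.5}{\sqrt{1.0^{2} + 1.0^{2}} \, \sqrt{0.1^{2} + 0.5^{2}}} = \frac{0.6}{\sqrt{2} \, \sqrt{0.26}} \approx 0.83$$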

# Unsupervised Learning of Word Representations #

## Why not supervised? ##

![Size comparison between annotated and unannotated data](../img/annotated_vs_unannotated_data.svg)

* Also, inherent incompleteness...

## Linguistic Inspirations ##

* "Oculist and eye-doctor … occur in almost the same environments. … If $A$ and $B$ have almost identical environments we say that they are synonyms." – Zellig Harris (1954)
* "You shall know a word by the company it keeps." – John Rupert Firth (1957)
* Akin to "meaning is use" – Wittgenstein (1953)

## Sparse Co-occurrence Representations ##

## Co-occurrences ##

* Collected from a large collection of *raw* text

1. "…comparing an **apple** to an **orange**…"
2. "…an **apple** and **orange** from Florida…"
3. "…my **rabbit** is not shaped like an **orange**…" (yes, there is always **noise** in the data)


## Sparse Co-occurrence Representations ##

* The number of times words co-occur in a text collection
* $C \in \mathbb{N}^{|\mathbb{V}| \times |\mathbb{V}|}$
* $f_{id}(\textrm{apple}) = 1, \ldots, f_{id}(\textrm{rabbit}) = 3$
* $C = \begin{pmatrix}
 2 & 2 & 0 \\
 2 & 3 & 1 \\
 0 & 1 & 1 \\
 \end{pmatrix}$
* $f_{cs}(w) = C_{f_{id}(w), :}$
* $f_{cs}(w) \mapsto \mathbb{N}^{|\mathbb{V}|}$
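
One counting scheme that reproduces this matrix from the three snippets is sketched below. It is an assumption, since the slides do not spell out the window: two vocabulary words co-occur once for every snippet that contains both, and a diagonal entry counts the snippets that contain the word at all. The snippets are the ones quoted earlier with the ellipses and emphasis stripped.

```python
VOCABULARY = ["apple", "orange", "rabbit"]

SNIPPETS = [
    "comparing an apple to an orange",
    "an apple and orange from Florida",
    "my rabbit is not shaped like an orange",
]


def cooccurrence_matrix(snippets, vocabulary):
    """Count, for each pair of vocabulary words, the snippets containing both."""
    index = {word: i for i, word in enumerate(vocabulary)}
    counts = [[0] * len(vocabulary) for _ in vocabulary]
    for snippet in snippets:
        # Vocabulary words present in this snippet (each counted once).
        present = {token for token in snippet.lower().split() if token in index}
        for a in present:
            for b in present:
                counts[index[a]][index[b]] += 1
    return counts


C = cooccurrence_matrix(SNIPPETS, VOCABULARY)
for word, row in zip(VOCABULARY, C):
    print(word, row)  # each row is f_cs(word)
```

With a real corpus the context would more commonly be a sliding window of a few tokens, and the matrix would be stored sparsely because most word pairs never co-occur.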

## Sparse Co-occurrence Example ##

* $\mathbb{V} = \{\textrm{apple}, \textrm{orange}, \textrm{rabbit}\}$
* $f_{id}(\textrm{apple}) = 1, \ldots, f_{id}(\textrm{rabbit}) = 3$
* $f_{cs}(\textrm{apple}) = (2, 2, 0)$
* $f_{cs}(\textrm{orange}) = (2, 3, 1)$
* $f_{cs}(\textrm{rabbit}) = (0, 1, 1)$

## Sparse Co-occurrence Similarities ##

* $cos(f_{cs}(\textrm{apple}), f_{cs}(\textrm{rabbit})) \approx 0.50$
* $cos(f_{cs}(\textrm{apple}), f_{cs}(\textrm{orange})) \approx 0.94$
* $cos(f_{cs}(\textrm{orange}), f_{cs}(\textrm{rabbit})) \approx 0.76$

# Dense Co-occurrence Representations #

## Matrix Factorisation ##

![Matrix factorisation visualisation](../img/matrix_factorisation.svg)

* $C \in \mathbb{R}^{|\mathbb{V}| \times |\mathbb{V}|}$
* $U \in \mathbb{R}^{|\mathbb{V}| \times d}$
* $V \in \mathbb{R}^{d \times |\mathbb{V}|}$
* $C \approx UV$

## Dense Co-occurrence Representations ##

* Factorise $C$
* $U \approx \begin{pmatrix}
 -1.26 & 0.65 \\
 -1.72 & -0.24 \\
 -0.46 & -0.89 \\
 \end{pmatrix}$
* $f_{id}(\textrm{apple}) = 1, \ldots, f_{id}(\textrm{rabbit}) = 3$
* $f_{cd}(w) = U_{f_{id}(w), :}$
* $f_{cd}(w) \mapsto \mathbb{R}^{d}$
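
The slides do not say which factorisation produced $U$. One choice that reproduces the numbers, up to the sign of each column, is a rank-2 eigendecomposition of the symmetric matrix $C$ with each direction scaled by the square root of its eigenvalue, so that $V = U^{T}$. The sketch below does this with power iteration and deflation in plain Python to stay dependency-free; a library routine such as NumPy's `numpy.linalg.svd` would be the usual choice. The function names are illustrative.

```python
import math

C = [
    [2.0, 2.0, 0.0],
    [2.0, 3.0, 1.0],
    [0.0, 1.0, 1.0],
]


def matvec(m, x):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def top_eigenpair(m, iterations=200):
    """Power iteration: dominant eigenvalue and unit eigenvector of a
    symmetric positive semi-definite matrix."""
    x = [1.0] * len(m)
    for _ in range(iterations):
        y = matvec(m, x)
        norm = math.sqrt(sum(a * a for a in y))
        x = [a / norm for a in y]
    eigenvalue = sum(a * b for a, b in zip(x, matvec(m, x)))
    return eigenvalue, x


def factorise(m, d):
    """Return U (|V| x d) such that U U^T approximates m."""
    n = len(m)
    residual = [row[:] for row in m]
    columns = []
    for _ in range(d):
        eigenvalue, v = top_eigenpair(residual)
        columns.append([math.sqrt(eigenvalue) * a for a in v])
        # Deflation: subtract the component that was just extracted.
        residual = [
            [residual[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
            for i in range(n)
        ]
    return [list(row) for row in zip(*columns)]


U = factorise(C, d=2)
for row in U:
    print([round(value, 2) for value in row])
```

The signs of the columns are arbitrary and may differ from the slide; flipping a whole column leaves every cosine similarity between rows unchanged. For this particular $C$ one eigenvalue is zero, so $d = 2$ already reconstructs the matrix exactly.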

## Dense Co-occurrence Example ##

* $\mathbb{V} = \{\textrm{apple}, \textrm{orange}, \textrm{rabbit}\}$
* $f_{id}(\textrm{apple}) = 1, \ldots, f_{id}(\textrm{rabbit}) = 3$
* $f_{cd}(\textrm{apple}) = (-1.26, 0.65)$
* $f_{cd}(\textrm{orange}) = (-1.72, -0.24)$
* $f_{cd}(\textrm{rabbit}) = (-0.46, -0.89)$

## Dense Co-occurrence Visualised ##

![Dense co-occurrence representations visualised](../img/dense_cooccurences.svg)

## Dense Co-occurrence Similarities ##

* $cos(f_{cd}(\textrm{apple}), f_{cd}(\textrm{rabbit})) \approx 0.00$
* $cos(f_{cd}(\textrm{apple}), f_{cd}(\textrm{orange})) \approx 0.82$
* $cos(f_{cd}(\textrm{orange}), f_{cd}(\textrm{rabbit})) \approx 0.58$

# Neural Word Representations #

## Learning by Slot Filling ##

* "…I had some **_____** for breakfast today…"
* Good: *cereals*
* Bad: *airplanes*

## Unsupervised Loss Function ##

* $w \in \mathbb{V}$; $c \in \mathbb{V}$
* $D = ((c, w),\ldots)$; observed co-occurrences
* $D' = ((c, w),\ldots)$; "noise samples"
* $\textrm{max}~p((c, w) \in D | W) - p((c, w) \in D' | W)$


## Neural Skip-Gram Model ##

* $W^{w} \in \mathbb{R}^{|\mathbb{V}| \times d}$; $W^{c} \in \mathbb{R}^{|\mathbb{V}| \times d}$
* $D = ((c, w),\ldots)$; $D' = ((c, w),\ldots)$
* $\sigma(x) = \frac{1}{1 + \textrm{exp}(-x)}$
* $p((c, w) \in D | W^{w}, W^{c}) = \sigma(W^{c}_{f_{id}(c),:} \cdot W^{w}_{f_{id}(w),:})$
* $\arg\max\limits_{W^{w},W^{c}} \sum\limits_{(c,w) \in D} \log \sigma(W^{c}_{f_{id}(c),:} \cdot W^{w}_{f_{id}(w),:}) \\ + \sum\limits_{(c,w) \in D'} \log \sigma(-W^{c}_{f_{id}(c),:} \cdot W^{w}_{f_{id}(w),:})$
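
To make the objective concrete, here is a small sketch that evaluates it and improves it by stochastic gradient ascent. Everything about the setup is an assumption for illustration: the pair lists `D` and `D_noise` are made up, words are referred to by 0-based indices, and the learning rate, initialisation and number of passes are arbitrary. Real implementations such as word2vec differ in how noise pairs are drawn and in many other details.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective(W_w, W_c, D, D_noise):
    """Sum of log sigma(c . w) over D plus log sigma(-c . w) over D'."""
    positive = sum(math.log(sigmoid(dot(W_c[c], W_w[w]))) for c, w in D)
    negative = sum(math.log(sigmoid(-dot(W_c[c], W_w[w]))) for c, w in D_noise)
    return positive + negative


def ascent_step(W_w, W_c, D, D_noise, learning_rate=0.1):
    """One pass of stochastic gradient ascent over all pairs."""
    for pairs, label in ((D, 1.0), (D_noise, 0.0)):
        for c, w in pairs:
            # d/dx log sigma(x) = 1 - sigma(x); d/dx log sigma(-x) = -sigma(x)
            error = label - sigmoid(dot(W_c[c], W_w[w]))
            old_c = W_c[c][:]
            for k in range(len(old_c)):
                W_c[c][k] += learning_rate * error * W_w[w][k]
                W_w[w][k] += learning_rate * error * old_c[k]


random.seed(0)
NUM_WORDS, d = 3, 2  # apple = 0, orange = 1, rabbit = 2 (0-based list indices)
W_w = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(NUM_WORDS)]
W_c = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(NUM_WORDS)]

D = [(0, 1), (1, 0), (2, 1), (1, 2)]  # hypothetical observed (c, w) pairs
D_noise = [(0, 2), (2, 0)]            # hypothetical noise pairs

before = objective(W_w, W_c, D, D_noise)
for _ in range(100):
    ascent_step(W_w, W_c, D, D_noise)
after = objective(W_w, W_c, D, D_noise)
print(before, after)
```

The objective is a sum of log-probabilities, so it is never positive; training pushes it towards zero by raising the scores of observed pairs and lowering those of noise pairs.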


## Neural Representation ##

* Learned using [word2vec](https://code.google.com/p/word2vec/)
* Google News data, $\sim 100,000,000,000$ words
* $|\mathbb{V}| = 3,000,000$
* $d = 300$

## Neural Representation Example ##

* $f_{n}(\textrm{apple}) = (-0.06, -0.16, \ldots, 0.34)$
* $f_{n}(\textrm{orange}) = (-0.10, -0.18, \ldots, 0.08)$
* $f_{n}(\textrm{rabbit}) = (0.02, 0.11, \ldots, 0.11)$

## Neural Representation Similarities ##

* $cos(f_{n}(\textrm{apple}), f_{n}(\textrm{rabbit})) \approx 0.34$
* $cos(f_{n}(\textrm{apple}), f_{n}(\textrm{orange})) \approx 0.39$
* $cos(f_{n}(\textrm{orange}), f_{n}(\textrm{rabbit})) \approx 0.20$

## Neural Representations Visualised ##

![Word representations visualised in two dimensions](../img/word_representations.svg)

* Dimensionality reduction using [t-SNE](https://lvdmaaten.github.io/tsne/)

## Neural Representations Visualised (zoomed) ##

![Word representations visualised in two dimensions, zoomed in on a small cluster](../img/word_representations_zoom.svg)

* Dimensionality reduction using [t-SNE](https://lvdmaaten.github.io/tsne/)

## Word Representation Algebra ##

* $f_{n}(\textrm{king}) - f_{n}(\textrm{man}) + f_{n}(\textrm{woman}) \approx f_{n}(\textrm{queen})$
* $f_{n}(\textrm{Paris}) - f_{n}(\textrm{France}) + f_{n}(\textrm{Italy}) \approx f_{n}(\textrm{Rome})$
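
The "$\approx$" above is usually read as a nearest-neighbour search: compute the left-hand side and return the word whose vector has the highest cosine similarity to it, leaving out the three query words. The sketch below shows the mechanics only. The vectors are hand-made two-dimensional toy values, not word2vec output, and the function names are illustrative.

```python
import math

# Hand-made toy vectors (NOT learned): axis 0 ~ "royalty", axis 1 ~ "gender".
VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.1, 1.0),
    "woman": (0.1, -1.0),
    "apple": (-1.0, 0.1),
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Word closest to f(a) - f(b) + f(c), excluding the query words."""
    target = tuple(x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c]))
    candidates = [word for word in VECTORS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(target, VECTORS[word]))


print(analogy("king", "man", "woman"))  # prints "queen" for these toy vectors
```

Excluding the query words matters with real embeddings, where the nearest neighbour of the combined vector is often one of the inputs.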

# Beyond Word Representations #


![TreeRNN composing a simple sentence](../img/sashimi_socher_tree.svg)

* How do we compose phrases and sentences?
* Tree Recurrent Neural Networks ([Socher et al., 2010](http://www.socher.org/uploads/Main/2010SocherManningNg.pdf))
* Long Short-Term Memory Recurrent Neural Networks ([Hochreiter and Schmidhuber, 1997](http://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735))
* The latter is now **much** more popular.
* More about this on Friday with Pasquale!

# Summary #


* Moving from features to "representations"
* Representations limit what we can learn and our generalisation power
* Many ways to learn representations (there are more than what we covered)
* Neural representations
 * Most popular today
 * Only ones to capture analogies (why, we don't know...)
* Adding representations trained without supervision improves performance on supervised tasks (semi-supervised learning)

# Additional Reading #

* ["Word Representations: A Simple and General Method for Semi-Supervised Learning"](http://www.aclweb.org/anthology/P/P10/P10-1040.pdf) by Turian et al. (2010)
* ["Representation Learning: A Review and New Perspectives"](https://arxiv.org/abs/1206.5538) by Bengio et al. (2012)
* ["Linguistic Regularities in Continuous Space Word Representations"](http://www.aclweb.org/anthology/N/N13/N13-1090.pdf) by Mikolov et al. (2013a) ([video](http://techtalks.tv/talks/linguistic-regularities-in-continuous-space-word-representations/58471/))
* ["Distributed Representations of Words and Phrases and their Compositionality"](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality) by Mikolov et al. (2013b)
* ["word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method"](https://arxiv.org/abs/1402.3722) by Goldberg and Levy (2014)
* ["Neural Word Embedding as Implicit Matrix Factorization"](http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization) by Levy and Goldberg (2014)

# Potential MSc Projects #

## MSc Project Supervision ##

* "Great tradition" of the UCLMR group
* This year – so far – three projects turned into papers! (matters in industry too!)
* I intend to supervise up to two students
* We also have more members that supervise projects
* Tim will present project proposals next Monday

## Unsupervised Learning of Phrase/Sentence Representations ##

![Skip-Thought model visualisation](../img/skip-thought.svg)

* Can we apply the lessons of this lecture to phrases and sentences?
* Can we do so in a computationally efficient manner?
* Will similar patterns arise? If so, can we observe them?
* Can we detect abstractions and transfer them?

## Reading ##

* ["Skip-Thought Vectors"](https://arxiv.org/abs/1506.06726) by Kiros et al. (2015)
* ["Semi-supervised Sequence Learning"](https://papers.nips.cc/paper/5949-semi-supervised-sequence-learning) by Dai and Le (2015)

## Learning to Generate ##

![Visualisation of the idea of learning to generate data](../img/learning_to_generate.png)

* Use Reinforcement Learning to generate data that fits your training distribution.
* Can we enforce diversity in the generated data?
* What happens when training data is abundant?

### Reading ###

* ["Learning to Generate Textual Data"](http://www.aclweb.org/anthology/D/D16/D16-1167.pdf) by Bouchard et al. (2016)
* ["Data Programming: Creating Large Training Sets, Quickly"](http://papers.nips.cc/paper/6523-data-programming-creating-large-training-sets-quickly) by Ratner et al. (2016)

# Thank you for your attention #

### ご清聴ありがとうございました ###

### Tack för er uppmärksamhet ###