{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Assister Discovery\n", "\n", "`Assister discovery` is software that maps user requests to executable commands in the [Assister Pipeline](https://github.com/keyvan-m-sadeghi/assister/tree/assister-conception/rfcs/text/assister-conception#assister-pipeline), illustrated in the following figure. Using [Natural Language Understanding](https://en.wikipedia.org/wiki/Natural-language_understanding) coupled with Machine Learning over the [Terms and Functions](https://github.com/keyvan-m-sadeghi/assister/tree/assister-conception/rfcs/text/assister-conception#terms-and-functions-language-tfx) contextual annotations embedded in an application, a discovery can translate a request to the corresponding command. The command will then be executed within the Pipeline. [[Link to the full proposal]](https://github.com/keyvan-m-sadeghi/assister/tree/assister-conception/rfcs/text/assister-conception)\n", "\n", "![Assister_pipeline](img/pipeline.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem Statement and Solution\n", "\n", "The research task corresponding to this problem at Assister can be formally defined as:\n", "> `How can we map a user request in natural language to a pre-defined executable command?`\n", "\n", "If we had many labeled mappings from user sentences to the requested commands, we would choose a [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning) approach. This is not a realistic assumption at the current stage, although we do know the possible commands (classes) that can be selected for execution in each context. So, we propose an [unsupervised](https://en.wikipedia.org/wiki/Unsupervised_learning) methodology to tackle this problem, as follows:\n", "1. Representation learning of user requests via [word embeddings](https://en.wikipedia.org/wiki/Word_embedding).\n", "2. Finding the best embedding of each command description.\n", "3. 
Mapping a request to the command at the shortest [distance](https://en.wikipedia.org/wiki/Distance) in the embedding space, based on a [similarity measure](https://en.wikipedia.org/wiki/Similarity_measure).\n", "\n", "The word/sentence-level embedding algorithm is clearly the key part of the solution. So, we first review the literature for this [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing) (NLP) task. Then, we describe the two recent state-of-the-art embedding models and how Assister utilizes them to embed the requests and commands. Finally, we explain the common similarity measures and pick a suitable one for Assister to calculate the distance between any pair of embeddings. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word Embeddings\n", "\n", "The year 2018 was an inflection point for NLP, after a couple of relatively quiet years. Word embeddings are one of the most popular representations of a document's [vocabulary](https://en.wikipedia.org/wiki/Vocabulary), capable of capturing a word's [context](https://en.wikipedia.org/wiki/Context_(language_use)) in a document, syntactic and semantic similarity, relations between different words, etc. More formally, an embedding is a low-dimensional representation of a data point (sample) that originally lives in a higher-dimensional [vector space](https://en.wikipedia.org/wiki/Vector_space). In the same way, word embeddings are dense vectorized representations of words in a low-dimensional space. The first neural network-based word embedding model was proposed by Google in 2013 [[1]](#References). Since then, word embeddings have been used in almost every practical NLP model. 
Word embeddings are very effective because, by translating a word to an embedding, one can model the meaning of the word in numeric form and thus perform many [mathematical operations](https://en.wikipedia.org/wiki/Linear_algebra) on it. To make this clearer, let's take a look at a common example in the literature:\n", "> Let $\\phi$ be a word embedding mapping $W \\rightarrow \\mathbb{R}^n$, where $W$ is the word space and $\\mathbb{R}^n$ is an $n$-dimensional vector space, then we have:\n", ">\n", "> $\\phi(''king'') - \\phi(''man'') + \\phi(''woman'') \\approx \\phi(''queen'')$\n", "\n", "This property was first demonstrated by the [word2vec](https://code.google.com/archive/p/word2vec/) [[2]](#References) model in 2013, which was a great breakthrough. Another fascinating word embedding model was [GloVe](https://nlp.stanford.edu/projects/glove/) [[3]](#References) in 2014. Although these two models are powerful, they are __context-free__: a single embedding is generated for each word in the vocabulary. So, `bank` would have the same representation in `bank deposit` and `river bank`. Instead, __contextual__ models - including [Semi-supervised Sequence Learning](https://arxiv.org/abs/1511.01432) (2015) [[4]](#References), [Generative Pre-Training](https://openai.com/blog/language-unsupervised/) (2018) [[5]](#References), [ELMo](https://allennlp.org/elmo) (2018) [[6]](#References), [ULMFiT](http://nlp.fast.ai/classification/2018/05/15/introducting-ulmfit.html) (2018) [[7]](#References) - generate a representation of each word based on the other words in the sentence, so they can capture both a static semantic meaning and a contextualized meaning. For instance, the word `apple` has different meanings in the two sentences `I like apples` and `I like Apple macbooks`, so a contextual model gives it a different vector representation in each, which makes such models more powerful for NLP tasks. 
The two recent state-of-the-art models - [USE](https://ai.google/research/pubs/pub46808) (2018) [[8]](#References) and [BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html) (2018) [[9]](#References) - use a powerful sequence transduction model for language understanding, called the [Transformer](https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html) (2017) [[10]](#References). We first review the Transformer model, then describe how the USE and BERT models use the Transformer as a building block." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transformer\n", "\n", "Today's NLP world benefits from recent advances in [Deep Learning](https://en.wikipedia.org/wiki/Deep_learning) research. More specifically, Google introduced a novel neural network architecture, called the Transformer, in a seminal paper [[10]](#References); it outperformed many traditional [Recurrent Neural Network](https://en.wikipedia.org/wiki/Recurrent_neural_network) (RNN) sequence models (like [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory) and [GRU](https://en.wikipedia.org/wiki/Gated_recurrent_unit)). 
The main advantages of using the Transformer as a language understanding unit are that (1) it can effectively model the long-term dependencies among words in a word sequence; and (2) its [model training](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets) phase is efficient, because it eliminates the sequential dependency on previous words [[10]](#References).\n", "\n", "A transformer is an [encoder-decoder architecture](https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346) model that uses [attention mechanisms](https://skymind.ai/wiki/attention-mechanism-memory-network) to forward a complex pattern of the whole sequence to the decoder at once, rather than sequentially, as depicted in the following figure [[source](http://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/)]:\n", "\n", "![Transformer](img/attention_path_length.png)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Universal Sentence Encoder (USE)\n", "\n", "The USE uses a transformer to make obtaining sentence-level embeddings as easy as it has historically been to look up the embeddings for individual words, e.g. with word2vec. The Universal Sentence Encoder is a model that encodes a text into a 512-dimensional embedding. The resulting embeddings can then be used as inputs to NLP tasks such as [sentiment classification](https://en.wikipedia.org/wiki/Sentiment_analysis) and [textual similarity](https://en.wikipedia.org/wiki/Semantic_similarity) analysis. Pre-trained word embeddings are considered to be an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch. A USE model that has been pre-trained and optimized on a variety of data sources is publicly available on [TensorFlow Hub](https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb#scrollTo=RUymE2l9GZfO). 
This module is about 800MB and, depending on your network speed, it might take a while to load the first time you instantiate it. After that, loading the module should be faster, as modules are [cached](https://www.tensorflow.org/hub/basics) by default. Now, we use the Universal Sentence Encoder's TF Hub module to compute a representation (embedding) for user requests and command descriptions in an online spreadsheet application, on the [TensorFlow platform](https://www.tensorflow.org/)." ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Saver not created because there are no variables in the graph to restore\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "I0426 22:25:57.912678 140734900516288 saver.py:1483] Saver not created because there are no variables in the graph to restore\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Sentence: Format cell A12 as date\n", "Embedding size: 512\n", "Embedding: [-0.050871364772319794, -0.0016370579833164811, 0.022183720022439957, ...]\n", "\n", "Sentence: Sum the values in column B and store the result in cell C7\n", "Embedding size: 512\n", "Embedding: [-0.03197444975376129, 0.02532351016998291, 0.010689766146242619, ...]\n", "\n", "Sentence: Delete cell D20\n", "Embedding size: 512\n", "Embedding: [-0.038657110184431076, 0.049386851489543915, 0.000588558497838676, ...]\n", "\n", "Sentence: Delete row 15\n", "Embedding size: 512\n", "Embedding: [-0.04079856723546982, 0.020178884267807007, -0.03739866614341736, ...]\n", "\n", "Sentence: Select cells from B16 to E19\n", "Embedding size: 512\n", "Embedding: [-0.024356713518500328, 0.029383739456534386, 0.012827333062887192, ...]\n", "\n", "Sentence: To Format a cell, you need a cell, like B10, and a type, like date\n", "Embedding size: 512\n", "Embedding: [-0.032455652952194214, 0.04132832959294319, 0.012323660776019096, ...]\n", "\n", "Sentence: To 
sum a column, you need a column, like C, and a cell, like A5, to store the result\n", "Embedding size: 512\n", "Embedding: [-0.05170544236898422, 0.02925257198512554, -0.007931080646812916, ...]\n", "\n", "Sentence: To delete a cell, you need a cell, like C11\n", "Embedding size: 512\n", "Embedding: [0.03696422278881073, 0.019425714388489723, 0.0004833970742765814, ...]\n", "\n", "Sentence: To delete a row, you need a row, like 12\n", "Embedding size: 512\n", "Embedding: [-0.07475815713405609, 0.038178637623786926, -0.04855307564139366, ...]\n", "\n", "Sentence: To select cells, you need a start cell, like B7, and an end cell, like D17\n", "Embedding size: 512\n", "Embedding: [-0.02912534400820732, 0.047603789716959, 0.00011833444295916706, ...]\n", "\n" ] } ], "source": [ "import tensorflow as tf\n", "import tensorflow_hub as hub\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import re\n", "import os\n", "\n", "# Import the USE's TF Hub module\n", "module_url = \"https://tfhub.dev/google/universal-sentence-encoder-large/3\"\n", "# Another pre-trained model is \"https://tfhub.dev/google/universal-sentence-encoder/2\"\n", "module = hub.Module(module_url)\n", "\n", "# Compute embeddings for sentences (either of user requests or command descriptions)\n", "requests = [\"Format cell A12 as date\",\n", " \"Sum the values in column B and store the result in cell C7\",\n", " \"Delete cell D20\",\n", " \"Delete row 15\",\n", " \"Select cells from B16 to E19\"]\n", "commands = [\"To Format a cell, you need a cell, like B10, and a type, like date\",\n", " \"To sum a column, you need a column, like C, and a cell, like A5, to store the result\",\n", " \"To delete a cell, you need a cell, like C11\",\n", " \"To delete a row, you need a row, like 12\",\n", " \"To select cells, you need a start cell, like B7, and an end cell, like D17\"]\n", "sentences = requests + commands\n", "\n", "# Run the embeddings in a 
TensorFlow session\n", "with tf.Session() as session:\n", "    session.run([tf.global_variables_initializer(), tf.tables_initializer()])\n", "    sentence_embeddings = session.run(module(sentences))\n", "    for i, sentence_embedding in enumerate(np.array(sentence_embeddings).tolist()):\n", "        print(\"Sentence: {}\".format(sentences[i]))\n", "        print(\"Embedding size: {}\".format(len(sentence_embedding)))\n", "        embedding_short = \", \".join((str(x) for x in sentence_embedding[:3]))\n", "        print(\"Embedding: [{}, ...]\\n\".format(embedding_short))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Semantic Similarity between Requests and Commands via the USE\n", "\n", "The embeddings produced by the Universal Sentence Encoder are approximately normalized. The [semantic similarity](https://en.wikipedia.org/wiki/Semantic_similarity) between any pair of user requests and command descriptions can therefore be computed simply as the [inner product](https://en.wikipedia.org/wiki/Inner_product_space) of their encodings, which is an informative analysis for Assister discovery. The inner product is a suitable similarity measure in our case, as it satisfies three well-known axioms: [Conjugate](https://en.wikipedia.org/wiki/Complex_conjugate) symmetry, [Linearity](https://en.wikipedia.org/wiki/Linearity#In_mathematics), and [Positive-definiteness](https://en.wikipedia.org/wiki/Definite_quadratic_form#Associated_symmetric_bilinear_form). For approximately unit-length vectors, it is also close to the cosine similarity."
] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [], "source": [ "def plot_similarity(labels, embeddings, rotation):\n", "    # Rows are user requests, columns are command descriptions\n", "    inner = np.inner(embeddings[:len(requests)], embeddings[len(requests):])\n", "    sns.set(font_scale=1.2)\n", "    g = sns.heatmap(\n", "        inner,\n", "        annot=True,\n", "        xticklabels=labels[len(requests):],\n", "        yticklabels=labels[:len(requests)],\n", "        vmin=0,\n", "        vmax=1,\n", "        cmap=\"YlGnBu\")\n", "    g.set_xticklabels(labels[len(requests):], rotation=rotation)\n", "    g.set_yticklabels(labels[:len(requests)])\n", "    g.set_title(\"Semantic Similarity based on Inner Product of Embeddings\")\n", "\n", "def run_and_plot(session_, input_tensor_, messages_, encoding_tensor):\n", "    message_embeddings_ = session_.run(encoding_tensor, feed_dict={input_tensor_: messages_})\n", "    plot_similarity(messages_, message_embeddings_, 90)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we plot the similarity in a heat map. It is a matrix with `number of requests` rows and `number of commands` columns, where each entry $[i, j]$ is colored based on the inner product of the embeddings for user request $i$ and command description $j$. In this test, we use the aforementioned examples in the requests and commands lists, which are matched one-to-one for simplicity. The higher a score in a row (request), the more similar the corresponding command description is to that request. For each user request (a row), we expect the highest similarity score for its related command, which in our case lies on the diagonal of the heat map. For example, the second request (2nd row), about column summation, is most similar to the second command description, with similarity score $[2, 2] = 0.9$."
] }, { "cell_type": "code", "execution_count": 145, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Saver not created because there are no variables in the graph to restore\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "I0426 22:26:58.279740 140734900516288 saver.py:1483] Saver not created because there are no variables in the graph to restore\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "similarity_input_placeholder = tf.placeholder(tf.string, shape=(None))\n", "similarity_message_encodings = module(similarity_input_placeholder)\n", "with tf.Session() as session:\n", "    session.run(tf.global_variables_initializer())\n", "    session.run(tf.tables_initializer())\n", "    run_and_plot(session, similarity_input_placeholder, sentences, similarity_message_encodings)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bidirectional Encoder Representations from Transformers (BERT)\n", "\n", "[BERT](https://github.com/google-research/bert) [[9]](#References) is a method of pre-training language representations in which one can train a general-purpose __language understanding__ model on a relatively large text corpus (like Wikipedia), then use the trained model for downstream NLP tasks. BERT is the first _unsupervised_, _deeply bidirectional contextual_ system for pre-training language representations, as opposed to context-free models (like [word2vec](https://code.google.com/archive/p/word2vec/) and [GloVe](https://nlp.stanford.edu/projects/glove/)) and unidirectional/shallowly bidirectional contextual models (like [Semi-supervised Sequence Learning](https://arxiv.org/abs/1511.01432), [Generative Pre-Training](https://openai.com/blog/language-unsupervised/), [ELMo](https://allennlp.org/elmo), [ULMFiT](http://nlp.fast.ai/classification/2018/05/15/introducting-ulmfit.html)). 
For example, in the sentence `I made a bank deposit` the unidirectional representation of `bank` is only based on `I made a` but not `deposit`. BERT uses a deep bidirectional Transformer to represent `bank` based on both its left and right context - `I made a ... deposit`. To keep a word from indirectly __seeing itself__ through this bidirectional context, Google's BERT employs [__masked language modeling__](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270), in which it hides 15\\% of the words and predicts them from the surrounding context and their positions. A drawback of this methodology is a longer convergence time, because only the masked words are predicted; nevertheless, BERT outperforms the earlier state-of-the-art models even before it converges. \n", "\n", "BERT comes in two model sizes, characterized by the number of layers (i.e. Transformer blocks) $L$, the hidden size $H$, and the number of self-attention heads $A$:\n", "\n", "- $\\text{BERT}_{\\text{BASE}}$: $L=12$, $H=768$, $A=12$, Total parameters $=110M$\n", "- $\\text{BERT}_{\\text{LARGE}}$: $L=24$, $H=1024$, $A=16$, Total parameters $=340M$\n", "\n", "A very high-level architecture of $\\text{BERT}_{\\text{BASE}}$ looks like this:\n", "\n", "![BERT](img/bert.png)\n", "\n", "Each encoder block encapsulates a sophisticated model architecture. A visual representation of BERT input is as follows:\n", "\n", "![BERT_Input](img/bert_input.png)\n", "\n", "\n", "The four embedding layers are [[9]](#References):\n", "\n", "- The __input layer__ is the vector of the sequence tokens along with the special tokens, i.e. _[CLS]_ (first token), _[SEP]_ (sequence delimiter), and _[MASK]_ (masked words).\n", "- __Token embeddings__ are looked up from the vocabulary ID of each token.\n", "- __Sequence Embeddings__ are simple indicators that distinguish between the sentences of the input.\n", "- __Transformer Positional Embeddings__ specify the position of each word in the sequence.\n", "\n", "Now, we use the BERT module to compute a representation (embedding) for user requests and command descriptions in our online spreadsheet application. 
Good implementations of pre-trained BERT contextualized word embeddings are available on Google Colab for both the TensorFlow platform [(__here__)](https://colab.research.google.com/drive/1RhmL0BqNe52FEbdSyLpkfVuCZxE7b5ke#forceEdit=true&offline=true&sandboxMode=true) and the [Keras](https://keras.io/) platform [(__here__)](https://colab.research.google.com/gist/HighCWu/3a02dc497593f8bbe4785e63be99c0c3/bert-keras-tutorial.ipynb#scrollTo=o_4yp35FuZib). We tested our example with these implementations and obtained results similar to those of the USE." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# References\n", "\n", "[1] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. and Dean, J., Distributed Representations of Words and Phrases and their Compositionality (2013)\n", "\n", "[2] Mikolov, T., Chen, K., Corrado, G. and Dean, J., Efficient Estimation of Word Representations in Vector Space (2013)\n", "\n", "[3] Pennington, J., Socher, R. and Manning, C., GloVe: Global Vectors for Word Representation (2014)\n", "\n", "[4] Dai, A.M. and Le, Q.V., Semi-supervised Sequence Learning (2015)\n", "\n", "[5] Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I., Improving Language Understanding by Generative Pre-Training (2018)\n", "\n", "[6] Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K. and Zettlemoyer, L., Deep Contextualized Word Representations (2018)\n", "\n", "[7] Howard, J. and Ruder, S., Universal Language Model Fine-tuning for Text Classification (2018)\n", "\n", "[8] Cer, D., Yang, Y., Kong, S.Y., Hua, N., Limtiaco, N., John, R.S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C. and Sung, Y.H., Universal Sentence Encoder (2018)\n", "\n", "[9] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)\n", "\n", "[10] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L. and Polosukhin, I.,
Attention Is All You Need (2017)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }