---
layout: post
title: "Learn about N-Grams Interactively!"
date: 2022-02-20 09:12:17 +0330
categories: ml
---

# Learn N-Grams Interactively!

## The Problem of Encoding

Let's face it, if you go deep into it, a computer is really dumb. A computer's software is highly dependent on its hardware. An 8-bit microprocessor like the MOS 6502 can be used to build a computer like the Apple II, or it can be used to build a small game board whose only means of output is a tiny LED screen. From the mid-70s to the late 90s, many programmers coded in the Assembly language of the microprocessor they were coding for. And when you code in Assembly, one thing becomes clear to you:

> The only thing that a computer understands is digits!

So, when we feed a text to a computer and tell it 'Classify this with this model!', if you just hand it the text encoded in UTF-8, as a series of 8-bit octets, it will tell you 'W0t?'

That's true. Most modern text is encoded in UTF-8, Unicode's official codec for characters. UTF-8 uses sequences of one to four octets of 8 bits each, which the peasantry call bytes (which is actually kinda not true, because a byte can be 6 or 12 bits depending on the architecture of your computer; a byte only came to be synonymous with an octet because IBM PCs used Intel chips, which treat a byte as an 8-bit unit), to encode all the possible characters in the Unicode space. And oh, baby, Unicode's LARGE! So you can't just say, ok, let's do label encoding, let's say the letter `a` is 0 and the emoji 🎉 is 1229 --- why not? Because god only knows how many characters are in a large body of text, plus, you will have to account for characters which will be inputted by the end user. Basically, you have to account for a gazillion labels!

So what do we do? We could, of course, use one-hot encoding instead of label encoding, but then we would be dealing with an array the size of the Unicode space for each character instead of just an integer. Or, like a good boy, we could take the text into a latent space.

## But what is this Latent Space?

Do you know what a vector is? Well, I bet you do. But let me explain. A vector is an entity that has a size and a direction in the Cartesian coordinate system. Say, par example, this vector:

$$ \vec{v} = \begin{bmatrix} 2 \\ 3 \end{bmatrix} $$

Ok, buddy, so what? Now imagine this vector holds attributes instead of coordinates. These attributes can be anything we wish them to be. But what makes such vectors a so-called *latent space* is when they represent a text, encoded in such a way that the representation is considerably smaller than the actual text. Imagine this piece of Python code:

```python
my_text = "This is my text! I love it!"
print(hash(my_text))
```

That would print a long number, a hash. A hash, in its own unique way, is a latent space. A non-reversible latent space, however. A vector is a reversible latent space. You can give a large text, maybe the text of The Count of Monte Cristo or the Prayers of the Great Shield, to such an encoder, and it will output a small representation of the text. You can always reverse it back. But you can also do stuff with it! A lot of stuff! Know what I'm saying?

This latent space is often called an *embedding* --- and, no, my dear Crypto Bro, the latent space is still too large to be used in a smart contract!

## Embeddings, from TF-IDF, to Bag of Words, to BERT, to...

One simple embedding vector is TF-IDF. It's so simple, Scikit-Learn can do it. TF-IDF stands for **Term Frequency times Inverse Document Frequency**. Term frequencies are the counts of each word in a document, and the "inverse document frequency" part means you divide each of these word counts by the number of documents the word occurs in. It's widely used in shallow ML.
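To make that a bit more concrete, here's a minimal sketch of computing TF-IDF vectors with Scikit-Learn's `TfidfVectorizer` (assuming a reasonably recent Scikit-Learn; the three toy documents are made up for illustration):

```python
# A minimal TF-IDF sketch with scikit-learn; the toy documents are made up.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "I scream, you scream, we all scream for ice cream!",
    "I love Chicago deep dish pizza!",
    "New York style pizza is also good.",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)     # sparse matrix: one row per document

print(vectorizer.get_feature_names_out())  # the learned vocabulary, one column per word
print(tfidf.toarray().round(2))            # each row is a document's TF-IDF vector
```

Each row of that matrix is a document squeezed into a (very shallow) latent space.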
A classical "Bag of Words" vector is just a vector of word counts or frequencies. It's the simplest form of latent space. On the deep side of the spectrum we have embeddings that are ACTUAL latent spaces, such as BERT encodings. BERT is a neural network developed by Google that takes your text into a latent space made up of many attributes, not just frequency. BERT has an 'attention' mechanism, so it pays attention to the relevance of each word and, as a result, to the redundancy. Like it knows that in this sentence:

> I scream, you scream, we all scream for ice cream!

there are some redundancies, so it factors them in. But how does it do that? How does it know that `you scream` and `I scream` and `we all scream` are all single units made up of smaller units?

# Bag of n-grams

So, now that we know well and good what a latent space is --- let us talk about them grams!

## What is an n-gram?

An n-gram is basically a 1-tuple, 2-tuple, 3-tuple, or longer tuple of words that represents a single meaningful unit in computational linguistics. n-grams belong to computational semantics, not grammar, despite what their name might suggest.

> Note: the n-grams we consider in this piece are made of words, but they are often used for characters as well.

n-grams come into play when we are tokenizing a sentence. We can tokenize a sentence into 1-grams, 2-grams, 3-grams or more. But at some point the n-grams become so sparse that it doesn't make computational sense to go any further.

> Tokenization is basically the splitting of a sentence into meaningful units.

So, let's say we want to split the famous sentence "I scream, you scream, we all scream for ice cream" into 1-grams. That would simply be the individual words: `I`, `scream`, `you`, `scream`, `we`, `all`, `scream`, `for`, `ice`, `cream`.

Now let's get all the 2-, 3-, 4- and 5-grams for a more complex text, shown below:

```
I love Chicago deep dish pizza! New York style pizza is also good.
San Francisco pizza, it can be very good
```

After removing the stopwords and punctuation and using this code:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams

nltk.download('stopwords')
sw_set = list(set(stopwords.words('english')))

sent = '''
I love Chicago deep dish pizza! New York style pizza is also good.
San Francisco pizza, it can be very good
'''

sent = sent.lower()

# Strip punctuation: remove any non-word character that isn't whitespace
sent_rem = re.sub(r"(?!\s)\W", "", sent)
print("Sentence without punctuation is: ", sent_rem)

# Drop the stopwords
sent_wo_sw = ' '.join([w for w in sent_rem.split(" ") if w.lower() not in sw_set])
print("W/O stopwords: ", sent_wo_sw)

# Note: wrapping the tokens in set() deduplicates them and loses the sentence order
sent_split = set(re.split(r"\s+", sent_wo_sw))

print("2-grams are: ")
print('\n'.join([', '.join(ng) for ng in ngrams(sent_split, 2)]))
print("\n\n")

print("3-grams are: ")
print('\n'.join([', '.join(ng) for ng in ngrams(sent_split, 3)]))
print("\n\n")

print("4-grams are: ")
print('\n'.join([', '.join(ng) for ng in ngrams(sent_split, 4)]))
print("\n\n")

print("5-grams are: ")
print('\n'.join([', '.join(ng) for ng in ngrams(sent_split, 5)]))
```
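A quick aside before we look at the output. Under the hood, `ngrams()` is nothing fancier than a sliding window over a sequence of tokens. Also note that the script above hands `ngrams()` a `set()` of tokens, which deduplicates the words and throws away the sentence order, which is why the output below looks shuffled. A minimal, order-preserving pure-Python sketch (a hypothetical helper, just for illustration, and not what produced the output below) would be:

```python
# A minimal, order-preserving n-gram sketch (hypothetical helper); plain Python, no NLTK needed.
def my_ngrams(tokens, n):
    """Slide a window of size n over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i scream you scream we all scream for ice cream".split()
print(my_ngrams(tokens, 2))  # [('i', 'scream'), ('scream', 'you'), ...]
print(my_ngrams(tokens, 3))  # [('i', 'scream', 'you'), ('scream', 'you', 'scream'), ...]
```

With that aside out of the way, here is what the NLTK script prints: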
```
[nltk_data] Downloading package stopwords to /home/chubak/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Sentence without punctuation is:
i love chicago deep dish pizza new york style pizza is also good
san francisco pizza it can be very good

W/O stopwords:
i love chicago deep dish pizza new york style pizza also good
san francisco pizza good

2-grams are: 
, francisco
francisco, san
san, new
new, york
york, pizza
pizza, style
style, dish
dish, also
also, deep
deep, love
love, good
good, chicago
chicago, i


3-grams are: 
, francisco, san
francisco, san, new
san, new, york
new, york, pizza
york, pizza, style
pizza, style, dish
style, dish, also
dish, also, deep
also, deep, love
deep, love, good
love, good, chicago
good, chicago, i


4-grams are: 
, francisco, san, new
francisco, san, new, york
san, new, york, pizza
new, york, pizza, style
york, pizza, style, dish
pizza, style, dish, also
style, dish, also, deep
dish, also, deep, love
also, deep, love, good
deep, love, good, chicago
love, good, chicago, i


5-grams are: 
, francisco, san, new, york
francisco, san, new, york, pizza
san, new, york, pizza, style
new, york, pizza, style, dish
york, pizza, style, dish, also
pizza, style, dish, also, deep
style, dish, also, deep, love
dish, also, deep, love, good
also, deep, love, good, chicago
deep, love, good, chicago, i
```

As we go further, a formula starts taking shape in our minds:

$$ N_{\text{n-grams}} = \vert S \vert - n + 1 $$

Where $$ \vert S \vert $$ is the size of the sentence expressed in words (here, the number of unique tokens that went into `ngrams`). Basically, the number of n-grams is the number of words, minus n, plus one. You can check it against the output above: the 14 unique tokens give 13 2-grams, 12 3-grams, 11 4-grams and 10 5-grams.

Use the controls below to generate n-grams for any sentence you want. This will give you a much clearer idea of what they are.
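And if you'd rather check that little count formula in code than by hand, a quick sanity check (a hypothetical snippet, using the same NLTK helper on the ice-cream sentence) could be:

```python
# Sanity-check the n-gram count formula: len(tokens) - n + 1
from nltk.util import ngrams

tokens = "i scream you scream we all scream for ice cream".split()

for n in range(1, 6):
    count = len(list(ngrams(tokens, n)))
    assert count == len(tokens) - n + 1
    print(n, count)  # 1 10, 2 9, 3 8, 4 7, 5 6
```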