## Introduction to text analysis

This notebook introduces how to analyse text to identify topic trends in text corpora.

[Scikit-learn](https://scikit-learn.org/) is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.



### Setting things up

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import pickle
import re
import os
from pathlib import Path
import requests
from collections import Counter
import matplotlib.pyplot as plt
from numpy import mean, ones
from scipy.sparse import csr_matrix
from nltk.corpus import stopwords

### CountVectorizer converts a collection of text documents to a matrix of token counts
* *max_df*: when building the vocabulary, ignore terms whose document frequency is strictly higher than the given threshold.
* *ngram_range*: (1, 2) includes n-grams of 1 and 2 words; (2, 2) includes only n-grams of 2 words.
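
The effect of these two parameters can be sketched on a small corpus. The three sentences and the variable names below are hypothetical and serve only as an illustration; `max_df=0.9` is an assumed threshold meaning "ignore terms found in more than 90% of the documents".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus, used only for this illustration.
toy_corpus = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
]

# Unigrams and bigrams, no document-frequency filtering.
all_terms = CountVectorizer(ngram_range=(1, 2)).fit(toy_corpus)

# Bigrams only.
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(toy_corpus)

# Unigrams only, ignoring terms found in more than 90% of the documents.
# "the" occurs in all three documents, so it is left out of the vocabulary.
rare_terms = CountVectorizer(max_df=0.9).fit(toy_corpus)

print(sorted(all_terms.vocabulary_))
print(sorted(bigrams.vocabulary_))
print(sorted(rare_terms.vocabulary_))
```

The learned vocabulary is exposed through the `vocabulary_` attribute, a dictionary that maps each term to its column index.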

By default, each row is a document and each column counts how often an n-gram appears in that document:

<table>
<tr>
<th></th>
<th>and</th>
<th>and this</th>
<th>document</th>
<th>document is</th>
<th>more terms...</th>
</tr>

<tr>
<td>doc0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>...</td>
<td>...</td>
</tr>

<tr>
<td>doc1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>...</td>
<td>...</td>    
</tr>
    
<tr>
<td>doc2</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>...</td>
<td>...</td>    
</tr>    
</table>


Transposing the matrix turns each row into an n-gram and each column into a document, so a row holds the frequency of one n-gram across all the documents:


<table>
<tr>
<th></th>
<th>doc1</th>
<th>doc2</th>
<th>doc3</th>
<th>doc4</th>
</tr>

<tr>
<td>and</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>

<tr>
<td>and this</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>
    
<tr>
<td>document</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>    

<tr>
<td>more terms...</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>    
</table>

### Given a text corpus and the years of publication, we can use CountVectorizer to convert the collection of text documents to a matrix of token counts

According to the [scikit-learn documentation](https://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer):
* The parameter *ngram_range* of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
* The parameter *analyzer* configures whether the features are made of word n-grams or character n-grams.
* The parameter *stop_words* defines a stop word list. If 'english', a built-in stop word list for English is used. For other languages, a custom list of words can be passed instead.
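
As a minimal sketch of the last two parameters, the snippet below fits one vectorizer with the built-in English stop word list and another with character n-grams. The input strings and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Word unigrams, with the built-in English stop word list removed.
words = CountVectorizer(analyzer="word", stop_words="english")
words.fit(["This is the first document."])

# Character bigrams instead of word n-grams.
chars = CountVectorizer(analyzer="char", ngram_range=(2, 2))
chars.fit(["data"])

print(sorted(words.vocabulary_))  # stop words such as "the" and "is" are gone
print(sorted(chars.vocabulary_))  # ['at', 'da', 'ta']
```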

In [None]:
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'Is this the second document?',
    'A third document is useful for testing purposes',
    'Is this the third document?',
]

year = [2000,2001,2002,2002,2002,2002,2000]

v = CountVectorizer(analyzer='word', ngram_range=(1, 2))

Once we have defined the CountVectorizer object, the method [*fit_transform*](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit_transform) learns the vocabulary dictionary and returns the document-term matrix.

In [None]:
X = v.fit_transform(corpus)

The method [*get_feature_names_out*](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.get_feature_names_out) returns an array that maps feature integer indices to feature names. It replaces *get_feature_names*, which was removed in scikit-learn 1.2.

In [None]:
terms = v.get_feature_names_out()
terms

### By default, each row is a document and each column is an n-gram

In [None]:
print(v.fit_transform(corpus).toarray())

### After transposing, each row holds the frequency of one n-gram in every document

In [None]:
matrix = v.fit_transform(corpus).transpose()
print(matrix.toarray())

### We can obtain the document frequency by counting the explicitly stored (nonzero) values per row (axis=1)

In [None]:
doc_frequencies = matrix.getnnz(axis=1)
print(doc_frequencies)

### We can also obtain the term frequencies by adding the values of each row

In [None]:
frequencies = matrix.sum(axis=1).A1
frequencies
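
The two measures differ whenever a term occurs more than once in the same document, as *document* does in the second sentence of the corpus. A minimal sketch on a hypothetical 2 x 3 count matrix (the values and names are made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical term-document counts: 2 terms (rows) x 3 documents (columns).
counts = csr_matrix(np.array([
    [1, 2, 0],   # term 0 occurs 3 times, in 2 documents
    [0, 0, 1],   # term 1 occurs once, in 1 document
]))

toy_term_freq = counts.sum(axis=1).A1   # total occurrences per term
toy_doc_freq = counts.getnnz(axis=1)    # number of documents containing each term

print(toy_term_freq)  # [3 1]
print(toy_doc_freq)   # [2 1]
```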

### Hapax legomena are terms of which only one instance of use is recorded. 

Removing them lets us focus on the terms that recur in the corpus. First, we define a class to store the terms.

In [None]:
class MPHash(object):
    # create from iterable 
    def __init__(self, terms):
        self.term = list(terms)
        self.code = {t:n for n, t in enumerate(self.term)}
    
    def __len__(self):
        return len(self.term)
    
    def get_code(self, term):
        return self.code.get(term)
    
    def get_term(self, code):
        return self.term[code]

In [None]:
selected = [m for m, f in enumerate(frequencies) if f > 1]
hapax_rate = 1 - len(selected) / len(frequencies)
print('Removing hapax legomena ({:.1f}%)'.format(100 * hapax_rate))
matrix = matrix[selected, :]      
term_codes = MPHash([terms[m] for m in selected])

### Now we can access codes and terms by means of the MPHash class

* The code 0 corresponds to the term *document*
* The code 1 corresponds to the term *document is*

In [None]:
term_codes.get_code("document")


In [None]:
term_codes.get_term(0)

In [None]:
term_codes.get_term(1)

In [None]:
term_codes.get_code("document is")

### We can also store the most common capitalization of each term by configuring the CountVectorizer with the lowercase option disabled

In [None]:
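# Use a separate, case-preserving vectorizer so that v keeps its settings.
v_cased = CountVectorizer(analyzer='word', ngram_range=(1, 2), lowercase=False)
matrix2 = v_cased.fit_transform(corpus).transpose()
# get_feature_names() was removed in scikit-learn 1.2
terms2 = v_cased.get_feature_names_out()
frequencies2 = matrix2.sum(axis=1).A1
forms = dict()
for t, f in zip(terms2, frequencies2):
    low = t.lower()
    # keep the most frequent cased form seen for each lowercased term
    if forms.get(low, (None, 0))[1] < f:
        forms[low] = (t, f)
capitals = {low: form for low, (form, _) in forms.items()}
capitals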

### Now let's compute the average year of documents containing every term

We provide a period of time, expressed in years, and identify the documents published within it.

The built-in function **enumerate()** adds a counter to an iterable and returns it as an enumerate object. This object can be used directly in for loops or converted into a list of tuples with list().

**enumerate(year)** pairs each document id with its year, as shown below:

In [None]:
print(list(enumerate(year)))

Let's filter the documents by the period provided:

In [None]:
period = (2000, 2001)

docs = [n for n, y in enumerate(year)
        if period[0] <= y <= period[1]]

# only documents in the period
print(docs)

Next we select the columns of the term-document matrix that correspond to the documents already filtered by year. Each row still corresponds to a term, and a nonzero entry gives the number of times the term occurs in that document.

In [None]:
#print(matrix.toarray())
tf_matrix = matrix[:, docs]
print(tf_matrix.toarray())

### Now we obtain term frequencies and document frequencies

In [None]:
tf_sum = tf_matrix.sum(axis=1).A1
df_sum = tf_matrix.getnnz(axis=1)
print(tf_sum)
print(df_sum)
# row indices of the terms to keep (all of them here, since no threshold is applied)
terms = [m for m, tf in enumerate(tf_sum)]

**Note:** At this point we could apply a threshold on term frequency and on document frequency, discarding the terms whose frequency falls below the threshold.
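
A minimal sketch of such a filter, covering only the document frequency of terms. The count matrix, the names and the value `min_df = 2` are assumptions chosen for illustration and are not part of the notebook's pipeline.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical term-document counts: 3 terms (rows) x 4 documents (columns).
toy_tf = csr_matrix(np.array([
    [1, 0, 2, 1],   # found in 3 documents
    [0, 1, 0, 0],   # found in 1 document
    [1, 1, 0, 0],   # found in 2 documents
]))

min_df = 2  # assumed threshold: keep terms found in at least 2 documents

toy_df = toy_tf.getnnz(axis=1)
kept_terms = np.flatnonzero(toy_df >= min_df)   # row indices that pass the threshold
filtered = toy_tf[kept_terms, :]

print(kept_terms)       # [0 2]
print(filtered.shape)   # (2, 4)
```

The same pattern would apply to term frequency by replacing `getnnz(axis=1)` with `sum(axis=1).A1`.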

In [None]:
rows, cols = tf_matrix.nonzero()
print(rows)
print(cols)

We create a [Compressed Sparse Row matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) using the constructor **csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])**, where data, row_ind and col_ind satisfy the relationship a[row_ind[k], col_ind[k]] = data[k].

The CSR format is often used to represent sparse matrices in machine learning because it supports efficient row slicing and fast matrix-vector products.
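
The relationship between the three arrays can be sketched with a small example. The values below are arbitrary and only illustrate the constructor.

```python
import numpy as np
from scipy.sparse import csr_matrix

data = np.array([5, 7, 9])
row_ind = np.array([0, 0, 2])
col_ind = np.array([0, 2, 1])

# The shape is given explicitly. When it is omitted, it is inferred from the
# largest indices, so trailing empty rows or columns would be left out.
a = csr_matrix((data, (row_ind, col_ind)), shape=(3, 4))

print(a.toarray())
# [[5 0 7 0]
#  [0 0 0 0]
#  [0 9 0 0]]
```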

In [None]:
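# pass the shape explicitly so that terms or documents without entries are not dropped
df_matrix = csr_matrix((ones(len(rows)), (rows, cols)), shape=tf_matrix.shape)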
print(df_matrix.toarray())

### We retrieve the years in the documents

In [None]:
year2 = [year[n] for n in docs]
print(year2)

### The last step consists of retrieving the average year of documents containing every term

First, we multiply the term-document matrix by the vector of years using the operator @ (matrix multiplication). For each term, this gives the sum of the years of the documents that contain it.

In [None]:
res = df_matrix @ year2
print(res)

Then we compute the average by dividing that sum by the document frequency of each term

In [None]:
res = df_matrix @ year2 / df_matrix.getnnz(axis=1) # @ operator = matrix multiplication
print(res)
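
The arithmetic can be checked by hand on a small case. The presence matrix and the years below are hypothetical and independent of the corpus used in this notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical presence matrix: 2 terms (rows) x 3 documents (columns).
presence = csr_matrix(np.array([
    [1, 1, 0],   # term 0 appears in documents 0 and 1
    [0, 1, 1],   # term 1 appears in documents 1 and 2
]))
toy_years = np.array([2000, 2001, 2004])

year_sums = presence @ toy_years        # sum of years per term: [4001 4005]
doc_counts = presence.getnnz(axis=1)    # documents per term: [2 2]
avg_year = year_sums / doc_counts

print(avg_year)   # [2000.5 2002.5]
```

A term that appears in no document of the period would have a document count of 0 and produce a division by zero, so such rows should be filtered out beforehand.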

Finally, we map each term to its average year

In [None]:
result = {term_codes.get_term(terms[m]):res[m] for m in range(len(res))}
result