{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction to text analysis\n", "\n", "This notebook introduces how to analyse text to identify topic trends in text corpora.\n", "\n", "[Scikit-learn](https://scikit-learn.org/) is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Settings up things" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "import pandas as pd\n", "import pickle\n", "import re\n", "import os\n", "from pathlib import Path\n", "import requests\n", "from collections import Counter\n", "import matplotlib.pyplot as plt\n", "from numpy import mean, ones\n", "from scipy.sparse import csr_matrix\n", "from nltk.corpus import stopwords" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### CountVectorizer converts a collection of text documents to a matrix of token counts\n", "max_df: when building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold\n", "ngram_range: (1,2) includes ngrams of 1 and 2 words, (2,2) includes only ngrams of 2 words.\n", "\n", "By default, rows are ngrams that appear per document:\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", " \n", "\n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", " \n", " \n", "
andand thisdocumentdocument ismore terms...
doc0001......
doc1010......
doc2110......
\n", "\n", "\n", "By doing the transpose each row becomes a ngram frequency in all the documents\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", " \n", "
doc1doc2doc3doc4
and0010
and this0010
document1101
more terms...
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Given a text corpora and the years of publication, we can use CountVectorizer to converts a collection of text documents to a matrix of token counts.\n", "\n", "According to the [scikit-learn documentation](https://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer):\n", "* The parameter *ngram_range* of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.\n", "* The parameter *analyzer* allows to configure Whether the feature should be made of word n-gram or character n-grams.\n", "* The paramenter *stopwords*, allows the definition of a stop word list. If ‘english’, a built-in stop word list for English is used. Other language lists can be configured." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corpus = [\n", " 'This is the first document.',\n", " 'This document is the second document.',\n", " 'And this is the third one.',\n", " 'Is this the first document?',\n", " 'Is this the second document?',\n", " 'A third document is useful for testing purposes',\n", " 'Is this the third document?',\n", "]\n", "\n", "year = [2000,2001,2002,2002,2002,2002,2000]\n", "\n", "v = CountVectorizer(analyzer='word', ngram_range=(1, 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have defined the CountVectorizer object, the method [*fit_transform*](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit_transform) learn the vocabulary dictionary and return the document-term matrix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = v.fit_transform(corpus)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The method [*get_feature_names*](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.get_feature_names) returns a list of feature names as an array mapping from feature integer indices to feature name." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "terms = v.get_feature_names()\n", "terms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## By default, rows are ngrams that appear per document:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(v.fit_transform(corpus).toarray())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## By doing the transpose each row becomes a ngram frequency in all the documents" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "matrix = v.fit_transform(corpus).transpose()\n", "print(matrix.toarray())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### We can obtain the doc frequency by getting the count of explicitly-stored values (nonzeros) per row (axis = 1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "doc_frequencies = matrix.getnnz(axis=1)\n", "print(doc_frequencies)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### We can also obtain the term frequencies by adding the values of each row" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "frequencies = matrix.sum(axis=1).A1\n", "frequencies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hapax legomena are terms of which only one instance of use is recorded. \n", "\n", "We can remove them in order to target our efforts in the most effective way. Firt, we define a class to store the terms." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class MPHash(object):\n", " # create from iterable \n", " def __init__(self, terms):\n", " self.term = list(terms)\n", " self.code = {t:n for n, t in enumerate(self.term)}\n", " \n", " def __len__(self):\n", " return len(self.term)\n", " \n", " def get_code(self, term):\n", " return self.code.get(term)\n", " \n", " def get_term(self, code):\n", " return self.term[code]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "selected = [m for m, f in enumerate(frequencies) if f > 1]\n", "hapax_rate = 1 - len(selected) / len(frequencies)\n", "print('Removing hapax legomena ({:.1f}%)'.format(100 * hapax_rate))\n", "matrix = matrix[selected, :] \n", "term_codes = MPHash([terms[m] for m in selected])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Now we can access codes and terms by means of the MPHash class\n", "\n", "* The code 0 corresponds to the term *document*\n", "* The code 1 corresponds to the term *document is*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "term_codes.get_code(\"document\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "term_codes.get_term(0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "term_codes.get_term(1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "term_codes.get_code(\"document is\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### We can also store most common capitalization of terms by configuring the CountVectorizer with lowercase option." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "v.lowercase = False\n", "matrix2 = v.fit_transform(corpus).transpose()\n", "terms2 = v.get_feature_names()\n", "frequencies2 = matrix2.sum(axis=1).A1 \n", "forms = dict()\n", "for t, f in zip(terms2, frequencies2):\n", " low = t.lower()\n", " if forms.get(low, (None, 0))[1] < f:\n", " forms[low] = (t, f)\n", "capitals = {k:v[0] for k, v in forms.items()}\n", "capitals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Now let's compute the average year of documents containing every term\n", "\n", "We provide a period of time using years as description and identify the documents from the period provided.\n", "\n", "The **Enumerate()** method adds a counter to an iterable and returns it in a form of enumerate object. This enumerate object can then be used directly in for loops or be converted into a list of tuples using list() method.\n", "\n", "**enumerate(year)** contains de document id and its year as is shown below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(list(enumerate(year)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's filter the documents by the period provided" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "period = (2000, 2001)\n", "\n", "docs = [n for n, y in enumerate(year)\\\n", " if period[0] <= y <= period[1]]\n", "\n", "# only documents in the period\n", "print(docs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we extract the documents in the matrix in which each row corresponds to a term and the documents (already filtered by year) in which appears represented by 1." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#print(matrix.toarray())\n", "tf_matrix = matrix[:, docs]\n", "print(tf_matrix.toarray())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Now we obtain term frequencies and document frequencies" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tf_sum = tf_matrix.sum(axis=1).A1\n", "df_sum = tf_matrix.getnnz(axis=1)\n", "print(tf_sum)\n", "print(df_sum)\n", "terms = [m for m, tf in enumerate(tf_sum)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** We could use now a term and document threshold frequency. Terms and documents with frequency less than the threshold are discarded." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rows, cols = tf_matrix.nonzero()\n", "print(rows)\n", "print(cols)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a [Compressed Sparse Row matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) using the method **csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])** where data, row_ind and col_ind satisfy the relationship a[row_ind[k], col_ind[k]] = data[k]\n", "\n", "CSR matrix is often used to represent sparse matrices in machine learning given the efficient access and matrix multiplication that it supports." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_matrix = csr_matrix((ones(len(rows)), (rows, cols)))\n", "print(df_matrix.toarray())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### We retrieve the years in the documents" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "year2 = [year[n] for n in docs]\n", "print(year2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The last step consists on retrieving the average year of documents containing every term\n", "\n", "First, we show how to multiply the matrix term and years using the operator @ (matrix multiplication)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res = df_matrix @ year2\n", "print(res)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we compute the average dividing that number by the document frequency" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res = df_matrix @ year2 / df_matrix.getnnz(axis=1) # @ operator = matrix multiplication\n", "print(res)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And finally we retrieve the term and the year" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = {term_codes.get_term(terms[m]):res[m] for m in range(len(res))}\n", "result" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }