{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# text mining (nlp) with python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Author:** Ties de Kok ([Personal Website](https://www.tiesdekok.com))
\n", "**Last updated:** June 2020 \n", "**Python version:** Python 3.7 \n", "**License:** MIT License " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** Some features (like the ToC) will only work if you run the notebook or if you use nbviewer by clicking this link: \n", "https://nbviewer.jupyter.org/github/TiesdeKok/Python_NLP_Tutorial/blob/master/NLP_Notebook.ipynb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# *Introduction*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook contains code examples to get you started with Python for Natural Language Processing (NLP) / Text Mining. \n", "\n", "In the large scheme of things there are roughly 4 steps: \n", "\n", "1. Identify a data source \n", "2. Gather the data \n", "3. Process the data \n", "4. Analyze the data \n", "\n", "This notebook only discusses step 3 and 4. If you want to learn more about step 2 see my [Python tutorial](https://github.com/TiesdeKok/LearnPythonforResearch). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Note: companion slides" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook was designed to accompany a PhD course session on NLP techniques in Accounting Research. \n", "The slides of this session are publically availabe here: [Slides](http://www.tiesdekok.com/AccountingNLP_Slides/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# *Elements / topics that are discussed in this notebook:*\n", "\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# *Table of Contents* " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [Primer on NLP tools](#tool_primer) \n", "* [Process + Clean text](#proc_clean) \n", " * [Normalization](#normalization)\n", " * [Deal with unwanted characters](#unwanted_char)\n", " * [Sentence segmentation](#sentence_seg) \n", " * [Word tokenization](#word_token)\n", " * [Lemmatization & Stemming](#lem_and_stem) \n", " * [Language modeling](#lang_model) \n", " * [Part-of-Speech tagging](#pos_tagging) \n", " * [Uni-Gram & N-Grams](#n_grams) \n", " * [Stop words](#stop_words) \n", "* [Direct feature extraction](#feature_extract) \n", " * [Feature search](#feature_search) \n", " * [Entity recognition](#entity_recognition) \n", " * [Pattern search](#pattern_search) \n", " * [Text evaluation](#text_eval) \n", " * [Language](#language) \n", " * [Dictionary counting](#dict_counting) \n", " * [Readability](#readability) \n", "* [Represent text numerically](#text_numerical) \n", " * [Bag of Words](#bows) \n", " * [TF-IDF](#tfidf) \n", " * [Word Embeddings](#word_embed) \n", " * [Spacy](#spacyEmbedding)\n", " * [Word2Vec](#Word2Vec) \n", "* [Statistical models](#stat_models) \n", " * [\"Traditional\" machine learning](#trad_ml) \n", " * [Supervised](#trad_ml_supervised) \n", " * [Naïve Bayes](#trad_ml_supervised_nb) \n", " * [Support Vector Machines (SVM)](#trad_ml_supervised_svm) \n", " * [Unsupervised](#trad_ml_unsupervised) \n", " * [Latent Dirichilet Allocation (LDA)](#trad_ml_unsupervised_lda) \n", " * [pyLDAvis](#trad_ml_unsupervised_pyLDAvis) \n", "* [Model Selection and Evaluation](#trad_ml_eval) \n", "* [Neural Networks](#nn_ml)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Primer on NLP tools [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many tools available for NLP purposes. 
\n", "The code examples below are based on what I personally like to use, it is not intended to be a comprehsnive overview. \n", "\n", "Besides build-in Python functionality I will use / demonstrate the following packages:\n", "\n", "**Standard NLP libraries**:\n", "1. `Spacy` \n", "2. `NLTK` and the higher-level wrapper `TextBlob`\n", "\n", "*Note: besides installing the above packages you also often have to download (model) data . Make sure to check the documentation!*\n", "\n", "**Standard machine learning library**:\n", "\n", "1. `scikit learn`\n", "\n", "**Specific task libraries**:\n", "\n", "There are many, just a couple of examples:\n", "\n", "1. `pyLDAvis` for visualizing LDA)\n", "2. `langdetect` for detecting languages\n", "3. `fuzzywuzzy` for fuzzy text matching\n", "4. `Gensim` for topic modelling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Get some example data [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many example datasets available to play around with, see for example this great repository: \n", "https://archive.ics.uci.edu/ml/datasets.php" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data that I will use for most of the examples is the \"Reuter_50_50 Data Set\" that is used for author identification experiments. \n", "\n", "See the details here: https://archive.ics.uci.edu/ml/datasets/Reuter_50_50 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download and load the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can't follow what I am doing here? Please see my [Python tutorial](https://github.com/TiesdeKok/LearnPythonforResearch) (although the `zipfile` and `io` operations are not very relevant)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import requests, zipfile, io, os\n", "from tqdm.notebook import tqdm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note:* for `tqdm` to work in JupyterLab you need to install the `@jupyter-widgets/jupyterlab-manager` using the puzzle icon in the left side bar. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Download and extract the zip file with the data *" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "if not os.path.exists('C50test'):\n", " r = requests.get(\"https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip\")\n", " z = zipfile.ZipFile(io.BytesIO(r.content))\n", " z.extractall()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Load the data into memory*" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "folder_dict = {'test' : 'C50test'}\n", "text_dict = {'test' : {}}" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "579743db1d6e4759ba60efb6b991da59", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=0.0, max=1.0), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "for label, folder in tqdm(folder_dict.items()):\n", " authors = os.listdir(folder)\n", " for author in authors:\n", " text_files = os.listdir(os.path.join(folder, author))\n", " for file in text_files:\n", " with open(os.path.join(folder, author, file), 'r') as text_file:\n", " text_dict[label].setdefault(author, []).append(' '.join(text_file.readlines()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note: the text comes pre-split per sentence, for the sake of example I undo this through `' '.join(text_file.readlines()`*" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain\\'s Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers Commission which is due to report before March 24, 1997. The shares fell 6p to 781p on the news.\\n \"The stock is probably dead in the water until March,\" said John Wakley, analyst at Lehman Brothers. \\n Dermott Carr, an analyst at Nikko said, \"the market is going to hang onto them for the moment but until we get a decision they will be held back.\"\\n Whatever the MMC decides many analysts expect Lang to defer a decision until after the next general election which will be called by May 22.\\n \"They will probably try to defer the decision until after the election. I don\\'t think they want the negative PR of having a large number of people fired,\" said Wakley. \\n If the deal does not go through, analysts calculate the maximum loss to Bass of 60 million, with most sums centred on the 30-40 million range.\\n \"It\\'s a maxiumum loss of 60 million for Bass if they fail and, unlike Allied, you would have to compare it to the perceived upside of doing the deal,\" said Wakley.\\n Bass said at the time of the deal it would take a one-off charge of 75 million stg for restructuring the combined business, resulting in expected annual cost savings of 90 million stg within three years. 
\\n Under the terms of the complex deal, if Bass cannot combine C-T with its own brewing business within 16 months, it has the option to put its whole shareholding to Carlsberg for 110 million stg and Carlsberg has an option to put 15 percent of C-T to Allied Domecq, which would reimburse Bass 30 million stg.\\n Bass is also entitled to receive 50 percent of all profits earnied by C-T until the merger is complete, which should give it some 30-35 million stg in a full year. Carlsberg has agreed to contribute its interests and 20 million stg in exchange for a 20 percent share in the combined Bass Breweries and Carlsberg-Tetley business.\\n C-T was a joint venture between Allied Domecq and Carlsberg formed in 1992 by the merger of their UK brewing and wholesaleing businesses.\\n -- London Newsroom +44 171 542 6437\\n'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_dict['test']['TimFarrand'][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Process + Clean text [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convert the text into a NLP representation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the text directly, but if want to use packages like `spacy` and `textblob` we first have to convert the text into a corresponding object. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Spacy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** depending on the way that you installed the language models you will need to import it differently:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "from spacy.en import English\n", "nlp = English()\n", "```\n", "OR\n", "```\n", "import en_core_web_sm\n", "nlp = en_core_web_sm.load()\n", "\n", "import en_core_web_md\n", "nlp = en_core_web_md.load()\n", "\n", "import en_core_web_lg\n", "nlp = en_core_web_lg.load()\n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import spacy\n", "import en_core_web_md\n", "nlp = en_core_web_md.load()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convert all text in the \"test\" sample to a `spacy` `doc` object using `nlp.pipe()`:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7819909a0ff74c40b2201cbd78c04984", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "spacy_text = {}\n", "for author, text_list in tqdm(text_dict['test'].items()):\n", " spacy_text[author] = list(nlp.pipe(text_list))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*A note on speed:* This is slow because we didn't disable any compontents, see this note from the documentation: \n", "> Only apply the pipeline components you need. Getting predictions from the model that you don’t actually need adds up and becomes very inefficient at scale. To prevent this, use the disable keyword argument to disable components you don’t need – either when loading a model, or during processing with nlp.pipe. See the section on disabling pipeline components for more details and examples. 
[link](https://spacy.io/usage/processing-pipelines#disabling)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "spacy.tokens.doc.Doc" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(spacy_text['TimFarrand'][0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NLTK" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "import nltk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can apply basic `nltk` operations directly to the text so we don't need to convert first. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TextBlob" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from textblob import TextBlob" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convert all text in the \"test\" sample to a `TextBlob` object using `TextBlob()`:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "textblob_text = {}\n", "for author, text_list in text_dict['test'].items():\n", " textblob_text[author] = [TextBlob(text) for text in text_list]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "textblob.blob.TextBlob" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(textblob_text['TimFarrand'][0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Normalization [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Text normalization** describes the task of transforming the text into a different (more comparable) form. \n", "\n", "This can imply many things, I will show a couple of options below:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deal with unwanted characters [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will often notice that there are characters that you don't want in your text. \n", "\n", "Let's look at this sentence for example:\n", "\n", "> \"Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain\\'s Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers\"\n", "\n", "You notice that there are some `\\` and `\\n` in there. 
These are used to define how a string should be displayed, if we print this text we get: " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers\"" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_dict['test']['TimFarrand'][0][:298]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\n", " Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers\n" ] } ], "source": [ "print(text_dict['test']['TimFarrand'][0][:298])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These special characters can cause problems in our analyses (and can be hard to debug if you are using `print` statements to inspect the data).\n", "\n", "**So how do we remove them?**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In many cases it is sufficient to simply use the `.replace()` function:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts. Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers\"" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_dict['test']['TimFarrand'][0][:298].replace('\\n', '').replace('\\\\', '')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes, however, the problem arrises because of encoding / decoding problems. \n", "\n", "In those cases you can usually do something like: " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is some π text that has to be cleaned…! it's difficult to deal with!\n", "b\"This is some text that has to be cleaned! it's difficult to deal with!\"\n" ] } ], "source": [ "problem_sentence = 'This is some \\u03c0 text that has to be cleaned\\u2026! 
it\\u0027s difficult to deal with!'\n", "print(problem_sentence)\n", "print(problem_sentence.encode().decode('unicode_escape').encode('ascii','ignore'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An alternative that is better at preserving the unicode characters would be to use `unidecode`" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "import unidecode" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "王玉\n" ] } ], "source": [ "print('\\u738b\\u7389')" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Wang Yu '" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unidecode.unidecode(u\"\\u738b\\u7389\")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"This is some p text that has to be cleaned...! it's difficult to deal with!\"" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unidecode.unidecode(problem_sentence)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sentence segmentation [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sentence segmentation refers to the task of splitting up the text by sentence. \n", "\n", "You could do this by splitting on the `.` symbol, but dots are used in many other cases as well so it is not very robust:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[\"Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts\",\n", " '\\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers Commission which is due to report before March 24, 1997',\n", " ' The shares fell 6p to 781p on the news',\n", " '\\n \"The stock is probably dead in the water until March,\" said John Wakley, analyst at Lehman Brothers',\n", " ' \\n Dermott Carr, an analyst at Nikko said, \"the mark']" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_dict['test']['TimFarrand'][0][:550].split('.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is better to use a more sophisticated implementation such as the one by `Spacy`:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "example_paragraph = spacy_text['TimFarrand'][0]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\n", " ,\n", " Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers Commission which is due to report before March 24, 1997.,\n", " The shares fell 6p to 781p on the news.\n", " ,\n", " \"The stock is probably dead in the water until March,\" said John Wakley, analyst at Lehman Brothers. 
\n", " ,\n", " Dermott Carr, an analyst at Nikko said, \"the market is going to hang onto them for the moment]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentence_list = [s for s in example_paragraph.sents]\n", "sentence_list[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the returned object is still a `spacy` object:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "spacy.tokens.span.Span" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(sentence_list[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note:* `spacy` sentence segmentation relies on the text being capitalized, so make sure you didn't convert it to all lower case before running this operation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply to all texts (for use later on):" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "bd91842ae5e54926b754e5be67f3567f", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "spacy_sentences = {}\n", "for author, text_list in tqdm(spacy_text.items()):\n", " spacy_sentences[author] = [list(text.sents) for text in text_list]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\n", " ,\n", " Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers Commission which is due to report before March 24, 1997.,\n", " The shares fell 6p to 781p on the news.\n", " ]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spacy_sentences['TimFarrand'][0][:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Word tokenization [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Word tokenization means to split the sentence (or text) up into words." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\n", " " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example_sentence = spacy_sentences['TimFarrand'][0][0]\n", "example_sentence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A word is called a `token` in this context (hence `tokenization`), using `spacy`:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Shares,\n", " in,\n", " brewing,\n", " -,\n", " to,\n", " -,\n", " leisure,\n", " group,\n", " Bass,\n", " Plc,\n", " are,\n", " likely,\n", " to,\n", " be,\n", " held]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token_list = [token for token in example_sentence]\n", "token_list[0:15]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lemmatization & Stemming [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In some cases you want to convert a word (i.e. token) into a more general representation. \n", "\n", "For example: convert \"car\", \"cars\", \"car's\", \"cars'\" all into the word `car`.\n", "\n", "This is generally done through lemmatization / stemming (different approaches trying to achieve a similar goal). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Spacy**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Space offers build-in functionality for lemmatization:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['share',\n", " 'in',\n", " 'brewing',\n", " '-',\n", " 'to',\n", " '-',\n", " 'leisure',\n", " 'group',\n", " 'Bass',\n", " 'Plc',\n", " 'be',\n", " 'likely',\n", " 'to',\n", " 'be',\n", " 'hold']" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lemmatized = [token.lemma_ for token in example_sentence]\n", "lemmatized[0:15]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NLTK**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the NLTK libary we can also use the more aggressive Porter Stemmer" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "from nltk.stem.porter import PorterStemmer\n", "stemmer = PorterStemmer()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['share',\n", " 'in',\n", " 'brew',\n", " '-',\n", " 'to',\n", " '-',\n", " 'leisur',\n", " 'group',\n", " 'bass',\n", " 'plc',\n", " 'are',\n", " 'like',\n", " 'to',\n", " 'be',\n", " 'held']" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stemmed = [stemmer.stem(token.text) for token in example_sentence]\n", "stemmed[0:15]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Compare**:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Original | Spacy Lemma | NLTK Stemmer\n", "-----------------------------------------\n", " Shares | share | share\n", " in | in | in\n", " brewing | brewing | brew\n", " - | - | -\n", " to | to | to\n", " - | - | -\n", " leisure | 
leisure | leisur\n", " group | group | group\n", " Bass | Bass | bass\n", " Plc | Plc | plc\n", " are | be | are\n", " likely | likely | like\n", " to | to | to\n", " be | be | be\n", " held | hold | held\n" ] } ], "source": [ "print(' Original | Spacy Lemma | NLTK Stemmer')\n", "print('-' * 41)\n", "for original, lemma, stem in zip(token_list[:15], lemmatized[:15], stemmed[:15]):\n", " print(str(original).rjust(10, ' '), ' | ', str(lemma).rjust(10, ' '), ' | ', str(stem).rjust(10, ' '))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In my experience it is usually best to use lemmatization instead of a stemmer. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Language modeling [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Text is inherently structured in complex ways, we can often use some of this underlying structure. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part-of-Speech tagging [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Part of speech tagging refers to the identification of words as nouns, verbs, adjectives, etc. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `Spacy`:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(Shares, 'NOUN'),\n", " (in, 'ADP'),\n", " (brewing, 'NOUN'),\n", " (-, 'PUNCT'),\n", " (to, 'ADP'),\n", " (-, 'PUNCT'),\n", " (leisure, 'NOUN'),\n", " (group, 'NOUN'),\n", " (Bass, 'PROPN'),\n", " (Plc, 'PROPN')]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pos_list = [(token, token.pos_) for token in example_sentence]\n", "pos_list[0:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Uni-Gram & N-Grams [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Obviously a sentence is not a random collection of words, the sequence of words has information value. \n", "\n", "A simple way to incorporate some of this sequence is by using what is called `n-grams`. \n", "An `n-gram` is nothing more than a a combination of `N` words into one token (a uni-gram token is just one word). 
\n", "\n", "So we can convert `\"Sentence about flying cars\"` into a list of bigrams:\n", "\n", "> Sentence-about, about-flying, flying-cars \n", "\n", "See my slide on N-Grams for a more comprehensive example: [click here](http://www.tiesdekok.com/AccountingNLP_Slides/#14)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `NLTK`:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['are-likely', 'likely-to', 'to-be', 'be-held', 'held-back']" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bigram_list = ['-'.join(x) for x in nltk.bigrams([token.text for token in example_sentence])]\n", "bigram_list[10:15]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `spacy`" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "def tokenize_without_punctuation(sen_obj):\n", " return [token.text for token in sen_obj if token.is_alpha]" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "def create_ngram(sen_obj, n, sep = '-'):\n", " token_list = tokenize_without_punctuation(sen_obj)\n", " number_of_tokens = len(token_list)\n", " ngram_list = []\n", " for i, token in enumerate(token_list[:-n+1]):\n", " ngram_item = [token_list[i + ii] for ii in range(n)]\n", " ngram_list.append(sep.join(ngram_item))\n", " return ngram_list" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Shares-in', 'in-brewing', 'brewing-to', 'to-leisure', 'leisure-group']" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "create_ngram(example_sentence, 2)[:5]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Shares-in-brewing',\n", " 'in-brewing-to',\n", " 'brewing-to-leisure',\n", " 'to-leisure-group',\n", " 'leisure-group-Bass']" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "create_ngram(example_sentence, 3)[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stop words [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Depending on what you are trying to do it is possible that there are many words that don't add any information value to the sentence. \n", "\n", "The primary example are stop words. \n", "\n", "Sometimes you can improve the accuracy of your model by removing stop words." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `Spacy`:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "no_stop_words = [token for token in example_sentence if not token.is_stop]" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Shares, brewing, -, -, leisure, group, Bass, Plc, likely, held]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "no_stop_words[:10]" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Shares, in, brewing, -, to, -, leisure, group, Bass, Plc]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token_list[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note* we can also remove punctuation in the same way:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Shares, brewing, leisure, group, Bass, Plc, likely, held, Britain, Trade]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[token for token in example_sentence if not token.is_stop and token.is_alpha][:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrap everything into one function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Basic SpaCy text processing function**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Split into sentences\n", "2. Apply lemmatizer, remove top words, remove punctuation\n", "3. Clean up the sentence using `textacy`" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "def process_text_custom(text):\n", " sentences = list(nlp(text, disable=['tagger', 'ner', 'entity_linker', 'textcat', 'entitry_ruler']).sents)\n", " lemmatized_sentences = []\n", " for sentence in sentences:\n", " lemmatized_sentences.append([token.lemma_ for token in sentence if not token.is_stop and token.is_alpha])\n", " return [' '.join(sentence) for sentence in lemmatized_sentences]" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "1daf8d39277d48adb85050ff35d9101e", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "spacy_text_clean = {}\n", "for author, text_list in tqdm(text_dict['test'].items()):\n", " lst = []\n", " for text in text_list:\n", " lst.append(process_text_custom(text))\n", " spacy_text_clean[author] = lst" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note:* that this would take quite a long time if we didn't disable some of the components. 
" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of sentences: 58973\n" ] } ], "source": [ "count = 0\n", "for author, texts in spacy_text_clean.items():\n", " for text in texts:\n", " count += len(text)\n", "print('Number of sentences:', count)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Result" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Shares brew leisure group Bass Plc likely hold Britain Trade Industry secretary Ian Lang decide allow propose merge brewer Carlsberg Tetley say analyst',\n", " 'Earlier Lang announce Bass deal refer Monoplies Mergers Commission report March',\n", " 'share fall news']" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spacy_text_clean['TimFarrand'][0][:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note:* the quality of the input text is not great, so the sentence segmentation is also not great (without further tweaking)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Direct feature extraction [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have pre-processed our text into something that we can use for direct feature extraction or to convert it to a numerical representation. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature search [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Entity recognition [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is often useful / relevant to extract entities that are mentioned in a piece of text. \n", "\n", "SpaCy is quite powerful in extracting entities, however, it doesn't work very well on lowercase text. \n", "\n", "Given that \"token.lemma\\_\" removes capitalization I will use `spacy_sentences` for this example." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"The stock is probably dead in the water until March,\" said John Wakley, analyst at Lehman Brothers. 
\n", " " ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example_sentence = spacy_sentences['TimFarrand'][0][3]\n", "example_sentence" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(March, 'DATE'), (John Wakley, 'PERSON'), (Lehman Brothers, 'ORG')]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[(i, i.label_) for i in nlp(example_sentence.text).ents]" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "British pub-to-hotel group Greenalls Plc on Thursday reported a 48 percent rise in profits before exceptional items to 148.7 million pounds ($246.4 million), driven by its acquisition of brewer Boddington in November 1995.\n", " " ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example_sentence = spacy_sentences['TimFarrand'][4][0]\n", "example_sentence" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(British, 'NORP'),\n", " (Greenalls Plc, 'ORG'),\n", " (Thursday, 'DATE'),\n", " (48 percent, 'PERCENT'),\n", " (148.7 million pounds, 'MONEY'),\n", " ($246.4 million, 'MONEY'),\n", " (Boddington, 'ORG'),\n", " (November 1995, 'DATE')]" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[(i, i.label_) for i in nlp(example_sentence.text).ents]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pattern search [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the build-in `re` (regular expression) library you can pattern match nearly anything you want. \n", "\n", "I will not go into details about regular expressions but see here for a tutorial: \n", "https://regexone.com/references/python " ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TIP**: Use [Pythex.org](https://pythex.org/) to try out your regular expression\n", "\n", "Example on Pythex: click here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example 1:** " ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "string_1 = 'Ties de Kok (#IDNUMBER: 123-AZ). Rest of text...'\n", "string_2 = 'Philip Joos (#IDNUMBER: 663-BY). 
Rest of text...'" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "pattern = r'#IDNUMBER: (\\d\\d\\d-\\w\\w)'" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "123-AZ\n", "663-BY\n" ] } ], "source": [ "print(re.findall(pattern, string_1)[0])\n", "print(re.findall(pattern, string_2)[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example 2:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If a sentence contains the word 'million' return True, otherwise return False" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Analysts forecast pretax profit range million stg restructure cost million time\n", "restructure cost million anticipate bulk million stem closure small production plant France\n", "Cadbury drink business turn million stg trade profit million half entirely contribution Dr Pepper\n", "Campbell estimate UK beverage contribute million stg operate profit million time\n", "Broadly analyst expect pretty flat performance group confectionery business consensus forecast million stg operate profit\n", "average analyst calculate beverage chip trade profit million\n", "sale percent stake Coca Cola amp Schweppes Beverages CCSB operation Coca Cola Enterprises June million stg analyst want clear statement strategy company\n", "far analyst company say shareholder expect return investment emerge market large far million Russian plant\n", "Cadbury announce investment million stg build new plant Wrocoaw Poland joint venture China cost million\n", "Net debt billion end fall million end result CCSB sale provide acquisition\n" ] } ], "source": [ "for sen in spacy_text_clean['TimFarrand'][2]:\n", " TERM = 'million'\n", " if re.search('million', sen, flags= re.IGNORECASE):\n", " print(sen)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text evaluation [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Besides feature search there are also many ways to analyze the text as a whole. 
\n", "\n", "Let's, for example, evaluate the following paragraph:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Soft drink confectionery group Cadbury Schweppes Plc expect report solid percent rise half profit Wednesday face question performance soft drink main question success relaunch brand say Mark Duffy food manufacture analyst SBC Warburg Competitor Sprite own Coca Cola see agressive market push rank fast grow brand Cadbury Dr Pepper Analysts forecast pretax profit range million stg restructure cost million time dividend penny expect restructure cost million anticipate bulk million stem closure small'" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example_paragraph = ' '.join([x for x in spacy_text_clean['TimFarrand'][2]])\n", "example_paragraph[:500]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Language [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the `spacy-langdetect` package it is easy to detect the language of a piece of text" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "from spacy_langdetect import LanguageDetector\n", "nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'language': 'en', 'score': 0.9999970401265338}\n" ] } ], "source": [ "print(nlp(example_paragraph)._.language)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Readability [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generally I'd recommend to calculate the readability metrics by yourself as they don't tend to be that difficult to compute. 
However, there are packages out there that can help, such as `spacy_readability`" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "from spacy_readability import Readability" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "nlp.add_pipe(Readability(), name='readability', last=True)" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8.412857142857145\n", "0\n" ] } ], "source": [ "doc = nlp(\"I am some really difficult text to read because I use obnoxiously large words.\")\n", "print(doc._.flesch_kincaid_grade_level)\n", "print(doc._.smog)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Manual example:** FOG index" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "import syllapy" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "def calculate_fog(document):\n", " doc = nlp(document, disable=['tagger', 'ner', 'entity_linker', 'textcat', 'entitry_ruler'])\n", " sen_list = list(doc.sents)\n", " num_sen = len(sen_list)\n", "\n", " num_words = 0\n", " num_complex_words = 0\n", " for sen_obj in sen_list:\n", " words_in_sen = [token.text for token in sen_obj if token.is_alpha]\n", " num_words += len(words_in_sen)\n", " num_complex = 0\n", " for word in words_in_sen:\n", " num_syl = syllapy.count(word.lower())\n", " if num_syl > 2:\n", " num_complex += 1\n", " num_complex_words += num_complex\n", " \n", " fog = 0.4 * ((num_words / num_sen) + ((num_complex_words / num_words)*100))\n", " return {'fog' : fog, \n", " 'num_sen' : num_sen, \n", " 'num_words' : num_words, \n", " 'num_complex_words' : num_complex_words}" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'fog': 13.42327889849504,\n", " 'num_sen': 36,\n", " 'num_words': 347,\n", " 'num_complex_words': 83}" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "calculate_fog(example_paragraph)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text similarity" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using `fuzzywuzzy`" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "from fuzzywuzzy import fuzz" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "91" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fuzz.ratio(\"fuzzy wuzzy was a bear\", \"wuzzy fuzzy was a bear\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using `spacy`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Spacy can provide a similary score based on the semantic similarity ([link](https://spacy.io/usage/vectors-similarity))" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0000000623731768" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokens_1 = nlp(\"fuzzy wuzzy was a bear\")\n", "tokens_2 = nlp(\"wuzzy fuzzy was a bear\")\n", "\n", "tokens_1.similarity(tokens_2)" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8127869114665882" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "tokens_1 = nlp(\"Tom believes German cars are the best.\")\n", "tokens_2 = nlp(\"Sarah recently mentioned that she would like to go on holiday to Germany.\")\n", "\n", "tokens_1.similarity(tokens_2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Term (dictionary) counting [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A common technique for basic NLP insights is to create simple metrics based on term counts. \n", "\n", "These are relatively easy to implement." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example 1:" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "word_dictionary = ['soft', 'first', 'most', 'be']" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "soft 2\n", "first 0\n", "most 0\n", "be 7\n" ] } ], "source": [ "for word in word_dictionary:\n", " print(word, example_paragraph.count(word))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example 2:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4\n", "1\n" ] }, { "data": { "text/plain": [ "0.8" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pos = ['great', 'agree', 'increase']\n", "neg = ['bad', 'disagree', 'decrease']\n", "\n", "sentence = '''According to the president everything is great, great, \n", "and great even though some people might disagree with those statements.'''\n", "\n", "pos_count = 0\n", "for word in pos:\n", " pos_count += sentence.lower().count(word)\n", "print(pos_count)\n", "\n", "neg_count = 0\n", "for word in neg:\n", " neg_count += sentence.lower().count(word)\n", "print(neg_count)\n", "\n", "pos_count / (neg_count + pos_count)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Getting the total number of words is also easy:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "19" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "num_tokens = len([token for token in nlp(sentence) if token.is_alpha])\n", "num_tokens" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example 3:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also save the count per word" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "pos_count_dict = {}\n", "for word in pos:\n", " pos_count_dict[word] = sentence.lower().count(word)" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'great': 3, 'agree': 1, 'increase': 0}" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pos_count_dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note:* `.lower()` is actually quite slow, if you have a lot of words / sentences it is recommend to minimize the amount of `.lower()` operations that you have to make." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Represent text numerically [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bag of Words [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sklearn includes the `CountVectorizer` and `TfidfVectorizer` function. 
\n", "\n", "For details, see the documentation: \n", "[TF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) \n", "[TFIDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)\n", "\n", "*Note 1:* these functions also provide a variety of built-in preprocessing options (e.g. ngrames, remove stop words, accent stripper).\n", "\n", "*Note 2:* example based on the following website [click here](http://ethen8181.github.io/machine-learning/clustering_old/tf_idf/tf_idf.html)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.feature_extraction.text import TfidfVectorizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Simple example:" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "doc_1 = \"The sky is blue.\"\n", "doc_2 = \"The sun is bright today.\"\n", "doc_3 = \"The sun in the sky is bright.\"\n", "doc_4 = \"We can see the shining sun, the bright sun.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculate term frequency:" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "vectorizer = CountVectorizer(stop_words='english')\n", "tf = vectorizer.fit_transform([doc_1, doc_2, doc_3, doc_4])" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['blue', 'bright', 'shining', 'sky', 'sun', 'today'] \n", "\n", "[1 0 0 1 0 0]\n", "[0 1 0 0 1 1]\n", "[0 1 0 1 1 0]\n", "[0 1 1 0 2 0]\n" ] } ], "source": [ "print(vectorizer.get_feature_names(), '\\n')\n", "for doc_tf_vector in tf.toarray():\n", " print(doc_tf_vector)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TF-IDF [(to top)](#toc)" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "transformer = TfidfVectorizer(stop_words='english')\n", "tfidf = transformer.fit_transform([doc_1, doc_2, doc_3, doc_4])" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "code_folding": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.78528828 0. 0. 0.6191303 0. 0. ]\n", "[0. 0.47380449 0. 0. 0.47380449 0.74230628]\n", "[0. 0.53256952 0. 0.65782931 0.53256952 0. ]\n", "[0. 0.36626037 0.57381765 0. 0.73252075 0. 
]\n" ] } ], "source": [ "for doc_vector in tfidf.toarray():\n", " print(doc_vector)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More elaborate example:" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "clean_paragraphs = []\n", "for author, value in spacy_text_clean.items():\n", " for article in value:\n", " clean_paragraphs.append(' '.join([x for x in article]))" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2500" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(clean_paragraphs)" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "transformer = TfidfVectorizer(stop_words='english')\n", "tfidf_large = transformer.fit_transform(clean_paragraphs)" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of vectors: 2500\n", "Number of words in dictionary: 21978\n" ] } ], "source": [ "print('Number of vectors:', len(tfidf_large.toarray()))\n", "print('Number of words in dictionary:', len(tfidf_large.toarray()[0]))" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<2500x21978 sparse matrix of type ''\n", "\twith 410121 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfidf_large" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word Embeddings [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Spacy [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `en_core_web_lg` language model comes with GloVe vectors trained on the Common Crawl dataset ([link](https://spacy.io/models/en#en_core_web_lg))" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The True 4.70935 False\n", "Dutch True 6.381229 False\n", "word True 5.8387117 False\n", "for True 4.8435082 False\n", "peanut True 7.085804 False\n", "butter True 7.466713 False\n", "is True 4.890306 False\n", "pindakaas False 0.0 False\n", "did True 5.284421 False\n", "you True 5.1979666 False\n", "know True 5.160699 False\n", "that True 4.8260193 False\n", "This True 5.0461264 False\n", "is True 4.890306 False\n", "a True 5.306696 False\n", "typpo False 0.0 True\n" ] } ], "source": [ "tokens = nlp(\"The Dutch word for peanut butter is 'pindakaas', did you know that? 
This is a typpo.\")\n", "\n", "for token in tokens:\n", " if token.is_alpha:\n", " print(token.text, token.has_vector, token.vector_norm, token.is_oov)" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The token: \"Car\" has the following vector (dimension: 300)\n" ] }, { "data": { "text/plain": [ "array([ 2.0987e-01, 4.6481e-01, -2.4238e-01, -6.5751e-02, 6.0856e-01,\n", " -3.4698e-01, -2.5331e-01, -4.2590e-01, -2.2277e-01, 2.2913e+00,\n", " -3.3853e-01, 2.3275e-01, -2.7511e-01, 2.4064e-01, -1.0697e+00,\n", " -2.6978e-01, -8.0733e-01, 1.8698e+00, 4.5562e-01, -1.4469e-01,\n", " 1.6246e-02, -3.5473e-01, 7.6152e-01, -6.8589e-02, 1.2156e-02,\n", " 9.0520e-03, 1.1131e-01, -3.0746e-01, 2.4168e-01, 1.1400e-01,\n", " 4.3952e-01, -6.6594e-01, -7.3198e-02, 8.0566e-01, 1.1748e-01,\n", " -3.8758e-01, 1.0691e-01, 3.3697e-01, -1.3188e-01, 1.9364e-01,\n", " 5.5553e-01, -3.4029e-01, 1.7059e-01, 4.0736e-01, -1.6150e-01,\n", " 7.0302e-02, 6.7772e-02, -8.1763e-01, 3.0645e-01, -9.9862e-03,\n", " 9.4606e-02, -5.9763e-01, 1.4192e-01, 1.4857e-01, -3.1535e-01,\n", " 9.9092e-02, 2.0673e-01, -4.4041e-01, 2.1519e-01, -4.1294e-01,\n", " 2.6374e-01, -1.5493e-01, 2.4739e-01, 4.2090e-01, 1.8768e-01,\n", " 4.6904e-02, 9.6848e-02, 2.7431e-02, 1.0633e-01, 3.1926e-01,\n", " -7.6260e-01, -8.8373e-02, 3.7519e-01, 4.7369e-01, -7.3557e-01,\n", " -1.0760e-01, -2.6557e-02, -5.1079e-01, -1.8886e-01, 2.8679e-01,\n", " 6.5798e-02, 5.7129e-01, 2.5056e-01, 7.3858e-02, 8.4700e-03,\n", " 1.5158e-02, 7.3570e-01, 6.2549e-01, 5.1600e-02, -2.5802e-01,\n", " -8.1203e-02, 1.3731e-01, 1.8809e-01, -6.5871e-01, -2.2361e-01,\n", " -3.3318e-01, 1.5853e-01, 5.1523e-01, 5.0259e-01, -1.6894e-01,\n", " -8.6465e-02, 2.5036e-01, -1.7419e-01, -2.7723e-02, 1.1262e-01,\n", " -4.6449e-01, 1.6956e-01, 2.8931e-01, -1.3187e-01, 4.6368e-01,\n", " -2.9348e-01, -3.1244e-01, 6.5886e-01, -4.7842e-01, 1.4754e-01,\n", " -3.0646e-01, 4.3847e-01, 1.7684e-01, -1.1968e-01, -3.1002e-02,\n", " -1.2228e-01, -5.6424e-01, 1.5289e-01, -7.9389e-01, -3.6731e-01,\n", " 1.6918e-01, -9.5210e-02, 1.6490e-01, 1.5936e-01, 1.2460e-01,\n", " 3.8846e-01, 2.3019e-01, -1.3054e-01, -2.1932e-01, -2.6782e-01,\n", " -6.0745e-01, -3.4826e-01, 1.7656e-01, -1.0351e-01, -2.2750e-01,\n", " -1.6111e+00, -4.0504e-01, 1.0872e+00, -1.6391e-01, 6.5586e-02,\n", " -1.0632e-01, -1.4014e-01, 1.7712e-01, 7.1100e-01, 2.0313e-01,\n", " -5.0138e-01, 1.7291e-01, -5.3208e-02, -4.0668e-01, 1.4907e-01,\n", " -3.0631e-01, 4.6572e-01, 3.7977e-01, -1.3336e-01, -7.6937e-02,\n", " 1.1803e-02, -1.1185e-01, 7.0364e-01, -6.8615e-02, 5.8586e-01,\n", " -5.9890e-01, -2.8104e-01, 4.9674e-01, 4.5867e-01, 1.6291e-01,\n", " -2.8317e-01, 3.8870e-01, -3.9882e-01, 2.2407e-01, -2.3704e-01,\n", " -2.3155e-01, 6.7882e-02, 8.3828e-01, 1.3231e-01, 2.9778e-01,\n", " 1.8471e-01, -9.7415e-05, -6.9993e-01, 4.6959e-03, -3.5461e-01,\n", " -9.6413e-02, 1.0312e-01, 8.5293e-02, -2.6909e-01, 4.3886e-01,\n", " 3.1275e-01, 2.2829e-01, 4.8072e-01, 1.8399e-01, -1.7628e-01,\n", " -4.8322e-01, 9.5676e-02, -2.4499e-01, 5.8915e-02, -3.9355e-02,\n", " -4.6954e-01, -2.6272e-01, 1.5462e-01, -1.8055e-01, 1.6881e-03,\n", " 5.7027e-02, -6.7284e-02, 2.4853e-01, 3.5735e-01, 1.4325e-01,\n", " -4.9276e-01, -2.9321e-02, 5.1167e-02, 4.9620e-01, 3.7308e-01,\n", " 4.0203e-01, 9.2905e-02, 7.4061e-01, -3.3765e-01, -3.5641e-01,\n", " 6.1675e-01, -9.5517e-01, -2.7492e-01, 2.2079e-01, -2.8898e-01,\n", " -1.5504e-01, -3.1433e-01, 5.8383e-01, -2.6138e-02, -2.7755e-01,\n", " -4.7184e-02, 
1.0504e-01, -4.2419e-01, -1.6414e-01, -5.0711e-01,\n", " 4.2617e-01, 3.6889e-01, 4.3267e-01, -4.4480e-03, 5.6442e-01,\n", " -3.0964e-02, 7.7629e-02, 2.2218e-01, 1.2818e-01, -1.6235e-01,\n", " -2.2912e-01, 4.9174e-01, -5.1937e-01, -2.0793e-01, -3.6868e-01,\n", " -5.5714e-01, -1.9930e-01, 2.9782e-01, 7.5921e-02, -3.9895e-01,\n", " 8.1692e-01, -1.0221e-01, -3.8049e-01, 1.9906e-01, -1.9875e-02,\n", " 7.3431e-02, -1.3882e-01, 2.7914e-01, -4.5367e-01, 2.9227e-01,\n", " -5.5489e-01, -4.2121e-01, 5.5667e-01, -4.5230e-01, -1.1956e-01,\n", " 1.3504e-01, -2.3580e-01, 7.4221e-01, -2.7890e-01, -5.4580e-02,\n", " 3.1944e-01, 3.6717e-01, 1.3430e-01, 1.3629e-01, -9.7458e-02,\n", " -6.0310e-01, -1.7762e-01, 2.5910e-01, 3.3150e-01, 2.2701e-01,\n", " 6.3664e-01, 1.5324e-01, -3.2894e-01, -3.6749e-01, -2.0328e-01,\n", " -1.1924e+00, -4.6395e-01, 6.6984e-01, -4.9404e-01, 4.4154e-01,\n", " -4.3699e-01, 2.3538e-01, 3.2135e-01, 2.6649e-01, 2.2438e-01],\n", " dtype=float32)" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token = nlp('Car')\n", "print('The token: \"{}\" has the following vector (dimension: {})'.format(token.text, len(token.vector)))\n", "token.vector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Word2Vec [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Simple example below is from: https://medium.com/@mishra.thedeepak/word2vec-in-minutes-gensim-nlp-python-6940f4e00980" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note:* you might have to run `nltk.download('brown')` to install the NLTK corpus files" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "import gensim\n", "from nltk.corpus import brown" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "sentences = brown.sents()" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "model = gensim.models.Word2Vec(sentences, min_count=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save model" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "model.save('brown_model')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load model" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [], "source": [ "model = gensim.models.Word2Vec.load('brown_model')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find words most similar to 'mother':" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('father', 0.9847322702407837), ('husband', 0.9656894207000732), ('wife', 0.9496790170669556), ('friend', 0.9323333501815796), ('son', 0.9279097318649292), ('nickname', 0.9207977652549744), ('eagle', 0.9097722768783569), ('addiction', 0.9071668982505798), ('voice', 0.9051918983459473), ('patient', 0.8966456055641174)]\n" ] } ], "source": [ "print(model.wv.most_similar(\"mother\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find the odd one out:" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "cereal\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "D:\\anaconda\\envs\\limpergPython\\lib\\site-packages\\gensim\\models\\keyedvectors.py:877: FutureWarning: arrays to stack must be passed as a \"sequence\" type such as list or 
tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.\n", " vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)\n" ] } ], "source": [ "print(model.wv.doesnt_match(\"breakfast cereal dinner lunch\".split()))" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "garden\n" ] } ], "source": [ "print(model.wv.doesnt_match(\"pizza pasta garden fries\".split()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Retrieve vector representation of the word \"human\"" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-0.40230748, -0.36424252, 0.48136342, 0.37950972, 0.25889766,\n", " -0.10284429, 0.2539682 , -0.557558 , -1.0109891 , -0.30390584,\n", " 0.02308058, -0.9227236 , 0.06572528, -0.24571045, -0.17627639,\n", " 0.35974783, -0.29690138, -0.25977215, 0.465111 , -1.5026555 ,\n", " 0.23448153, -0.79958564, -0.6682266 , -0.51277363, -0.11112369,\n", " -1.4914587 , -0.30484447, 1.3466982 , -0.45936054, 0.02780625,\n", " 0.31517667, -0.12471037, 0.46333146, -0.29451668, 0.28516975,\n", " 1.3195679 , 0.02986159, 0.27836317, -0.5356812 , -0.5574794 ,\n", " 0.55741835, -0.3692916 , 0.3067411 , -0.62016165, 0.6085465 ,\n", " 0.6336735 , 0.9925447 , -0.2553504 , -0.3593044 , -0.29228973,\n", " -0.05774796, -0.22645272, -0.594325 , -0.19128117, 0.13758877,\n", " 0.58251387, -0.12266693, -0.33289537, -0.81493866, 0.64220285,\n", " -0.40921453, 1.7995448 , 0.98320687, 0.66162825, -0.03371862,\n", " 0.30391327, 0.30519032, -0.02499808, 0.46001107, -0.5412774 ,\n", " -0.14508785, 0.47390515, -0.01815019, 0.39801887, 0.33498788,\n", " -0.70357895, 0.80516887, 0.08044272, -0.70585257, -0.7256744 ,\n", " -0.95714486, -0.12571876, -0.20877206, -0.456315 , 0.7478423 ,\n", " -0.25637153, 0.78873783, 0.24834621, 0.4455648 , 0.1293853 ,\n", " 0.17152755, -0.30077967, 0.5803442 , 0.16445744, 0.60369337,\n", " -0.2301575 , -0.19547687, -0.36981392, 0.23723377, 0.24412107],\n", " dtype=float32)" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.wv['human']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Statistical models [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## \"Traditional\" machine learning [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The library to use for machine learning is scikit-learn ([\"sklearn\"](http://scikit-learn.org/stable/index.html))." 
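] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before the supervised examples below, note that scikit-learn estimators work on numerical features, so raw text always passes through a vectorizer first. The next cell is a minimal sketch of that generic workflow (vectorize, fit, predict) on a handful of made-up toy sentences; it is purely illustrative and does not use the Reuters articles from the rest of this section." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Minimal workflow sketch -- assumption: tiny made-up texts and labels, purely for illustration\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.naive_bayes import MultinomialNB\n", "\n", "toy_texts = ['stocks rally on strong earnings', 'bank posts record quarterly profit', 'team wins the cup final', 'star striker scores twice']\n", "toy_labels = ['finance', 'finance', 'sports', 'sports']\n", "\n", "toy_vect = TfidfVectorizer()                      # step 1: turn text into numeric features\n", "X_toy = toy_vect.fit_transform(toy_texts)\n", "\n", "toy_clf = MultinomialNB().fit(X_toy, toy_labels)  # step 2: fit an estimator on those features\n", "new_doc = toy_vect.transform(['profit warning hits bank stocks'])  # reuse the fitted vectorizer for new text\n", "print(toy_clf.predict(new_doc))                   # step 3: predict the label of unseen text"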
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Supervised [(to top)](#toc)" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score, KFold, train_test_split\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn import metrics\n", "import joblib" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convert the data into a pandas dataframe (so that we can input it easier)" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [], "source": [ "article_list = []\n", "for author, value in spacy_text_clean.items():\n", " for article in value:\n", " article_list.append((author, ' '.join([x for x in article])))" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [], "source": [ "article_df = pd.DataFrame(article_list, columns=['author', 'text'])" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
authortext
1325LydiaZajcCanadian unit Wal Mart Stores Inc bow Canadian...
1516MarkBendeichAustralia labour market watchdog expect award ...
2121SarahDavisonTemperatures rise Hong Kong prepare midsummer ...
2226SimonCowellBritish composite insurer Commercial Union Plc...
1914PierreTranFrench defence electronic firm Thomson CSF soo...
\n", "
" ], "text/plain": [ " author text\n", "1325 LydiaZajc Canadian unit Wal Mart Stores Inc bow Canadian...\n", "1516 MarkBendeich Australia labour market watchdog expect award ...\n", "2121 SarahDavison Temperatures rise Hong Kong prepare midsummer ...\n", "2226 SimonCowell British composite insurer Commercial Union Plc...\n", "1914 PierreTran French defence electronic firm Thomson CSF soo..." ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "article_df.sample(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split the sample into a training and test sample" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(article_df.text, article_df.author, test_size=0.20, random_state=3561)" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2000 500\n" ] } ], "source": [ "print(len(X_train), len(X_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train and evaluate function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Simple function to train (i.e. fit) and evaluate the model" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [], "source": [ "def train_and_evaluate(clf, X_train, X_test, y_train, y_test):\n", " \n", " clf.fit(X_train, y_train)\n", " \n", " print(\"Accuracy on training set:\")\n", " print(clf.score(X_train, y_train))\n", " print(\"Accuracy on testing set:\")\n", " print(clf.score(X_test, y_test))\n", " \n", " y_pred = clf.predict(X_test)\n", " \n", " print(\"Classification Report:\")\n", " print(metrics.classification_report(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Naïve Bayes estimator [(to top)](#toc)" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [], "source": [ "from sklearn.naive_bayes import MultinomialNB" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define pipeline" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [], "source": [ "clf = Pipeline([\n", " ('vect', TfidfVectorizer(strip_accents='unicode',\n", " lowercase = True,\n", " max_features = 1500,\n", " stop_words='english'\n", " )),\n", " \n", " ('clf', MultinomialNB(alpha = 1,\n", " fit_prior = True\n", " )\n", " ),\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Train and show evaluation stats" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy on training set:\n", "0.8345\n", "Accuracy on testing set:\n", "0.72\n", "Classification Report:\n", " precision recall f1-score support\n", "\n", " AaronPressman 0.82 1.00 0.90 9\n", " AlanCrosby 0.55 0.92 0.69 12\n", " AlexanderSmith 0.86 0.60 0.71 10\n", " BenjaminKangLim 0.75 0.27 0.40 11\n", " BernardHickey 0.75 0.30 0.43 10\n", " BradDorfman 0.80 1.00 0.89 8\n", " DarrenSchuettler 0.58 0.78 0.67 9\n", " DavidLawder 1.00 0.60 0.75 10\n", " EdnaFernandes 1.00 0.67 0.80 9\n", " EricAuchard 0.86 0.67 0.75 9\n", " FumikoFujisaki 1.00 1.00 1.00 10\n", " GrahamEarnshaw 0.59 1.00 0.74 10\n", " HeatherScoffield 0.83 0.56 0.67 9\n", " JanLopatka 0.38 0.33 0.35 9\n", " JaneMacartney 0.33 0.60 0.43 10\n", " JimGilchrist 0.73 1.00 0.84 8\n", " JoWinterbottom 0.90 0.90 0.90 10\n", " JoeOrtiz 0.80 0.89 0.84 9\n", " JohnMastrini 0.83 0.29 0.43 17\n", 
" JonathanBirt 0.53 1.00 0.70 8\n", " KarlPenhaul 0.87 1.00 0.93 13\n", " KeithWeir 0.69 0.90 0.78 10\n", " KevinDrawbaugh 0.88 0.70 0.78 10\n", " KevinMorrison 0.33 1.00 0.50 3\n", " KirstinRidley 0.86 0.67 0.75 9\n", "KouroshKarimkhany 0.54 0.88 0.67 8\n", " LydiaZajc 0.90 0.90 0.90 10\n", " LynneO'Donnell 0.89 0.73 0.80 11\n", " LynnleyBrowning 0.93 1.00 0.96 13\n", " MarcelMichelson 1.00 0.50 0.67 12\n", " MarkBendeich 0.86 0.55 0.67 11\n", " MartinWolk 0.57 0.80 0.67 5\n", " MatthewBunce 1.00 0.86 0.92 14\n", " MichaelConnor 0.83 0.77 0.80 13\n", " MureDickie 0.44 0.40 0.42 10\n", " NickLouth 0.83 1.00 0.91 10\n", " PatriciaCommins 0.80 0.89 0.84 9\n", " PeterHumphrey 0.35 0.89 0.50 9\n", " PierreTran 0.56 0.83 0.67 6\n", " RobinSidel 1.00 1.00 1.00 12\n", " RogerFillion 1.00 0.88 0.93 8\n", " SamuelPerry 0.78 0.50 0.61 14\n", " SarahDavison 1.00 0.29 0.44 14\n", " ScottHillis 0.44 0.44 0.44 9\n", " SimonCowell 0.91 1.00 0.95 10\n", " TanEeLyn 1.00 0.57 0.73 7\n", " TheresePoletti 0.80 0.73 0.76 11\n", " TimFarrand 1.00 0.77 0.87 13\n", " ToddNissen 0.60 1.00 0.75 9\n", " WilliamKazer 0.00 0.00 0.00 10\n", "\n", " accuracy 0.72 500\n", " macro avg 0.75 0.74 0.71 500\n", " weighted avg 0.77 0.72 0.71 500\n", "\n" ] } ], "source": [ "train_and_evaluate(clf, X_train, X_test, y_train, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save results" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['naive_bayes_results.pkl']" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joblib.dump(clf, 'naive_bayes_results.pkl')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predict out of sample:" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [], "source": [ "example_y, example_X = y_train[33], X_train[33]" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Actual author: AaronPressman\n", "Predicted author: AaronPressman\n" ] } ], "source": [ "print('Actual author:', example_y)\n", "print('Predicted author:', clf.predict([example_X])[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Support Vector Machines (SVM) [(to top)](#toc)" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [], "source": [ "from sklearn.svm import SVC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define pipeline" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [], "source": [ "clf_svm = Pipeline([\n", " ('vect', TfidfVectorizer(strip_accents='unicode',\n", " lowercase = True,\n", " max_features = 1500,\n", " stop_words='english'\n", " )),\n", " \n", " ('clf', SVC(kernel='rbf' ,\n", " C=10, gamma=0.3)\n", " ),\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note:* The SVC estimator is very sensitive to the hyperparameters!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Train and show evaluation stats" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy on training set:\n", "0.9975\n", "Accuracy on testing set:\n", "0.83\n", "Classification Report:\n", " precision recall f1-score support\n", "\n", " AaronPressman 0.89 0.89 0.89 9\n", " AlanCrosby 0.79 0.92 0.85 12\n", " AlexanderSmith 1.00 0.70 0.82 10\n", " BenjaminKangLim 0.67 0.36 0.47 11\n", " BernardHickey 1.00 0.50 0.67 10\n", " BradDorfman 0.78 0.88 0.82 8\n", " DarrenSchuettler 0.89 0.89 0.89 9\n", " DavidLawder 1.00 0.60 0.75 10\n", " EdnaFernandes 0.73 0.89 0.80 9\n", " EricAuchard 0.73 0.89 0.80 9\n", " FumikoFujisaki 1.00 1.00 1.00 10\n", " GrahamEarnshaw 0.83 1.00 0.91 10\n", " HeatherScoffield 0.80 0.89 0.84 9\n", " JanLopatka 0.67 0.44 0.53 9\n", " JaneMacartney 0.36 0.50 0.42 10\n", " JimGilchrist 0.88 0.88 0.88 8\n", " JoWinterbottom 1.00 0.90 0.95 10\n", " JoeOrtiz 0.82 1.00 0.90 9\n", " JohnMastrini 0.83 0.88 0.86 17\n", " JonathanBirt 0.80 1.00 0.89 8\n", " KarlPenhaul 0.93 1.00 0.96 13\n", " KeithWeir 0.83 1.00 0.91 10\n", " KevinDrawbaugh 0.82 0.90 0.86 10\n", " KevinMorrison 0.50 1.00 0.67 3\n", " KirstinRidley 1.00 0.56 0.71 9\n", "KouroshKarimkhany 0.88 0.88 0.88 8\n", " LydiaZajc 1.00 1.00 1.00 10\n", " LynneO'Donnell 0.82 0.82 0.82 11\n", " LynnleyBrowning 1.00 1.00 1.00 13\n", " MarcelMichelson 1.00 0.67 0.80 12\n", " MarkBendeich 0.79 1.00 0.88 11\n", " MartinWolk 0.83 1.00 0.91 5\n", " MatthewBunce 1.00 0.86 0.92 14\n", " MichaelConnor 1.00 0.85 0.92 13\n", " MureDickie 0.60 0.60 0.60 10\n", " NickLouth 0.90 0.90 0.90 10\n", " PatriciaCommins 1.00 1.00 1.00 9\n", " PeterHumphrey 0.57 0.89 0.70 9\n", " PierreTran 0.60 1.00 0.75 6\n", " RobinSidel 1.00 1.00 1.00 12\n", " RogerFillion 1.00 1.00 1.00 8\n", " SamuelPerry 0.81 0.93 0.87 14\n", " SarahDavison 1.00 0.71 0.83 14\n", " ScottHillis 0.67 0.44 0.53 9\n", " SimonCowell 1.00 0.90 0.95 10\n", " TanEeLyn 0.83 0.71 0.77 7\n", " TheresePoletti 0.91 0.91 0.91 11\n", " TimFarrand 0.92 0.85 0.88 13\n", " ToddNissen 0.90 1.00 0.95 9\n", " WilliamKazer 0.27 0.30 0.29 10\n", "\n", " accuracy 0.83 500\n", " macro avg 0.84 0.83 0.82 500\n", " weighted avg 0.85 0.83 0.83 500\n", "\n" ] } ], "source": [ "train_and_evaluate(clf_svm, X_train, X_test, y_train, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save results" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['svm_results.pkl']" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joblib.dump(clf_svm, 'svm_results.pkl')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predict out of sample:" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [], "source": [ "example_y, example_X = y_train[33], X_train[33]" ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Actual author: AaronPressman\n", "Predicted author: AaronPressman\n" ] } ], "source": [ "print('Actual author:', example_y)\n", "print('Predicted author:', clf_svm.predict([example_X])[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Selection and Evaluation [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both the `TfidfVectorizer` and `SVC()` estimator take a lot of 
hyperparameters. \n", "\n", "It can be difficult to figure out what the best parameters are.\n", "\n", "We can use `GridSearchCV` to help figure this out." ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "from sklearn.metrics import make_scorer\n", "from sklearn.metrics import f1_score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we define the options that should be tried out:" ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [], "source": [ "clf_search = Pipeline([\n", " ('vect', TfidfVectorizer()),\n", " ('clf', SVC())\n", "])\n", "parameters = { 'vect__stop_words': ['english'],\n", " 'vect__strip_accents': ['unicode'],\n", " 'vect__max_features' : [1500],\n", " 'vect__ngram_range': [(1,1), (2,2) ],\n", " 'clf__gamma' : [0.2, 0.3, 0.4], \n", " 'clf__C' : [8, 10, 12],\n", " 'clf__kernel' : ['rbf']\n", " }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run everything:" ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GridSearchCV(estimator=Pipeline(steps=[('vect', TfidfVectorizer()),\n", " ('clf', SVC())]),\n", " n_jobs=-1,\n", " param_grid={'clf__C': [8, 10, 12], 'clf__gamma': [0.2, 0.3, 0.4],\n", " 'clf__kernel': ['rbf'], 'vect__max_features': [1500],\n", " 'vect__ngram_range': [(1, 1), (2, 2)],\n", " 'vect__stop_words': ['english'],\n", " 'vect__strip_accents': ['unicode']},\n", " scoring=make_scorer(f1_score, average=micro))" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid = GridSearchCV(clf_search, \n", " param_grid=parameters, \n", " scoring=make_scorer(f1_score, average='micro'), \n", " n_jobs=-1\n", " )\n", "grid.fit(X_train, y_train) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note:* if you are on a powerful (preferably unix system) you can set n_jobs to the number of available threads to speed up the calculation" ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The best parameters are {'clf__C': 8, 'clf__gamma': 0.4, 'clf__kernel': 'rbf', 'vect__max_features': 1500, 'vect__ngram_range': (1, 1), 'vect__stop_words': 'english', 'vect__strip_accents': 'unicode'} with a score of 0.79\n", " precision recall f1-score support\n", "\n", " AaronPressman 0.89 0.89 0.89 9\n", " AlanCrosby 0.79 0.92 0.85 12\n", " AlexanderSmith 1.00 0.70 0.82 10\n", " BenjaminKangLim 0.67 0.36 0.47 11\n", " BernardHickey 1.00 0.50 0.67 10\n", " BradDorfman 0.78 0.88 0.82 8\n", " DarrenSchuettler 1.00 0.89 0.94 9\n", " DavidLawder 1.00 0.60 0.75 10\n", " EdnaFernandes 0.73 0.89 0.80 9\n", " EricAuchard 0.73 0.89 0.80 9\n", " FumikoFujisaki 1.00 1.00 1.00 10\n", " GrahamEarnshaw 0.83 1.00 0.91 10\n", " HeatherScoffield 0.82 1.00 0.90 9\n", " JanLopatka 0.60 0.33 0.43 9\n", " JaneMacartney 0.36 0.50 0.42 10\n", " JimGilchrist 0.88 0.88 0.88 8\n", " JoWinterbottom 1.00 0.90 0.95 10\n", " JoeOrtiz 0.82 1.00 0.90 9\n", " JohnMastrini 0.79 0.88 0.83 17\n", " JonathanBirt 0.80 1.00 0.89 8\n", " KarlPenhaul 0.93 1.00 0.96 13\n", " KeithWeir 0.83 1.00 0.91 10\n", " KevinDrawbaugh 0.82 0.90 0.86 10\n", " KevinMorrison 0.60 1.00 0.75 3\n", " KirstinRidley 1.00 0.56 0.71 9\n", "KouroshKarimkhany 0.88 0.88 0.88 8\n", " LydiaZajc 1.00 1.00 1.00 10\n", " LynneO'Donnell 0.82 0.82 0.82 11\n", " LynnleyBrowning 1.00 1.00 1.00 13\n", " 
MarcelMichelson 1.00 0.67 0.80 12\n", " MarkBendeich 0.73 1.00 0.85 11\n", " MartinWolk 0.83 1.00 0.91 5\n", " MatthewBunce 1.00 0.86 0.92 14\n", " MichaelConnor 1.00 0.85 0.92 13\n", " MureDickie 0.60 0.60 0.60 10\n", " NickLouth 0.90 0.90 0.90 10\n", " PatriciaCommins 1.00 1.00 1.00 9\n", " PeterHumphrey 0.57 0.89 0.70 9\n", " PierreTran 0.60 1.00 0.75 6\n", " RobinSidel 1.00 1.00 1.00 12\n", " RogerFillion 1.00 1.00 1.00 8\n", " SamuelPerry 0.76 0.93 0.84 14\n", " SarahDavison 1.00 0.71 0.83 14\n", " ScottHillis 0.67 0.44 0.53 9\n", " SimonCowell 1.00 0.90 0.95 10\n", " TanEeLyn 0.83 0.71 0.77 7\n", " TheresePoletti 0.90 0.82 0.86 11\n", " TimFarrand 0.92 0.85 0.88 13\n", " ToddNissen 0.90 1.00 0.95 9\n", " WilliamKazer 0.27 0.30 0.29 10\n", "\n", " accuracy 0.83 500\n", " macro avg 0.84 0.83 0.82 500\n", " weighted avg 0.85 0.83 0.82 500\n", "\n" ] } ], "source": [ "print(\"The best parameters are %s with a score of %0.2f\" % (grid.best_params_, grid.best_score_))\n", "y_true, y_pred = y_test, grid.predict(X_test)\n", "print(metrics.classification_report(y_true, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Unsupervised [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Latent Dirichilet Allocation (LDA) [(to top)](#toc)" ] }, { "cell_type": "code", "execution_count": 121, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import LatentDirichletAllocation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Vectorizer (using countvectorizer for the sake of example)" ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [], "source": [ "vectorizer = CountVectorizer(strip_accents='unicode',\n", " lowercase = True,\n", " max_features = 1500,\n", " stop_words='english', max_df=0.8)\n", "tf_large = vectorizer.fit_transform(clean_paragraphs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the LDA model" ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [], "source": [ "n_topics = 10\n", "n_top_words = 25" ] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [], "source": [ "lda = LatentDirichletAllocation(n_components=n_topics, max_iter=10,\n", " learning_method='online',\n", " n_jobs=-1)\n", "lda_fitted = lda.fit_transform(tf_large)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualize top words" ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [], "source": [ "def save_top_words(model, feature_names, n_top_words):\n", " out_list = []\n", " for topic_idx, topic in enumerate(model.components_):\n", " out_list.append((topic_idx+1, \" \".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])))\n", " out_df = pd.DataFrame(out_list, columns=['topic_id', 'top_words'])\n", " return out_df" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [], "source": [ "result_df = save_top_words(lda, vectorizer.get_feature_names(), n_top_words)" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
topic_idtop_words
01company pound million share group business bil...
12official south people north northern police ko...
23china kong hong beijing chinese deng year part...
34market percent price year china trade bank sta...
45percent year million profit analyst quarter sh...
56gold bre canada toronto stock busang canadian ...
67ford union oil strike car plant new worker pro...
78tonne year million crop new world export add a...
89company service new network computer internet ...
910bank financial fund nomura company loan firm s...
\n", "
" ], "text/plain": [ " topic_id top_words\n", "0 1 company pound million share group business bil...\n", "1 2 official south people north northern police ko...\n", "2 3 china kong hong beijing chinese deng year part...\n", "3 4 market percent price year china trade bank sta...\n", "4 5 percent year million profit analyst quarter sh...\n", "5 6 gold bre canada toronto stock busang canadian ...\n", "6 7 ford union oil strike car plant new worker pro...\n", "7 8 tonne year million crop new world export add a...\n", "8 9 company service new network computer internet ...\n", "9 10 bank financial fund nomura company loan firm s..." ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### pyLDAvis [(to top)](#toc)" ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import pyLDAvis\n", "import pyLDAvis.sklearn\n", "pyLDAvis.enable_notebook()" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "PreparedData(topic_coordinates= x y topics cluster Freq\n", "topic \n", "0 -0.069400 0.007189 1 1 17.328910\n", "4 -0.194081 -0.076833 2 1 16.906265\n", "8 -0.037245 -0.029790 3 1 14.556541\n", "2 0.185461 -0.015643 4 1 13.805324\n", "9 -0.000539 -0.006789 5 1 8.437634\n", "3 -0.075262 -0.067386 6 1 7.977310\n", "7 0.016381 -0.093135 7 1 7.187180\n", "6 0.030961 -0.014622 8 1 6.407350\n", "1 0.244465 0.041142 9 1 4.057657\n", "5 -0.100740 0.255867 10 1 3.335829, topic_info= Term Freq Total Category logprob loglift\n", "112 bank 3158.000000 3158.000000 Default 30.0000 30.0000\n", "218 china 3599.000000 3599.000000 Default 29.0000 29.0000\n", "731 kong 2394.000000 2394.000000 Default 28.0000 28.0000\n", "632 hong 2379.000000 2379.000000 Default 27.0000 27.0000\n", "959 percent 5501.000000 5501.000000 Default 26.0000 26.0000\n", "... ... ... ... ... ... ...\n", "1221 share 157.697294 3333.806371 Topic10 -4.6351 0.3493\n", "132 billion 130.387172 3254.094879 Topic10 -4.8253 0.1833\n", "1010 president 107.512722 887.953634 Topic10 -5.0182 1.2891\n", "959 percent 109.591783 5501.563943 Topic10 -4.9991 -0.5156\n", "590 government 105.267364 1978.737438 Topic10 -5.0393 0.4667\n", "\n", "[572 rows x 6 columns], token_table= Topic Freq Term\n", "term \n", "2 1 0.238596 abn\n", "2 2 0.079532 abn\n", "2 5 0.670342 abn\n", "5 1 0.013245 access\n", "5 2 0.013245 access\n", "... ... ... ...\n", "1496 7 0.008159 yuan\n", "1498 4 0.991199 zemin\n", "1499 6 0.052513 zinc\n", "1499 7 0.052513 zinc\n", "1499 10 0.886154 zinc\n", "\n", "[1967 rows x 3 columns], R=30, lambda_step=0.01, plot_opts={'xlab': 'PC1', 'ylab': 'PC2'}, topic_order=[1, 5, 9, 3, 10, 4, 8, 7, 2, 6])" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pyLDAvis.sklearn.prepare(lda, tf_large, vectorizer, n_jobs=-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Warning:** there is a small bug that when you show the `pyLDAvis` visualization it will hide some of the icons of JupyterLab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Neural Networks [(to top)](#toc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interested? Check out the Stanford course CS224n ([Page](http://web.stanford.edu/class/cs224n/index.html#schedule))! " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "48px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }