{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is focused on words. For a specific language, you can:\n", "- [Find all sentences containing a specific word](#specific_word)\n", "- [Checking how many sentences contain a word from a given list of words](#how_many_sentences)\n", "- [Conduct a word analysis](#word_analysis) for a specific language. This analysis will give you the frequency of the words used in the corpus. You can, for example, extract a list of words that appear only once. **/!\\ This is still in beta version and difficult (or impossible) to use out of the box for most languages /!\\**\n", "\n", "Before experimenting with any of the options described above, it is necessary to set and execute the cells under the [Read sentences section](#read_sentences).\n", "\n", "If you're new to Jupyter, please click on `Cell > Run All` from the top menu to see what the notebook does. You should see that cells that are running have an `In[*]` that will become `In[n]` when their execution is finished (`n` is a number). To run a specific cell, click in it and press `Shift + Enter` or click the `Run` button of the top menu. Note that some cells, such as those that define a function, will not have output, but still need to be executed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In any case, to be able to use the notebook correctly, please run the two following cells first." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import nltk\n", "import csv\n", "import tarfile\n", "import string\n", "from collections import Counter\n", "from nltk import RegexpTokenizer\n", "nltk.download('stopwords')\n", "nltk.download('punkt')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_colwidth', -1) # To display full content of the column\n", "# pd.set_option('display.max_rows', None) # To display ALL rows of the dataframe (otherwise you can decide the max number)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Read sentences" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading all sentences takes a long time, so let's split the process into two steps. You only need to run the two following cells once." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat sentences_detailed.tar.bz2.part* > sentences_detailed.tar.bz2\n", "def read_sentences_file():\n", " with tarfile.open('./sentences_detailed.tar.bz2', 'r:*') as tar:\n", " csv_path = tar.getnames()[0]\n", " sentences = pd.read_csv(tar.extractfile(csv_path), \n", " sep='\\t', \n", " header=None, \n", " names=['sentenceID', 'ISO', 'Text', 'Username', 'Date added', 'Date last modified'],\n", " quoting=csv.QUOTE_NONE)\n", " print(f\"{len(sentences):,} sentences fetched.\")\n", " return sentences" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_sentences = read_sentences_file()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, you can fetch sentences of a specific language using the following cells. If you want to change your target language, you can start again from here.\n", "\n", "Note that by default, we get rid of the `ISO` (that is, ISO 639 three-letter language code), `Date added`, `Date last modified`, and `Username` columns. 
\n", "If you need any of these columns, you can comment out the lines you need by adding a `#` at the beginning of the corresponding lines of the next cell.\n", "\n", "So run the following cell" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def sentences_of_language(sentences, language):\n", " target_sentences = sentences[sentences['ISO'] == language]\n", " del target_sentences['Date added']\n", " del target_sentences['Date last modified']\n", " del target_sentences['ISO']\n", " del target_sentences['Username']\n", " target_sentences = target_sentences.set_index(\"sentenceID\")\n", " print(f\"{len(target_sentences):,} sentences fetched.\")\n", " return target_sentences" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Choose your target `language` as a 3-letter ISO code (`cmn`, `fra`, `jpn`, `eng`, etc.), and run the next one." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "language = 'fra' # <-- Modify this value\n", "sentences = sentences_of_language(all_sentences, language)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, the variable `sentences` contains the sentences of the language you specified. Wanna check? The following cell displays five random sentences in your set, just for a quick check." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "sentences.sample(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, you can see two columns: `sentenceID` and `Text`. `sentenceID` is the same as on Tatoeba, so you can easily access that sentence page there." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To only get the text of sentence with a specific `sentenceID`, use the following syntax `sentences.loc[].Text` as in the following cell.\n", "\n", "Note: if `` is not in your `sentences` set, you will get an error." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentences.loc[1115].Text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Sentences containing a specific word" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As its name indicates, you can use this section to fetch sentences containing a specific word. \n", "\n", "Run the following cell (you don't have to modify it)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_sentences(word, sentences):\n", " frame = sentences[sentences['Text'].str.contains(word)]\n", " frame = frame.append(sentences[sentences['Text'].str.contains(word.capitalize())])\n", " frame = frame.append(sentences[sentences['Text'].str.contains(word.upper())])\n", " frame = frame.append(sentences[sentences['Text'].str.contains(word.lower())])\n", " return frame.drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Choose the `word` you want to search, run the cell, and all sentences (from your current sentences set) containing your word will be displayed. \n", "\n", "The occurrences that will match are your word with any capitalization AND any words containing it. \n", "\n", "For example, if you look for `beauty`, sentences containing `Beauty` will also match. \n", "If you look for `sho`, sentences containing `shoot`, `short`, `should` and every word containing `sho` will also match. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "If you only want case-sensitive matches (no capitalization variants, but still matching any word that contains your word), you can use the following cell." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "word = \"exemple\" # <-- sentences containing this exact (case-sensitive) string will be fetched\n", "sentences[sentences['Text'].str.contains(word)]" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"how_many_sentences\"></a>\n", "# Checking how many sentences contain words from a list" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Suppose that you want to check how many sentences contain a specific word. You could use `get_sentences` above and count the results. However, if you have several words in mind and only want to know how many sentences contain each of them, you can use this section.\n", "\n", "/!\\ Currently, only sentences matching your word **exactly** (case-sensitive, no capitalization variants) are counted /!\\" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Run the following cell (you don't have to modify it)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def how_many_sentences(word_list, sentences):\n", "    for w in word_list:\n", "        count = len(sentences[sentences['Text'].str.contains(w)])\n", "        print(f\"{w}\t\t{count}\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Then, replace `word_list` with the words you are interested in. \n", "Do not forget the brackets and the quotes: the format should be `word_list = [\"word1\", \"word2\", ..., \"wordn\"]`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "word_list = [\"manger\", \"skis\", \"mirage\", \"oasis\"] # <-- Modify these values\n", "how_many_sentences(word_list, sentences)" ] },
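{ "cell_type": "markdown", "metadata": {}, "source": [ "If you also want to count the capitalization variants of each word, here is a small case-insensitive variant (an addition to the original workflow; the name `how_many_sentences_any_case` is ours). It passes `case=False` to `str.contains`, so `manger`, `Manger`, and `MANGER` are all counted." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def how_many_sentences_any_case(word_list, sentences):\n", "    for w in word_list:\n", "        # case=False makes the substring match case-insensitive\n", "        count = len(sentences[sentences['Text'].str.contains(w, case=False)])\n", "        print(f\"{w}\t\t{count}\")\n", "\n", "how_many_sentences_any_case(word_list, sentences)" ] },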
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "words_list = [\"manger\", \"skis\", \"mirage\", \"oasis\"] # <-- Modify these values\n", "n = 10 # <-- Modify this threshold\n", "how_many_sentences_under_threshold(words_list, n, sentences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", " \n", "\n", " \n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Word analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section is a little bit complex and may be challenging to configure as you like. Basically it runs an analysis of word frequency in your (current) set of sentences." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to ignore some symbols, like punctuation. \n", "Run the following cell to see some standard symbols to ignore." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "string.punctuation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should add punctuation specific to your target language to `additional_punctuation` in the cell below. The format is a little bit complicated. You should enclose single quotes between double, and double quotes between single. \n", "\n", "Run the following cell, and if an error occurred, it is probably because one of your enclosing quotation marks is not correct." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "additional_punctuation = ['``', \"''\", '``', \"''\", '...', '’', '``', \"''\", '«', '»',]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell will display a list of words that will be ignored. Those are common [stop words](https://en.wikipedia.org/wiki/Stop_words) PLUS all the punctuation symbols defined above. \n", "If you're not happy with this list, you can limit it to only punctuation by removing `nltk.corpus.stopwords.words()`, or extend it by adding another list to `useless_words`\n", "\n", "This list of stop words uses the `stopwords` corpus of the nltk package. Note that a limited number of languages are available. Currently available are \n", "`arabic`, `azerbaijani`, `danish`, `dutch`, `english`, `finnish`, `french`, `german`, `greek`, `hungarian`, `indonesian`, `italian`, `kazakh`, `nepali`, `norwegian`, `portuguese`, `romanian`, `russian`, `slovenian`, `spanish`, `swedish`, `tajik`, `turkish`\n", "\n", "Run the cell and see what words will be ignored!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "language = \"french\" # <-- Modify this value\n", "useless_words = nltk.corpus.stopwords.words(language)\n", "useless_words += list(string.punctuation)\n", "useless_words += additional_punctuation\n", "useless_words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All right! A little bit more preparation :) \n", "\n", "Run the following cell (you don't have to modify it). It creates a list of all non-ignored words present in your (current) set of sentences, so it may take some time!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# List of words in sentences['Text']\n", "texts = [word for word in sentences['Text']]\n", "all_words = [word for text in texts for word in nltk.word_tokenize(text)]\n", "# \"Raw\" number of words\n", "print(f'{len(all_words):,} tokens')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Probably the hardest cell to configure correctly. We want to get rid of language-specific symbols that get between words and hinder the analysis. For example, the \"apostrophe\" in French. The best option is to comment out the `toknizer` you don't need and use one specific to your language. \n", "\n", "/!\\ We hope to provide correct `toknizer` for several languages out of the box in the future, but that is unfortunately not the case for now /!\\\n", "\n", "Run the following cell after setting a correct `toknizer`. It will display the number of words after filtering (removing digits, splitting at apostrophes, etc.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Using a RegexpTokenizer to improve tokenizing of French sentences.\n", "# We want to split at apostrophes.\n", "toknizer = RegexpTokenizer(r\"''\\w'|\\w+|[^\\w\\s]''\")\n", "filtered_words = [word.lower() for text in texts for word in toknizer.tokenize(text) if not word.lower() in useless_words]\n", "# Filter numbers written with digits\n", "filtered_words = [word for word in filtered_words if not word.isdigit()]\n", "# Number of filtered words\n", "print(f'{len(filtered_words):,} filtered words')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell simply prints the number of *unique* words." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Number of unique words\n", "print(f'{len(set(filtered_words)):,} unique words')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally over!\n", "\n", "The following cells gives you the most common words along with how many times they appear! If you know how to use `Counter` you can probably get more information :)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "word_counter = Counter(filtered_words)\n", "most_common_words = word_counter.most_common()\n", "most_common_words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By running the following cell, you can see a frequency by word rank graph. Well, the number of times the most common words appear is given by the cell above, and it's very likely that the number of times the less common words appear is 1 :)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "sorted_word_counts = sorted(list(word_counter.values()), reverse=True)\n", "\n", "plt.loglog(sorted_word_counts)\n", "plt.ylabel(\"Freq\")\n", "plt.xlabel(\"Word Rank\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell puts the list of common words and their occurrence count into a dataframe that is easier to use, and more beautiful to see. Make sure to run it if you want to go to the following section!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "df = pd.DataFrame.from_dict(word_counter, orient='index')\n", "df = df.rename(columns={'index':'word', 0:'count'})\n", "df = df.sort_values(by='count', ascending=False)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Words that appear a certain number of times" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the complexity of the previous section, this section will be easy :) Mainly, we play with the dataframe we created above: slice it, display it, from the top, from the bottom, ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell gives you the list of words that appear only once. You don't have to modify it except if you prefer to see words that appear not once but twice, or any number of times." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "unique_words = df[df['count'] == 1] # <-- You may modify the 1 here. You can also use >, <, >= and <= signs instead of ==\n", "unique_words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `df.head(n)` for the `n` most used words. \n", "Use `df.tail(n)` for `n` of the less used words. \n", "You can use `df[m:n]` for the words between the m-th and n-th most used.\n", "\n", "For example, you can use this to go through words that are used only once to quickly find typos or erroneous words. First check the words that are used only once by `df.tail(n)`, then use `sentences[sentences['Text'].str.contains(word)]` with the words you fetched. That way, you can quickly check the sentence containing that word." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First ten elements\n", "df.head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# From 11th to 20th\n", "df[10:20]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Last ten elements\n", "df.tail(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# From 15th to the last until 10th to the last\n", "df[len(df)-15:len(df)-10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following display the 15 least used words along with the sentences that contain them. Notice however that it is a simplistic approach that may not exactly return what you want. If the word is `cat`, this will return `Cat`, `cats`, and so on.\n", "\n", "You have to run the cell containing the [definition of the `get_sentences` \n", "function](#specific_word) if you want the cell to work. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "The following cell displays the 15 least used words along with the sentences that contain them. Note, however, that this is a simplistic approach that may not return exactly what you want: if the word is `cat`, it will also match `Cat`, `cats`, and so on.\n", "\n", "You have to run the cell containing the [definition of the `get_sentences` function](#specific_word) for this cell to work. Otherwise, you'll get the error \n", "`NameError: name 'get_sentences' is not defined`" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n = 15 # <-- Modify this\n", "target_slice = most_common_words[-n:] # <-- The n least common words; modify this slice if you need\n", "check_list = [t[0] for t in target_slice]\n", "for word in check_list:\n", "    print(word)\n", "    display(get_sentences(word, sentences))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }