{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "# LDA Spike 2 - Counting\n", "
\n", "\n", "This notebook counts the occurrences of words in the cleaned the text files. By default the cleaned text files are expected to be found in the folder `Cleaned` and the count files are written into the folder `Counts`. We leave the counting to [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from `sklearn.feature_extraction.text`. The most time is spent for separating the matrix of all counts and storing the counts for each file separately. We invest this time so that the counts may easily be reviewed manually.\n", "\n", "A few examples at the end the notebook illustrate the result of the process.\n", "\n", "
\n", "__This notebook writes to and reads from your file system.__ By default all directories used are within `~/TextData/Abgeordnetenwatch`, where `~` stands for whatever your operating system considers your home directory. To change this configuration, either change the default values in the second code cell below or edit [LDA Spike - Configuration.ipynb](./LDA%20Spike%20-%20Configuration.ipynb) and run it before you run this notebook.\n", "\n", "
\n", "This notebook operates on text files. In our case we retrieved these texts from www.abgeordnetenwatch.de, guided by data that was made available under the [Open Database License (ODbL) v1.0](https://opendatacommons.org/licenses/odbl/1.0/) at that site.\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import time\n", "import random as rnd\n", "\n", "from pathlib import Path\n", "import json\n", "\n", "import numpy as np\n", "from sklearn.feature_extraction.text import CountVectorizer" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%store -r own_configuration_was_read\n", "if not('own_configuration_was_read' in globals()): raise Exception(\n", " '\\nReminder: You might want to run your configuration notebook before you run this notebook.' + \n", " '\\nIf you want to manage your configuration from each notebook, just remove this check.')\n", "\n", "%store -r project_name\n", "if not('project_name' in globals()): project_name = 'AbgeordnetenWatch'\n", "\n", "%store -r text_data_dir\n", "if not('text_data_dir' in globals()): text_data_dir = Path.home() / 'TextData'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "cleaned_dir = text_data_dir / project_name / 'Cleaned'\n", "counts_dir = text_data_dir / project_name / 'Counts'\n", "\n", "assert cleaned_dir.exists(), 'Directory should exist.'\n", "assert cleaned_dir.is_dir(), 'Directory should be a directory.'\n", "assert next(cleaned_dir.iterdir(), None) != None, 'Directory should not be empty.'\n", "\n", "counts_dir.mkdir(parents=True, exist_ok=True) # Creates a local directory!" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "update_only_missing_counts = True\n", "\n", "min_df = 3 # Ignore words that do not occure in at least in some documents. Helps to ignore misspelled words.\n", " # 3 is a rather low number that leads to a big vocabulary.\n", "max_df = 0.5 # Ignore words that are in the majority of documents. Helps to ignore regular phrases.\n", " # 0.5 still keeps words that occur in almost every second document." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "notebook_start_time = time.perf_counter()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the content of the cleaned text files" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Read 7696 documents: \"achim-kessler_die-linke_Q0001_2017-08-06_A01_2017-08-11_gesundheit\" ... \"zaklin-nastic_die-linke_Q0008_2017-10-25_A01_2018-09-24_demokratie-und-bürgerrechte\"\"\n" ] } ], "source": [ "filenames = []\n", "texts = []\n", "\n", "files = list(cleaned_dir.glob('*A*.txt')) # Answers\n", "list.sort(files)\n", "\n", "for file in files:\n", " filenames.append(file.stem)\n", " texts.append(file.read_text())\n", " \n", "print('Read {} documents: \"{}\" ... 
\"{}\"\"'.format(len(filenames), filenames[0], filenames[-1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Count the words\n", "\n", "See: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Counted 19670 unique words.\n", "Counting took 0.85s.\n" ] } ], "source": [ "counter_start_time = time.perf_counter()\n", "\n", "counter = CountVectorizer(analyzer='word', min_df=min_df, max_df=max_df, lowercase=False)\n", "\n", "word_counts = counter.fit_transform(texts)\n", "words = counter.get_feature_names()\n", "\n", "print('Counted {} unique words.'.format(len(words)))\n", "\n", "counter_end_time = time.perf_counter()\n", "print('Counting took {:.2f}s.'.format(counter_end_time - counter_start_time))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Write the word counts into separate files" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Dumping the word counts to files took 0.10s.\n" ] } ], "source": [ "dump_start_time = time.perf_counter()\n", "\n", "for doc, filename in enumerate(filenames):\n", "\n", " target_file = counts_dir / (filename + '.count')\n", " if update_only_missing_counts and target_file.exists(): continue\n", "\n", " counts = {}\n", " doc_word_counts = word_counts[doc, :]\n", " _, word_indices = word_counts[doc, :].nonzero()\n", "\n", " for word in word_indices:\n", " counts[words[word]] = str(doc_word_counts[0, word])\n", "\n", " target_file.write_text(json.dumps(counts, ensure_ascii=False, indent=0, sort_keys=True))\n", " print('\\rWrote ' + filename, end='')\n", "\n", "dump_end_time = time.perf_counter()\n", "print('\\nDumping the word counts to files took {:.2f}s.'.format(dump_end_time - dump_start_time))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Five most frequent words for some random documents" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tobias-pfluger_die-linke_Q0007_2: 2 \"entsprechen 2 \"Traditionse 1 \"bw\" 1 \"http\" 1 \"Mitglied\" \n", "katja-kipping_die-linke_Q0059_20: 4 \"Open\" 3 \"öffentlich\" 2 \"digitale\" 2 \"Source\" 2 \"sehen\" \n", "dr-daniela-de-ridder_spd_Q0002_2: 4 \"neue\" 2 \"gelingen\" 2 \"Mensch\" 2 \"de\" 2 \"finden\" \n", "irene-mihalic_die-grünen_Q0003_2: 12 \"Tierschutz\" 7 \"Tier\" 5 \"Massentierh 5 \"grün\" 5 \"wollen\" \n", "jorg-schneider_afd_Q0002_2017-09: 3 \"Bundeswehr\" 2 \"Ausstattung 2 \"materiell\" 1 \"hören\" 1 \"schnellen\" \n", "bernd-riexinger_die-linke_Q0005_: 4 \"sollen\" 2 \"persoenlich 2 \"gut\" 2 \"Gerechtigke 2 \"denken\" \n", "mahmut-ozdemir_spd_Q0010_2018-06: 3 \"Frage\" 2 \"engagieren\" 2 \"politische\" 2 \"sozialdemok 2 \"Gespräch\" \n" ] } ], "source": [ "# For slice the notation [from:to:step] see the\n", "# reference https://docs.python.org/3/library/stdtypes.html?highlight=slice%20notation#common-sequence-operations or the\n", "# explanation https://stackoverflow.com/questions/509211/understanding-pythons-slice-notation/509295#509295\n", "\n", "# For sorting with argsort see\n", "# https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html\n", "# https://docs.scipy.org/doc/numpy/reference/routines.sort.html\n", "\n", "\n", "sample_documents = rnd.sample(range(len(filenames)), 7)\n", 
"\n", "for doc in sample_documents:\n", "\n", " filename = filenames[doc]\n", " \n", " print('{:32.32}: '.format(filename), end ='')\n", " \n", " word_count = word_counts[doc, :].toarray().flatten()\n", " most_frequent = np.argsort(word_count)[:-6:-1]\n", " \n", " for word in most_frequent:\n", " print('{:4} {:12.12}'.format(word_counts[doc, word], '\"' + words[word] + '\"'), end = '')\n", " print('')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "------------------------------ Cleaned text: ------------------------------\n", "Dank Frage persönlich konkret neu Gesetz freuen vorangetrieben gut Schutz Bewohner Bahnstrecken gemeinsam Bürgerinitiativen entsprechend drucken erreichen laut Güterzüge Schiene fahren dürfen Betroffene Pankow konkret helfen Wahl deutsche Bundestag einsetzen Mieterinnen Mieter Energieversorgung gleich Recht einräumen Hausbesitzer Verabschiedung Mieterstrommodells erreichen mögen Aufstockung Städtebaufördermittel erwähnen SPD durchsetzen können mitteln können Wahlkreis u.a. Schule sanieren Gesetz zeigen Beharrlichkeit ständig Thematisieren Politik auszahlen Gesetz Mittelaufstockung Mehrheit absehbar freuen Gesetz Broschüre finden weit dingen Bundestag Wahlkreis erreichen können finden Download http://www.klaus-mindrup.de/content/pressebilder-downloads Rückfrage stehen Verfügung\n", "------------------------------ Word counts: ------------------------------\n", "{'Aufstockung': '1', 'Bahnstrecken': '1', 'Betroffene': '1', 'Bewohner': '1', 'Broschüre': '1', 'Bundestag': '2', 'Bürgerinitiativen': '1', 'Download': '1', 'Energieversorgung': '1', 'Frage': '1', 'Gesetz': '4', 'Güterzüge': '1', 'Hausbesitzer': '1', 'Mehrheit': '1', 'Mieter': '1', 'Mieterinnen': '1', 'Pankow': '1', 'Politik': '1', 'Recht': '1', 'Rückfrage': '1', 'SPD': '1', 'Schiene': '1', 'Schule': '1', 'Schutz': '1', 'Verabschiedung': '1', 'Verfügung': '1', 'Wahl': '1', 'Wahlkreis': '2', 'absehbar': '1', 'auszahlen': '1', 'content': '1', 'de': '1', 'deutsche': '1', 'dingen': '1', 'downloads': '1', 'drucken': '1', 'durchsetzen': '1', 'dürfen': '1', 'einräumen': '1', 'einsetzen': '1', 'entsprechend': '1', 'erreichen': '3', 'erwähnen': '1', 'fahren': '1', 'finden': '2', 'freuen': '2', 'gemeinsam': '1', 'gleich': '1', 'gut': '1', 'helfen': '1', 'http': '1', 'klaus': '1', 'konkret': '2', 'laut': '1', 'mindrup': '1', 'mitteln': '1', 'mögen': '1', 'neu': '1', 'persönlich': '1', 'sanieren': '1', 'stehen': '1', 'ständig': '1', 'vorangetrieben': '1', 'weit': '1', 'www': '1', 'zeigen': '1'}\n", "------------------------------ Words not counted: ------------------------------\n", "Dank, Mieterstrommodells, Städtebaufördermittel, können, können, u.a., Beharrlichkeit, Thematisieren, Mittelaufstockung, können, http://www.klaus-mindrup.de/content/pressebilder-downloads\n", "(We did not count words that are in less than 3 documents or in more than 50% of the documents.)\n" ] } ], "source": [ "min_len = 400\n", "max_len = 800\n", "example_text = ''\n", "\n", "while (len(example_text) < min_len or len(example_text) > max_len):\n", " example = rnd.randint(0, len(texts))\n", " example_text = texts[example]\n", "\n", "print(30 * '-' + ' Cleaned text: ' + 30 * '-')\n", "print(example_text)\n", "\n", "print(30 * '-' + ' Word counts: ' + 30 * '-')\n", "counts = json.loads((counts_dir / (filenames[example] + '.count')).read_text())\n", "print(counts)\n", "\n", "print(30 * '-' + ' Words not counted: ' + 30 * '-')\n", "print(', '.join([word for 
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " Runtime of the notebook \n", "-------------------------\n", " 0.85s Counting the words\n", " 0.10s Dumping the word counts to files\n", " 4.00s All calculations together\n" ] } ], "source": [ "notebook_end_time = time.perf_counter()\n", "\n", "print()\n", "print(' Runtime of the notebook ')\n", "print('-------------------------')\n", "print('{:8.2f}s Counting the words'.format(\n", " counter_end_time - counter_start_time))\n", "print('{:8.2f}s Dumping the word counts to files'.format(\n", " dump_end_time - dump_start_time))\n", "print('{:8.2f}s All calculations together'.format(\n", " notebook_end_time - notebook_start_time))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", " \"Creative\n", " \n", " © D. Speicher, T. Dong
\n", " Licensed under a \n", " \n", " CC BY-NC 4.0\n", " .\n", "
\n", " Acknowledgments:\n", " This material was prepared within the project\n", " \n", " P3ML\n", " \n", " which is funded by the Ministry of Education and Research of Germany (BMBF)\n", " under grant number 01/S17064. The authors gratefully acknowledge this support.\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }