<h1 style="background-color:#0071BD;color:white;text-align:center;padding-top:0.8em;padding-bottom: 0.8em">
  LDA Spike 3 - Latent Dirichlet Allocation
</h1>

This notebook applies Latent Dirichlet Allocation (LDA) to the word counts. By default, the word count files are expected in the folder `Counts`. At the end of this notebook, the topic model is written to the folder `Topics`. We use the [LatentDirichletAllocation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) implementation from `sklearn.decomposition`.

To illustrate the result we show the following:

  * For each topic, the words that are most probable within this topic. We limit both the number of words and the cumulated probability of the words that we show. A few topics are characterized by just one to three words.
  * For a few documents, the list of topics that make up most of the document. The algorithm typically finds a document to be composed of a few topics.
  * There are documents that are nevertheless dominated by one topic. We list some of these documents for some topics.
  * For each topic, we give up to three documents that are strongly focused on the topic. Only for a few topics did we not find such documents. You may follow the links to the original documents to check whether they indeed cover the same topic.

Visualizations are found in the next notebook.

<font color="darkred" /><p/>

__This notebook writes to and reads from your file system.__ By default, all directories used are within `~/TextData/AbgeordnetenWatch`, where `~` stands for whatever your operating system considers your home directory. To change this configuration, either change the default values in the second code cell below or edit [LDA Spike - Configuration.ipynb](./LDA%20Spike%20-%20Configuration.ipynb) and run it before you run this notebook.

<font color="black" /><p/>

This notebook operates on word counts extracted from text files. In our case, we retrieved these texts from www.abgeordnetenwatch.de, guided by data that was made available under the [Open Database License (ODbL) v1.0](https://opendatacommons.org/licenses/odbl/1.0/) at that site.

<p style="background-color:#66A5D1;padding-top:0.2em;padding-bottom: 0.2em" />

In [1]:
import time
import random as rnd

from operator import itemgetter

from pathlib import Path
import json
import joblib

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import LatentDirichletAllocation

In [2]:
%store -r own_configuration_was_read
if not('own_configuration_was_read' in globals()): raise Exception(
    '\nReminder: You might want to run your configuration notebook before you run this notebook.' + 
    '\nIf you want to manage your configuration from each notebook, just remove this check.')

%store -r project_name
if not('project_name' in globals()): project_name = 'AbgeordnetenWatch'

%store -r text_data_dir
if not('text_data_dir' in globals()): text_data_dir = Path.home() / 'TextData'

In [3]:
corpus_dir = text_data_dir / project_name / 'Corpus'
counts_dir = text_data_dir / project_name / 'Counts'
topics_dir = text_data_dir / project_name / 'Topics'

assert corpus_dir.exists(),                          'Directory should exist.'
assert corpus_dir.is_dir(),                          'Directory should be a directory.'
assert next(corpus_dir.iterdir(), None) is not None, 'Directory should not be empty.'

assert counts_dir.exists(),                          'Directory should exist.'
assert counts_dir.is_dir(),                          'Directory should be a directory.'
assert next(counts_dir.iterdir(), None) is not None, 'Directory should not be empty.'

topics_dir.mkdir(parents=True, exist_ok=True) # Creates a local directory!

In [4]:
n_topics = 100
max_iter = 200 
evaluate_every = 3  # -1 = never
verbosity_level = 1 # 0 = no output, 1 = iteration and perplexity, 2 = plus jobs and timing
n_jobs = 3

In [5]:
notebook_start_time = time.perf_counter()

## Load the word counts

The code in the next cell might look a bit tricky. Please ignore it unless you really want to understand this side aspect. It has nothing to do with LDA, only with constructing a [Compressed Sparse Row matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html).
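
To make the three arrays passed to `csr_matrix` less opaque, here is a minimal sketch on two toy documents. The words and counts are invented for illustration, and the names `toy_docs` and `toy_matrix` are not part of the notebook. `counts` holds the non-zero values, `word_indices` holds the column of each value, and `next_doc_ptr` marks where the entries of each document start and end.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy word counts for two documents (invented for illustration).
toy_docs = [{'Rente': 2, 'Beitrag': 1}, {'Beitrag': 3, 'Maut': 1}]

vocabulary, next_doc_ptr, word_indices, counts = {}, [0], [], []
for doc_word_counts in toy_docs:
    for word, count in doc_word_counts.items():
        # A word seen for the first time gets the next free column index.
        word_indices.append(vocabulary.setdefault(word, len(vocabulary)))
        counts.append(count)
    # Row i occupies the slice next_doc_ptr[i]:next_doc_ptr[i + 1].
    next_doc_ptr.append(len(word_indices))

toy_matrix = csr_matrix((counts, word_indices, next_doc_ptr), dtype=np.int64)
print(vocabulary)
print(toy_matrix.toarray())
```

Each row of the dense view is one document and each column one vocabulary entry, which is the layout that `LatentDirichletAllocation` expects as input.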

In [6]:
load_start_time = time.perf_counter()

document_names = []
vocabulary = {}
next_doc_ptr = [0]
word_indices = []
counts = []

files = sorted(counts_dir.iterdir())

for source_file in files:

    print('\rReading {:90.90}'.format(source_file.stem), end='')
    
    document_names.append(source_file.stem)
    doc_word_counts = json.loads(source_file.read_text())
    
    for word, count in doc_word_counts.items():
        word_idx = vocabulary.setdefault(word, len(vocabulary))
        word_indices.append(word_idx)
        counts.append(count)
    
    next_doc_ptr.append(len(word_indices))

word_counts = csr_matrix((counts, word_indices, next_doc_ptr), dtype=np.int64)
words       = [w for w, i in sorted(vocabulary.items(), key=itemgetter(1))]

load_end_time = time.perf_counter()
print('Loading the word counts from {:d} files took {:.2f}s.'.format(len(document_names), load_end_time - load_start_time))


## Latent Dirichlet Allocation

We instantiate the [LDA algorithm](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html), passing the configuration parameters. As the given text base covers quite a lot of topics (in the intuitive sense), we only got reasonable results when looking for at least 100 topics (in the sense of LDA). The results with batch learning were much better than with online learning, but we did not explore whether online learning could be rescued by parameter tuning. With `fit_transform` we estimate the probabilities of the words within topics and the shares of the topics within the documents at the same time. `fit` without transform would do the same but throw away the topic-in-document distribution. `transform` after a previous `fit` would find the topic shares without adapting the word-in-topic distribution. The latter could even be used for previously unseen documents.
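
The difference between `fit_transform` and a later `transform` can be illustrated on random toy data. This is only a sketch: the matrix sizes, the number of topics, and the fixed `random_state` are assumptions made for reproducibility and are not the settings used in this notebook.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
train_counts = rng.integers(0, 5, size=(20, 12))   # 20 toy documents, 12 words
unseen_counts = rng.integers(0, 5, size=(3, 12))   # 3 documents not used for fitting

toy_lda = LatentDirichletAllocation(n_components=4, learning_method='batch',
                                    max_iter=10, random_state=0)

# fit_transform learns the word-in-topic distribution and returns the topic shares.
train_shares = toy_lda.fit_transform(train_counts)

# transform only infers topic shares; the fitted components_ are left untouched.
components_before = toy_lda.components_.copy()
unseen_shares = toy_lda.transform(unseen_counts)

print(train_shares.shape, unseen_shares.shape)
print(np.array_equal(components_before, toy_lda.components_))
```

Each row of `train_shares` and `unseen_shares` sums to one, so it can be read as the share of each topic within the corresponding document.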

In [7]:
lda_start_time = time.perf_counter()

lda_algorithm = LatentDirichletAllocation(n_components = n_topics, learning_method='batch', max_iter = max_iter, 
                                          n_jobs=n_jobs, evaluate_every=evaluate_every, verbose=verbosity_level)

topic_model   = lda_algorithm.fit_transform(word_counts)

lda_end_time = time.perf_counter()
print('Latent Dirichlet Allocation took {:.2f}s.'.format(lda_end_time - lda_start_time))

iteration: 1 of max_iter: 200
iteration: 2 of max_iter: 200
iteration: 3 of max_iter: 200, perplexity: 6834.2921
iteration: 4 of max_iter: 200
iteration: 5 of max_iter: 200
iteration: 6 of max_iter: 200, perplexity: 4896.6652
iteration: 7 of max_iter: 200
iteration: 8 of max_iter: 200
iteration: 9 of max_iter: 200, perplexity: 4277.6493
iteration: 10 of max_iter: 200
iteration: 11 of max_iter: 200
iteration: 12 of max_iter: 200, perplexity: 3989.7329
iteration: 13 of max_iter: 200
iteration: 14 of max_iter: 200
iteration: 15 of max_iter: 200, perplexity: 3831.3961
iteration: 16 of max_iter: 200
iteration: 17 of max_iter: 200
iteration: 58 of max_iter: 200
iteration: 59 of max_iter: 200
iteration: 60 of max_iter: 200, perplexity: 3454.9064
iteration: 61 of max_iter: 200
iteration: 62 of max_iter: 200
iteration: 63 of max_iter: 200, perplexity: 3451.2249
iteration: 64 of max_iter: 200
iteration: 65 of max_iter: 200
iteration: 66 of max_iter: 200, perplexity: 3448.3414
iteration: 67 of ma

In [8]:
print('The topic model has the shape {}.'.format(topic_model.shape))
print('corresponding to {} documents and {} topics.'.format(len(document_names), n_topics))
print('The rank of the matrix is {}.'.format(np.linalg.matrix_rank(topic_model)))

The topic model has the shape (7696, 100).
corresponding to 7696 documents and 100 topics.
The rank of the matrix is 100.


In [9]:
print('The probability distribution of words per topic has shape {}'.format(lda_algorithm.components_.shape))

The probability distribution of words per topic has shape (100, 19774)


## Dominant words per topic
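
The rows of `components_` are unnormalized pseudo-counts, so each row is divided by its sum before it can be read as a probability distribution over words. The sketch below shows this normalization and the cut-off by word count and cumulated probability on a toy matrix. The values, `toy_words`, and the helper `top_words` are invented for illustration.

```python
import numpy as np

# Toy stand-in for lda_algorithm.components_: 2 topics, 4 words (invented values).
toy_components = np.array([[6.0, 1.0, 2.5, 0.5],
                           [1.0, 4.0, 2.0, 3.0]])
toy_words = ['EU', 'Rente', 'Maut', 'Wolf']

# Divide each row by its sum so that every topic becomes a probability distribution.
toy_words_per_topic = toy_components / toy_components.sum(axis=1, keepdims=True)

def top_words(topic_row, max_words, max_cumulated):
    """Most probable words, stopping once the cumulated probability exceeds the limit."""
    order = np.argsort(topic_row)[::-1][:max_words]
    selected = []
    for idx, cumulated in zip(order, topic_row[order].cumsum()):
        selected.append(toy_words[idx])
        if cumulated > max_cumulated:
            break
    return selected

print(top_words(toy_words_per_topic[0], max_words=3, max_cumulated=0.7))
```

For the first toy topic the probabilities are 0.6, 0.1, 0.25 and 0.05, so the loop stops after the second word, once the cumulated probability of 0.85 exceeds the limit of 0.7.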


In [10]:
show_max_words = 6
show_max_cummulated_probability = 0.15

topic_descriptions = []
words_per_topic    = lda_algorithm.components_ / lda_algorithm.components_.sum(axis=1)[:, np.newaxis]

for topic in range(n_topics):
    
    description = ''

    print('\n{:2}: '.format(topic), end='')
    most_probable = np.argsort(words_per_topic[topic, :])[:-show_max_words-1:-1]
    probabilities = words_per_topic[topic, most_probable]
    
    for word, probability, cummulated in zip(most_probable, probabilities, probabilities.cumsum()):
        the_word = words[word] 
        description = description + the_word + ', '
        print('{:.1%} {} '.format(probability, the_word), end = '')
        if cummulated > show_max_cummulated_probability: break

    description = description + '...'
    topic_descriptions.append(description)


 0: 8.3% EU 4.3% europäisch 3.3% europäische 
 1: 2.1% Energie 1.6% Deutschland 1.1% Klimaschutz 1.1% Energiewende 1.1% wollen 0.9% müssen 
 2: 5.8% Cannabis 1.9% Konsum 1.9% Legalisierung 1.3% Droge 1.1% Jugendliche 0.9% Bundestag 
 3: 3.3% Ehe 1.6% Frage 1.3% Mensch 0.7% Jahr 0.7% Gesellschaft 0.7% Diskriminierung 
 4: 5.9% AfD 5.2% Frage 1.8% Antwort 1.2% Partei 1.2% stellen 
 5: 4.2% Migration 2.8% Pakt 2.1% Deutschland 1.5% Migrationspakt 1.2% Staat 1.2% Migranten 
 6: 4.8% Cum 3.1% Ex 1.4% Geschäft 1.3% Banken 1.2% Skandal 1.1% Staatsanwaltschaft 
 7: 2.7% Maut 2.3% spdfraktion 1.3% geben 1.1% deutsch 1.0% SPD 0.8% Datum 
 8: 4.8% Rente 2.3% Rentenversicherung 2.0% alt 1.9% gesetzlich 1.8% Beitrag 1.6% gesetzliche 
 9: 1.7% Wolf 1.3% Deutschland 1.0% Rückkehr 0.9% gelten 0.9% Leistung 0.9% Wolfes 
10: 2.0% Hartz 1.5% Leistung 1.3% Sozialleistungen 1.2% Bürgergeld 1.2% IV 1.1% Mensch 
11: 5.1% Bundeswehr 2.1% Einsatz 2.0% NATO 2.0% Soldat 1.6% Sicherheit 1.3% Jahr 
12: 3.4% Demok

## Dominant topics for some example documents


In [11]:
show_max_topics = 7
show_max_cummulated_probability = 0.75
sample_documents = rnd.sample(range(len(document_names)), 5)

for doc in sample_documents:

    print('\n', document_names[doc], '\n')
    most_probable = np.argsort(topic_model[doc, :])[:-show_max_topics-1:-1]

    cummulated = 0
    for topic in most_probable:
        probability = topic_model[doc, topic]
        print('{:6.2%} {:2} {}'.format(probability, topic, topic_descriptions[topic]))
        cummulated = cummulated + probability
        if cummulated > show_max_cummulated_probability: break


 andrea-nahles_spd_Q0171_2018-03-27_A01_2018-05-02_kinder-und-jugend 

56.55% 11 Bundeswehr, Einsatz, NATO, Soldat, Sicherheit, Jahr, ...
34.14% 39 de, ...

 dr-matthias-miersch_spd_Q0015_2018-03-13_A01_2018-03-23_demokratie-und-bürgerrechte 

22.14% 99 SPD, CDU, CSU, ...
19.78% 44 Frage, persönlich, Gespräch, finden, Termin, Büro, ...
13.40% 23 Mensch, wollen, gut, brauchen, müssen, Frage, ...
11.15% 18 Inhalt, Medium, Aussage, Äußerung, Meinung, Kritik, ...
11.02% 17 Bundestag, Fraktion, deutsche, Gesetz, Antrag, Jahr, ...

 manuel-sarrazin_die-grünen_Q0002_2017-08-12_A01_2017-08-18_arbeit 

47.80% 65 Bundesminister, Schmidt, Landwirtschaft, Ernährung, Christian, Glyphosat, ...
18.52% 89 Landwirtschaft, wollen, ökologisch, Tier, gut, Produkt, ...
17.28% 59 bundestag, grün, dip21, de, ...

 marc-biadacz_cdu_Q0009_2018-09-04_A01_2018-09-05_demokratie-und-bürgerrechte 

21.41% 77 de, anfragen, direkt, Mail, ...
21.37% 72 Bürgerin, Bürger, bürgern, Abgeordnete, Wahlkreis, Gespräch, ...


## Documents dominated by some example topics
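
The listings in this section rely on the document names consisting of seven underscore-separated fields. The following sketch makes that structure explicit. The field names (in particular reading the two dates as question date and answer date) are our interpretation of the file names, and `DocumentName` and `parse_document_name` are hypothetical helpers that are not used elsewhere in the notebook.

```python
from typing import NamedTuple

class DocumentName(NamedTuple):
    # Field names are an assumed reading of the file name pattern.
    name: str
    party: str
    question: str
    question_date: str
    answer: str
    answer_date: str
    category: str

def parse_document_name(document_name):
    """Split a document name into its seven underscore-separated fields."""
    return DocumentName(*document_name.split('_'))

parsed = parse_document_name(
    'achim-kessler_die-linke_Q0001_2017-08-06_A01_2017-08-11_gesundheit')
print(parsed.name, parsed.party, parsed.answer_date, parsed.category)
```

A name with a different number of fields raises a `TypeError`, which surfaces malformed file names early instead of silently misaligning the columns.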

In [12]:
show_max_documents = 20
show_min_probability = 0.75
sample_topics = rnd.sample(range(n_topics), 5)
list.sort(sample_topics)

for topic in sample_topics:
   
    print('{:2} {}'.format(topic, topic_descriptions[topic]))
    most_focussed = np.argsort(topic_model[:, topic])[:-show_max_documents-1:-1]
    most_focussed = [doc for doc in most_focussed if topic_model[doc, topic] >= show_min_probability]

    if not most_focussed:
        print('   No document has this topic at {:.0%} or more.'.format(show_min_probability))
        continue
    
    for doc in most_focussed:
        probability = topic_model[doc, topic]
        name, party, _, _, _, date, category = document_names[doc].split('_')
        print('   {:6.2%}  {}  {:24} {:12} {}'.format(probability, date, name, party, category))

 6 Cum, Ex, Geschäft, Banken, Skandal, Staatsanwaltschaft, ...
   99.23%  2018-11-27  soren-bartol             spd          finanzen
   99.12%  2018-11-28  stefan-schwartze         spd          demokratie-und-bürgerrechte
   93.71%  2017-09-11  wolfgang-kubicki         fdp          demokratie-und-bürgerrechte
   93.37%  2017-09-11  wolfgang-kubicki         fdp          demokratie-und-bürgerrechte
   89.74%  2018-12-03  lars-klingbeil           spd          demokratie-und-bürgerrechte
   86.07%  2018-11-13  annalena-baerbock        die-grünen   demokratie-und-bürgerrechte
   82.22%  2018-11-14  margit-stumpp            die-grünen   finanzen
   81.20%  2018-12-12  nicole-westig            fdp          demokratie-und-bürgerrechte
   80.38%  2018-06-05  dr-katarina-barley       spd          verbraucherschutz
   78.74%  2018-11-13  claudia-muller           die-grünen   finanzen
61 CETA, EU, Abkomme, Freihandelsabkommen, öffentlich, Verhandlung, ...
   99.11%  2018-11-22  stefan-sauer       

## Topics with representative documents
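
The document names consist of seven fields separated by underscores, and `url_for_answer` in the cell below rebuilds the name of the matching `.url` file from five of them. The following is a minimal sketch of that mapping, based on the names visible in the output above. The field labels (`question_id`, `question_date`, `answer_id`, `answer_date`) are our reading of the pattern and are not names used by the notebook.

```python
from collections import namedtuple

# Field labels are assumptions derived from the visible document names.
DocumentName = namedtuple(
    'DocumentName',
    'name party question_id question_date answer_id answer_date category')

def parse_document_name(document_name):
    """Split a document name into its seven underscore-separated fields."""
    return DocumentName(*document_name.split('_'))

def url_file_name(document_name):
    """File name of the '.url' file that belongs to the question."""
    d = parse_document_name(document_name)
    return '_'.join([d.name, d.party, d.question_id,
                     d.question_date, d.category]) + '.url'

example = 'andrea-nahles_spd_Q0171_2018-03-27_A01_2018-05-02_kinder-und-jugend'
print(parse_document_name(example).answer_date)
print(url_file_name(example))
```

A name with a different number of fields raises a `TypeError` in this sketch, which makes malformed names visible early.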

In [13]:
show_max_documents = 3
show_min_probability = 0.9
sample_topics = rnd.sample(range(n_topics), 5)
rest = []

def url_for_answer(document_name):
    # The .url file name omits the 5th and 6th field of the document name.
    name, party, q, date, _, _, category = document_name.split('_')
    url_file = corpus_dir / ('_'.join([name, party, q, date, category]) + '.url')
    try:
        return url_file.read_text()
    except OSError:
        # Missing or unreadable file.
        return 'URL not found'

for topic in range(n_topics):
  
    most_focussed = np.argsort(topic_model[:, topic])[:-show_max_documents-1:-1]
    most_focussed = [doc for doc in most_focussed if topic_model[doc, topic] > show_min_probability]
    
    if not most_focussed:
        rest.append(topic)
        continue

    print('{:2} {}'.format(topic, topic_descriptions[topic]))

    for doc in most_focussed:
        probability = topic_model[doc, topic]
        name, party, _, _, _, date, category = document_names[doc].split('_')
        print('   {:6.2%}  {}  {:24} {:12} {}'.format(probability, date, name, party, category))
        print(10 * ' ', url_for_answer(document_names[doc]))

        
print('\nTopics that never contribute to a document up to {:.0%}:'.format(show_min_probability))
for topic in rest:
    print('{:2} {}'.format(topic, topic_descriptions[topic]))
    

 1 Energie, Deutschland, Klimaschutz, Energiewende, wollen, müssen, ...
   99.10%  2017-08-11  peter-bleser             cdu          wirtschaft
           https://www.abgeordnetenwatch.de/profile/peter-bleser/question/2017-08-11/283319
   97.31%  2018-12-04  dr-frank-steffel         cdu          umwelt
           https://www.abgeordnetenwatch.de/profile/dr-frank-steffel/question/2018-11-28/307583
   90.75%  2018-12-22  markus-uhl               cdu          umwelt
           https://www.abgeordnetenwatch.de/profile/markus-uhl/question/2018-12-17/308278
 2 Cannabis, Konsum, Legalisierung, Droge, Jugendliche, Bundestag, ...
   99.27%  2018-01-16  sybille-benning          cdu          demokratie-und-bürgerrechte
           https://www.abgeordnetenwatch.de/profile/sybille-benning/question/2017-12-21/295618
   99.25%  2018-03-22  volkmar-klein            cdu          demokratie-und-bürgerrechte
           https://www.abgeordnetenwatch.de/profile/volkmar-klein/question/2018-03-18/297709
   99

           https://www.abgeordnetenwatch.de/profile/dr-fritz-felgentreu/question/2018-11-09/306495
   91.45%  2018-12-19  dr-fritz-felgentreu      spd          finanzen
           https://www.abgeordnetenwatch.de/profile/dr-fritz-felgentreu/question/2018-11-17/307084
27 Einkommen, wollen, mittler, Prozent, entlasten, Steuer, ...
   99.15%  2017-09-23  christine-lambrecht      spd          soziales
           https://www.abgeordnetenwatch.de/profile/christine-lambrecht/question/2017-09-21/293089
   97.89%  2017-08-17  sven-lehmann             die-grünen   finanzen
           https://www.abgeordnetenwatch.de/profile/sven-lehmann/question/2017-08-02/281693
28 Frage, Anfrage, de, politisch, Website, informieren, ...
   98.70%  2017-08-24  hermann-grohe            cdu          gesundheit
           https://www.abgeordnetenwatch.de/profile/hermann-grohe/question/2017-08-21/285293
   98.66%  2017-12-01  hermann-grohe            cdu          demokratie-und-bürgerrechte
           https://www.a

           https://www.abgeordnetenwatch.de/profile/andreas-schwarz/question/2017-12-08/295153
51 Diesel, Fahrzeug, Hersteller, müssen, Grenzwert, muss, ...
   98.17%  2017-08-07  christine-aschenberg-dugnus fdp          demokratie-und-bürgerrechte
           https://www.abgeordnetenwatch.de/profile/christine-aschenberg-dugnus/question/2017-08-04/282036
   90.95%  2018-03-05  dr-rolf-mutzenich        spd          verkehr-und-infrastruktur
           https://www.abgeordnetenwatch.de/profile/dr-rolf-mutzenich/question/2018-03-02/297404
52 offen, Software, digitale, öffentlich, Verwaltung, Open, ...
   98.89%  2018-08-15  dr-katarina-barley       spd          inneres-und-justiz
           https://www.abgeordnetenwatch.de/profile/dr-katarina-barley/question/2018-06-23/299844
53 Kindertagespflege, wollen, Bund, gut, Qualität, Land, ...
   99.60%  2017-08-21  sonja-steffen            spd          kinder-und-jugend
           https://www.abgeordnetenwatch.de/profile/sonja-steffen/question/201

           https://www.abgeordnetenwatch.de/profile/gerhard-zickenheiner/question/2017-09-12/290708
74 Baden, Württemberg, EZB, Deutschland, Stuttgart, Zentralbank, ...
   99.01%  2017-09-06  thomas-jarzombek         cdu          demokratie-und-bürgerrechte
           https://www.abgeordnetenwatch.de/profile/thomas-jarzombek/question/2017-08-03/281750
   99.01%  2017-09-06  thomas-jarzombek         cdu          demokratie-und-bürgerrechte
           https://www.abgeordnetenwatch.de/profile/thomas-jarzombek/question/2017-08-03/281754
   97.25%  2018-11-23  dr-kirsten-kappert-gonther die-grünen   gesundheit
           https://www.abgeordnetenwatch.de/profile/dr-kirsten-kappert-gonther/question/2018-11-16/307050
75 spielen, rollen, ...
   98.38%  2018-08-31  katja-dorner             die-grünen   internationales
           https://www.abgeordnetenwatch.de/profile/katja-dorner/question/2017-11-07/294472
   97.80%  2018-11-29  dr-gregor-gysi           die-linke    inneres-und-justiz
        

           https://www.abgeordnetenwatch.de/profile/daniela-ludwig/question/2018-08-24/301744
   93.85%  2018-02-28  dr-katarina-barley       spd          familie
           https://www.abgeordnetenwatch.de/profile/dr-katarina-barley/question/2018-01-12/296080
   90.10%  2017-09-01  christoph-plos           cdu          familie
           https://www.abgeordnetenwatch.de/profile/christoph-plos/question/2017-08-31/288216

Topics that never contribute to a document up to 90%:
 0 EU, europäisch, europäische, ...
 3 Ehe, Frage, Mensch, Jahr, Gesellschaft, Diskriminierung, ...
11 Bundeswehr, Einsatz, NATO, Soldat, Sicherheit, Jahr, ...
13 Grundeinkommen, bedingungslos, Frau, BGE, Frage, ...
14 Polizei, Straftat, Sicherheit, muss, Rechtsstaat, müssen, ...
15 Russland, Ukraine, Deutschland, Israel, Regierung, Staat, ...
20 Bildung, Schule, gut, Land, wollen, müssen, ...
23 Mensch, wollen, gut, brauchen, müssen, Frage, ...
26 wollen, muss, Transparenz, müssen, Unternehmen, Lobbyregister, ...
36

## Persistence and Timing Report

For recommendations on how to store a trained model, see the scikit-learn guide on model persistence at https://scikit-learn.org/stable/modules/model_persistence.html and the joblib documentation on persistence, to which it refers, at https://joblib.readthedocs.io/en/latest/persistence.html.
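
Before dumping the real model, the round trip through `joblib.dump` and `joblib.load` can be tried on small stand-ins. The following is a minimal sketch under the assumption that a temporary directory replaces `topics_dir`; the toy objects are made up, and only the file names follow the notebook.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np

# Stand-ins for the notebook's objects; the values are made up.
toy_names = ['doc-a', 'doc-b']
toy_topics = np.array([[0.9, 0.1], [0.2, 0.8]])

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp)
    # joblib.dump writes the object to the given file.
    joblib.dump(toy_names, target / 'document_names.dumb')
    joblib.dump(toy_topics, target / 'topics_per_document.dumb')

    # joblib.load restores the objects from the files.
    loaded_names = joblib.load(target / 'document_names.dumb')
    loaded_topics = joblib.load(target / 'topics_per_document.dumb')

print(loaded_names)
print(np.array_equal(loaded_topics, toy_topics))
```

As the linked documentation points out, such files are based on pickle, so they should only be loaded from trusted sources and preferably with the library versions that wrote them.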
 

In [14]:
dump_start_time = time.perf_counter()

joblib.dump(document_names,  topics_dir / 'document_names.dumb')
joblib.dump(topic_model,     topics_dir / 'topics_per_document.dumb')
joblib.dump(words_per_topic, topics_dir / 'words_per_topic.dumb')
joblib.dump(words,           topics_dir / 'words.dumb')
joblib.dump(lda_algorithm,   topics_dir / 'lda_algorithm.dumb')

print('{:25}  {:8}  {}'.format('file name', 'size', 'modification time'))
for file in topics_dir.iterdir():
    print('{:25}  {:8}  {}'.format(file.name, file.stat().st_size,  
        time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(file.stat().st_mtime))))
    
dump_end_time = time.perf_counter()
print('Dumping all state took {:.2f}s.'.format(dump_end_time - dump_start_time))    

file name                  size      modification time
lda_algorithm.dumb         31642071  2019-01-28 19:56:08
words_per_topic.dumb       15819420  2019-01-28 19:56:00
document_names.dumb          616331  2019-01-28 19:55:54
topics_per_document.dumb    6157020  2019-01-28 19:55:56
words.dumb                   423221  2019-01-28 19:56:00
Dumping all state took 14.44s.


In [15]:
notebook_end_time = time.perf_counter()

print()
print(' Runtime of the notebook ')
print('-------------------------')
print('{:8.2f}s  Loading the word counts from files.'.format(
    load_end_time - load_start_time))
print('{:8.2f}s  Latent Dirichlet Allocation'.format(
    lda_end_time - lda_start_time))
print('{:8.2f}s  All calculations together'.format(
    notebook_end_time - notebook_start_time))


 Runtime of the notebook 
-------------------------
    4.15s  Loading the word counts from files.
  445.14s  Latent Dirichlet Allocation
  465.49s  All calculations together


<table style="width:100%">
  <tr>
      <td colspan="1" style="text-align:left;background-color:#0071BD;color:white">
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">
            <img alt="Creative Commons License" style="border-width:0;float:left;padding-right:10pt"
                 src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" />
        </a>
        &copy; D. Speicher, T. Dong<br/>
        Licensed under a 
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/" style="color:white">
            CC BY-NC 4.0
        </a>.
      </td>
      <td colspan="2" style="text-align:left;background-color:#66A5D1">
          <b>Acknowledgments:</b>
          This material was prepared within the project
          <a href="http://www.b-it-center.de/b-it-programmes/teaching-material/p3ml/" style="color:black">
              P3ML
          </a> 
          which is funded by the Ministry of Education and Research of Germany (BMBF)
          under grant number 01/S17064. The authors gratefully acknowledge this support.
      </td>
  </tr>
</table>