<h1 style="background-color:#0071BD;color:white;text-align:center;padding-top:0.8em;padding-bottom: 0.8em">
  LDA Spike 2 - Counting
</h1>

This notebook counts the occurrences of words in the cleaned text files. By default, the cleaned text files are expected in the folder `Cleaned`, and the count files are written to the folder `Counts`. We leave the counting to [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from `sklearn.feature_extraction.text`. Most of the time is spent splitting the matrix of all counts and storing the counts for each file separately. We invest this time so that the counts can easily be reviewed manually.

A few examples at the end of the notebook illustrate the result of the process.

<font color="darkred" /><p/>
    
__This notebook writes to and reads from your file system.__ By default, all directories used are within `~/TextData/AbgeordnetenWatch`, where `~` stands for whatever your operating system considers your home directory. To change this configuration, either change the default values in the second code cell below or edit [LDA Spike - Configuration.ipynb](./LDA%20Spike%20-%20Configuration.ipynb) and run it before you run this notebook.

<font color="black" /><p/>

This notebook operates on text files. In our case, we retrieved these texts from www.abgeordnetenwatch.de, guided by data that the site made available under the [Open Database License (ODbL) v1.0](https://opendatacommons.org/licenses/odbl/1.0/).

<p style="background-color:#66A5D1;padding-top:0.2em;padding-bottom: 0.2em" />

In [1]:
import time
import random as rnd

from pathlib import Path
import json

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
%store -r own_configuration_was_read
if not('own_configuration_was_read' in globals()): raise Exception(
    '\nReminder: You might want to run your configuration notebook before you run this notebook.' + 
    '\nIf you want to manage your configuration from each notebook, just remove this check.')

%store -r project_name
if not('project_name' in globals()): project_name = 'AbgeordnetenWatch'

%store -r text_data_dir
if not('text_data_dir' in globals()): text_data_dir = Path.home() / 'TextData'

In [3]:
cleaned_dir = text_data_dir / project_name / 'Cleaned'
counts_dir  = text_data_dir / project_name / 'Counts'

assert cleaned_dir.exists(),                      'Directory should exist.'
assert cleaned_dir.is_dir(),                      'Directory should be a directory.'
assert next(cleaned_dir.iterdir(), None) is not None, 'Directory should not be empty.'

counts_dir.mkdir(parents=True, exist_ok=True) # Creates a local directory!

In [4]:
update_only_missing_counts = True

min_df = 3    # Ignore words that occur in fewer than min_df documents. Helps to ignore misspelled words.
              # 3 is a rather low number that leads to a big vocabulary.
max_df = 0.5  # Ignore words that are in the majority of documents. Helps to ignore regular phrases.
              # 0.5 still keeps words that occur in almost every second document.
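
To make the two thresholds concrete, here is a minimal pure-Python sketch of a document-frequency filter. It is not scikit-learn's implementation. It only mirrors the documented meaning of `min_df` (an integer is an absolute number of documents) and `max_df` (a float is a fraction of all documents). The helper `filter_by_document_frequency` and the toy documents are hypothetical and exist only for this illustration.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df, max_df):
    """Return the sorted words whose document frequency lies within [min_df, max_df].

    An int threshold is an absolute document count,
    a float threshold is a fraction of the number of documents.
    """
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents does each word occur at least once?
    document_frequency = Counter()
    for tokens in tokenized_docs:
        document_frequency.update(set(tokens))

    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs

    return sorted(word for word, df in document_frequency.items() if low <= df <= high)


# Hypothetical toy corpus of four already tokenized documents.
toy_docs = [
    'Gesetz Bundestag Schule'.split(),
    'Gesetz Bundestag Tierschutz'.split(),
    'Gesetz Schule Tierschutz'.split(),
    'Gesetz Bundeswehr'.split(),
]

vocabulary = filter_by_document_frequency(toy_docs, min_df=2, max_df=0.5)
print(vocabulary)  # ['Bundestag', 'Schule', 'Tierschutz']
```

"Gesetz" is dropped because it occurs in all four documents (more than 50%), and "Bundeswehr" is dropped because it occurs in only one document (fewer than 2).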

In [5]:
notebook_start_time = time.perf_counter()

## Load the content of the cleaned text files

In [6]:
filenames = []
texts = []

files = list(cleaned_dir.glob('*A*.txt')) # Answers
files.sort()

for file in files:
    filenames.append(file.stem)
    texts.append(file.read_text())
    
print('Read {} documents: "{}" ... "{}"'.format(len(filenames), filenames[0], filenames[-1]))

Read 7696 documents: "achim-kessler_die-linke_Q0001_2017-08-06_A01_2017-08-11_gesundheit" ... "zaklin-nastic_die-linke_Q0008_2017-10-25_A01_2018-09-24_demokratie-und-bürgerrechte"


## Count the words

See: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [7]:
counter_start_time = time.perf_counter()

counter = CountVectorizer(analyzer='word', min_df=min_df, max_df=max_df, lowercase=False)

word_counts = counter.fit_transform(texts)
words       = counter.get_feature_names_out()  # scikit-learn >= 1.0; older versions use get_feature_names()

print('Counted {} unique words.'.format(len(words)))

counter_end_time = time.perf_counter()
print('Counting took {:.2f}s.'.format(counter_end_time - counter_start_time))

Counted 19670 unique words.
Counting took 0.85s.


## Write the word counts into separate files

In [8]:
dump_start_time = time.perf_counter()

for doc, filename in enumerate(filenames):

    target_file = counts_dir / (filename + '.count')
    if update_only_missing_counts and target_file.exists(): continue

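    counts = {}
    doc_word_counts = word_counts[doc, :]        # Sparse row (1 x vocabulary size) of this document
    _, word_indices = doc_word_counts.nonzero()  # Column indices of the words that occur in it

    for word in word_indices:
        counts[words[word]] = str(doc_word_counts[0, word])  # Counts are stored as strings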

    target_file.write_text(json.dumps(counts, ensure_ascii=False, indent=0, sort_keys=True))
    print('\rWrote ' + filename, end='')

dump_end_time = time.perf_counter()
print('\nDumping the word counts to files took {:.2f}s.'.format(dump_end_time - dump_start_time))


Dumping the word counts to files took 0.10s.
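
Because the counts are written as strings, anyone who reads a `.count` file back for further processing will probably want to convert them to integers. The following is a minimal sketch of such a reader. The helper name `load_counts`, the demo file, and the explicit `utf-8` encoding are assumptions made for this illustration; the notebook itself relies on the platform default encoding.

```python
import json
import tempfile
from pathlib import Path


def load_counts(count_file):
    """Read a .count file and return a dict mapping each word to an int count."""
    raw = json.loads(Path(count_file).read_text(encoding='utf-8'))
    return {word: int(count) for word, count in raw.items()}


# Round trip with a hypothetical demo file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / 'demo.count'
    demo_file.write_text(
        json.dumps({'Gesetz': '4', 'Schutz': '1'},
                   ensure_ascii=False, indent=0, sort_keys=True),
        encoding='utf-8')
    loaded = load_counts(demo_file)

print(loaded)  # {'Gesetz': 4, 'Schutz': 1}
```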


## Five most frequent words for some random documents

In [9]:
# For the slice notation [from:to:step] see the
# reference https://docs.python.org/3/library/stdtypes.html?highlight=slice%20notation#common-sequence-operations or the
# explanation https://stackoverflow.com/questions/509211/understanding-pythons-slice-notation/509295#509295

# For sorting with argsort see
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
# https://docs.scipy.org/doc/numpy/reference/routines.sort.html


sample_documents = rnd.sample(range(len(filenames)), 7)

for doc in sample_documents:

    filename = filenames[doc]
    
    print('{:32.32}: '.format(filename), end ='')
    
    word_count    = word_counts[doc, :].toarray().flatten()
    most_frequent = np.argsort(word_count)[:-6:-1]
    
    for word in most_frequent:
        print('{:4} {:12.12}'.format(word_counts[doc, word], '"' + words[word] + '"'), end = '')
    print('')

tobias-pfluger_die-linke_Q0007_2:    2 "entsprechen   2 "Traditionse   1 "bw"           1 "http"         1 "Mitglied"  
katja-kipping_die-linke_Q0059_20:    4 "Open"         3 "öffentlich"   2 "digitale"     2 "Source"       2 "sehen"     
dr-daniela-de-ridder_spd_Q0002_2:    4 "neue"         2 "gelingen"     2 "Mensch"       2 "de"           2 "finden"    
irene-mihalic_die-grünen_Q0003_2:   12 "Tierschutz"   7 "Tier"         5 "Massentierh   5 "grün"         5 "wollen"    
jorg-schneider_afd_Q0002_2017-09:    3 "Bundeswehr"   2 "Ausstattung   2 "materiell"    1 "hören"        1 "schnellen" 
bernd-riexinger_die-linke_Q0005_:    4 "sollen"       2 "persoenlich   2 "gut"          2 "Gerechtigke   2 "denken"    
mahmut-ozdemir_spd_Q0010_2018-06:    3 "Frage"        2 "engagieren"   2 "politische"   2 "sozialdemok   2 "Gespräch"  
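
The expression `np.argsort(word_count)[:-6:-1]` used above can be hard to read at first. Here is a small sketch with made-up counts (the array is hypothetical and uses distinct values so that the order is unambiguous). `np.argsort` returns the indices that would sort the array in ascending order, and the slice `[:-6:-1]` walks backwards from the last index and stops before the sixth-to-last, which yields the five largest entries in descending order.

```python
import numpy as np

# Hypothetical counts for a vocabulary of seven words in one document.
word_count = np.array([0, 3, 1, 7, 2, 5, 4])

ascending = np.argsort(word_count)  # indices from smallest to largest count
top5 = ascending[:-6:-1]            # last five indices, in reverse order

print(ascending.tolist())         # [0, 2, 4, 1, 6, 5, 3]
print(top5.tolist())              # [3, 5, 6, 1, 4]
print(word_count[top5].tolist())  # [7, 5, 4, 3, 2]
```

When several words share the same count, the order among them is not guaranteed with the default sort algorithm, so ties in the output above may appear in any order.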


In [10]:
min_len = 400
max_len = 800
example_text = ''

while (len(example_text) < min_len or len(example_text) > max_len):
    example = rnd.randrange(len(texts))  # randint(0, len(texts)) could return len(texts), an invalid index
    example_text = texts[example]

print(30 * '-' + ' Cleaned text: ' + 30 * '-')
print(example_text)

print(30 * '-' + ' Word counts: ' + 30 * '-')
counts = json.loads((counts_dir / (filenames[example] + '.count')).read_text())
print(counts)

print(30 * '-' + ' Words not counted: ' + 30 * '-')
print(', '.join([word for word in example_text.split(' ') if word not in counts]))

def df_to_text(df):
    return "{}".format(df) if isinstance(df, int) else '{:.0%} of the'.format(df)

print('(We did not count words that are in fewer than {} documents or in more than {} documents.)'.format(
    df_to_text(min_df), df_to_text(max_df)))


------------------------------ Cleaned text: ------------------------------
Dank Frage persönlich konkret neu Gesetz freuen vorangetrieben gut Schutz Bewohner Bahnstrecken gemeinsam Bürgerinitiativen entsprechend drucken erreichen laut Güterzüge Schiene fahren dürfen Betroffene Pankow konkret helfen Wahl deutsche Bundestag einsetzen Mieterinnen Mieter Energieversorgung gleich Recht einräumen Hausbesitzer Verabschiedung Mieterstrommodells erreichen mögen Aufstockung Städtebaufördermittel erwähnen SPD durchsetzen können mitteln können Wahlkreis u.a. Schule sanieren Gesetz zeigen Beharrlichkeit ständig Thematisieren Politik auszahlen Gesetz Mittelaufstockung Mehrheit absehbar freuen Gesetz Broschüre finden weit dingen Bundestag Wahlkreis erreichen können finden Download http://www.klaus-mindrup.de/content/pressebilder-downloads Rückfrage stehen Verfügung
------------------------------ Word counts: ------------------------------
{'Aufstockung': '1', 'Bahnstrecken': '1', 'Betroffene': '1', 

In [11]:
notebook_end_time = time.perf_counter()

print()
print(' Runtime of the notebook ')
print('-------------------------')
print('{:8.2f}s  Counting the words'.format(
    counter_end_time - counter_start_time))
print('{:8.2f}s  Dumping the word counts to files'.format(
    dump_end_time - dump_start_time))
print('{:8.2f}s  All calculations together'.format(
    notebook_end_time - notebook_start_time))


 Runtime of the notebook 
-------------------------
    0.85s  Counting the words
    0.10s  Dumping the word counts to files
    4.00s  All calculations together


<table style="width:100%">
  <tr>
      <td colspan="1" style="text-align:left;background-color:#0071BD;color:white">
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">
            <img alt="Creative Commons License" style="border-width:0;float:left;padding-right:10pt"
                 src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" />
        </a>
        &copy; D. Speicher, T. Dong<br/>
        Licensed under a 
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/" style="color:white">
            CC BY-NC 4.0
        </a>.
      </td>
      <td colspan="2" style="text-align:left;background-color:#66A5D1">
          <b>Acknowledgments:</b>
          This material was prepared within the project
          <a href="http://www.b-it-center.de/b-it-programmes/teaching-material/p3ml/" style="color:black">
              P3ML
          </a> 
          which is funded by the Ministry of Education and Research of Germany (BMBF)
          under grant number 01/S17064. The authors gratefully acknowledge this support.
      </td>
  </tr>
</table>