<h1 style="background-color:#0071BD;color:white;text-align:center;padding-top:0.8em;padding-bottom: 0.8em">
  LDA Spike 0 - Create a Corpus of Answers on abgeordnetenwatch.de
</h1>

This notebook creates and updates a corpus of answers given by deputies to questions asked by citizens on www.abgeordnetenwatch.de. For more information, see:

  * Information about the deputies of the Bundestag: https://www.abgeordnetenwatch.de/bundestag
  * Application programming interface (API) to access the data: https://www.abgeordnetenwatch.de/api

<font color="darkred" />
    
__This notebook writes to and reads from your file system.__ By default, all directories used are within `~/TextData/AbgeordnetenWatch`, where `~` stands for whatever your operating system considers your home directory. To change this configuration, either change the default values in the second code cell below or edit [LDA Spike - Configuration.ipynb](./LDA%20Spike%20-%20Configuration.ipynb) and run it before you run this notebook.

__This notebook sends HTTP requests to www.abgeordnetenwatch.de.__ You might want to check what kind of requests this notebook sends in your name and what you are allowed to do with the data according to 
the [Open Database License (ODbL) v1.0](https://opendatacommons.org/licenses/odbl/1.0/). If you need to configure a proxy, you can do so below or in [LDA Spike - Configuration.ipynb](./LDA%20Spike%20-%20Configuration.ipynb). In the latter case, make sure to run that notebook before this one.

<font color="black" />
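
For orientation, `requests` expects the `proxies` argument as a dictionary that maps URL schemes to proxy URLs. The following is a minimal sketch, assuming a hypothetical proxy at `proxy.example.org:8080`; replace it with your own values.

```python
# Hypothetical proxy host and port, for illustration only.
proxy_url = 'http://proxy.example.org:8080'

# requests selects the proxy by the scheme of the requested URL.
proxies = {'http': proxy_url, 'https': proxy_url}

# An empty dictionary (the default used in this notebook) sets no explicit proxy.
no_proxies = {}
```

Note that even with an empty dictionary, `requests` may still honor proxy settings from environment variables such as `HTTPS_PROXY`.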

<p style="background-color:#66A5D1;padding-top:0.2em;padding-bottom: 0.2em" />

In [1]:
from pathlib import Path
import json

import requests
from lxml import html

In [2]:
%store -r own_configuration_was_read
if 'own_configuration_was_read' not in globals(): raise Exception(
    '\nReminder: You might want to run your configuration notebook before you run this notebook.' + 
    '\nIf you want to manage your configuration from each notebook, just remove this check.')

%store -r proxies
if 'proxies' not in globals(): proxies = {}

%store -r project_name
if 'project_name' not in globals(): project_name = 'AbgeordnetenWatch'

%store -r text_data_dir
if 'text_data_dir' not in globals(): text_data_dir = Path.home() / 'TextData'

In [3]:
corpus_dir = text_data_dir / project_name / 'Corpus'
corpus_dir.mkdir(parents=True, exist_ok=True) # Creates a local directory!

In [4]:
# Set these variables to False to force an update of all deputies and of all known questions.
# To force an update of a single deputy otherwise, just delete that deputy's JSON file.
# Additional questions are only recognized if the deputy gets updated.

update_only_missing_deputies = True
update_only_missing_answers = True
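
The effect of these flags can be sketched as a small decision helper. `needs_update` is a hypothetical name that does not appear in this notebook; the loops further down inline an equivalent check with `continue`.

```python
from pathlib import Path
import tempfile

def needs_update(file, update_only_missing):
    """Return True if `file` should be (re)created.

    With update_only_missing=True an existing file is left alone;
    with update_only_missing=False every file is rewritten.
    """
    return not (update_only_missing and Path(file).exists())

with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / 'existing.json'
    existing.write_text('{}')
    missing = Path(tmp) / 'missing.json'

    print(needs_update(existing, True))    # existing file is skipped
    print(needs_update(missing, True))     # missing file is fetched
    print(needs_update(existing, False))   # forced update
```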

In [5]:
statistics = [
    {'name': 'number of deputy files',                    'pattern': '*.json',    'before': -1, 'after': -1},
    {'name': 'number of questions',                       'pattern': '*.url',     'before': -1, 'after': -1},
    {'name': 'number of answers',                         'pattern': '*.txt',     'before': -1, 'after': -1},
    {'name': 'number of questions with multiple answers', 'pattern': '*A02*.txt', 'before': -1, 'after': -1},
]

for statistic in statistics:
    value = len(list(corpus_dir.glob(statistic['pattern'])))
    statistic['before'] = value
    print('{:>42}: {:7}'.format(statistic['name'], statistic['before']))

                    number of deputy files:     714
                       number of questions:   10143
                         number of answers:    7674
 number of questions with multiple answers:      50


## Utility function guessing the schema from given JSON data

In [6]:
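def print_json_schema(data, indentation=-1):
    # Prints the structure of JSON-like data. For lists, only the first
    # element is inspected, so keys missing there are not reported.
    # The parameter is called `data` to avoid shadowing the json module.

    if isinstance(data, dict):
        print()
        indentation += 1
        for key, value in data.items():
            print(indentation * '    ', key, end=': ')
            print_json_schema(value, indentation)
        indentation -= 1
        print()
    
    elif isinstance(data, list):
        length = len(data)
        if length:
            print(length, end='x ')
            print_json_schema(data[0], indentation)
        else:
            print('0x ???')
    
    else:
        print(type(data))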

In [7]:
print_json_schema({'countries': ['Germany', 'France'], 'politicians': [{'first': 'Angela', 'last': 'Merkel'}, {}, {}]})


 countries: 2x <class 'str'>
 politicians: 3x 
     first: <class 'str'>
     last: <class 'str'>




In [8]:
print_json_schema([{'first': 'Angela', 'middle': 'Dorothea ', 'last': 'Merkel'}, {'first': 'Emmanuel', 'last': 'Macron'}])

2x 
 first: <class 'str'>
 middle: <class 'str'>
 last: <class 'str'>



In [9]:
print_json_schema([{'first': 'Emmanuel', 'last': 'Macron'}, {'first': 'Angela', 'middle': 'Dorothea ', 'last': 'Merkel'}])

2x 
 first: <class 'str'>
 last: <class 'str'>
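
The last two calls differ only in the order of the list elements, yet the optional key `middle` shows up in just one of the results, because `print_json_schema` inspects only the first element of each list. A possible variant, sketched here under the assumption that a union of keys over all elements is wanted, returns the schema as a nested structure instead of printing it. `merged_schema` and `_merge` are hypothetical helpers that are not used elsewhere in this notebook.

```python
def merged_schema(data):
    """Return a nested description of `data`, merging keys over all list items."""
    if isinstance(data, dict):
        return {key: merged_schema(value) for key, value in data.items()}
    if isinstance(data, list):
        merged = None                      # stays None for an empty list
        for item in data:
            merged = _merge(merged, merged_schema(item))
        return [merged]
    return type(data).__name__

def _merge(left, right):
    """Combine two schema descriptions found at the same position."""
    if left is None:
        return right
    if right is None:
        return left
    if isinstance(left, dict) and isinstance(right, dict):
        result = dict(left)
        for key, value in right.items():
            result[key] = _merge(left.get(key), value)
        return result
    if isinstance(left, list) and isinstance(right, list):
        return [_merge(left[0], right[0])]
    if isinstance(left, str) and isinstance(right, str):
        # Scalar types: record every type name that was seen.
        return '|'.join(sorted(set(left.split('|')) | set(right.split('|'))))
    return 'mixed'                         # e.g. a dict in one item, a scalar in another

people = [{'first': 'Emmanuel', 'last': 'Macron'},
          {'first': 'Angela', 'middle': 'Dorothea ', 'last': 'Merkel'}]
print(merged_schema(people))
```

With this variant the key `middle` is reported regardless of the order of the list elements.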



## Retrieve the list of parliaments

In [10]:
parliaments_url  = 'https://www.abgeordnetenwatch.de/api/parliaments.json'
parliaments_json = requests.get(parliaments_url, proxies=proxies).json() # Request to abgeordnetenwatch.de!

In [11]:
print_json_schema(parliaments_json)


 meta: 
     license: 
         name: <class 'str'>
         url: <class 'str'>

     contributer: 0x ???
     subsets: 0x ???

 parliaments: 78x 
     name: <class 'str'>
     meta: 
         uuid: <class 'str'>

     dates: 
         start: <class 'str'>
         end: <class 'str'>
         election: <class 'str'>

     datasets: 
         deputies: 
             by-name: <class 'str'>
             by-uuid: <class 'str'>

         candidates: 
             by-name: <class 'str'>
             by-uuid: <class 'str'>

         constituencies: 
             by-name: <class 'str'>
             by-uuid: <class 'str'>

         polls: 
             by-name: <class 'str'>
             by-uuid: <class 'str'>

         committees: 
             by-name: <class 'str'>
             by-uuid: <class 'str'>






In [12]:
print(parliaments_json['meta']['license']['name'])
print(parliaments_json['meta']['license']['url'])

Open Database License (ODbL) v1.0
https://opendatacommons.org/licenses/odbl/1.0/


In [13]:
parliaments = parliaments_json['parliaments']

for parliament in parliaments[:19]:
    print(parliament['name'], end=', ')

Baden-Württemberg, Baden-Württemberg 2006-2011, Baden-Württemberg 2011-2016, Bayern, Bayern 2008-2013, Bayern 2013-2018, Berlin, Berlin 2006-2011, Berlin 2011-2016, Brandenburg, Brandenburg 2009-2014, Bremen, Bremen 2007-2011, Bremen 2011-2015, Bundestag, Bundestag 2005-2009, Bundestag 2009-2013, Bundestag 2013-2017, Bürgermeisterwahlen Nordrhein-Westfalen 2009, 

### Example: Bundestag

In [14]:
search_parliament_name = 'Bundestag'

parliament = next(p for p in parliaments if p['name'] == search_parliament_name)

deputies_url = parliament['datasets']['deputies']['by-name']
parliament_name = deputies_url.split('/')[-2]

print('Parliament  :', parliament['name'])
print('  in URLs   :', parliament_name)
print('Dates       :', parliament['dates'])
print('Deputies URL:', deputies_url)

Parliament  : Bundestag
  in URLs   : bundestag
Dates       : {'start': '2017-07-20', 'end': '2021-10-23', 'election': '2017-09-24'}
Deputies URL: https://www.abgeordnetenwatch.de/api/parliament/bundestag/deputies.json
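
Note that `next` raises a bare `StopIteration` if no parliament has the requested name. The sketch below shows one way to report that case more clearly. It uses abbreviated sample data in the shape of the API response, and `find_parliament` is a hypothetical helper.

```python
def find_parliament(parliaments, name):
    """Return the first parliament called `name` or raise a descriptive error."""
    match = next((p for p in parliaments if p['name'] == name), None)
    if match is None:
        known = ', '.join(p['name'] for p in parliaments)
        raise LookupError('No parliament named {!r}. Known names: {}'.format(name, known))
    return match

# Abbreviated sample data in the shape of the API response.
sample = [
    {'name': 'Bundestag',
     'datasets': {'deputies': {'by-name':
         'https://www.abgeordnetenwatch.de/api/parliament/bundestag/deputies.json'}}},
]

parliament = find_parliament(sample, 'Bundestag')
deputies_url = parliament['datasets']['deputies']['by-name']
print(deputies_url.split('/')[-2])   # the name as used in URLs
```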


## Retrieve the list of all deputies of the Bundestag

In [15]:
print(deputies_url)

https://www.abgeordnetenwatch.de/api/parliament/bundestag/deputies.json


In [16]:
deputies_json = requests.get(deputies_url, proxies=proxies).json() # Request to abgeordnetenwatch.de!

In [17]:
print_json_schema(deputies_json)


 meta: 
     license: 
         name: <class 'str'>
         url: <class 'str'>

     contributer: 1x <class 'str'>
     subsets: 3x <class 'str'>

 profiles: 716x 
     meta: 
         status: <class 'str'>
         edited: <class 'str'>
         uuid: <class 'str'>
         username: <class 'str'>
         questions: <class 'int'>
         answers: <class 'int'>
         standard_replies: <class 'int'>
         url: <class 'str'>

     personal: 
         degree: <class 'NoneType'>
         first_name: <class 'str'>
         last_name: <class 'str'>
         gender: <class 'str'>
         birthyear: <class 'str'>
         education: <class 'str'>
         profession: <class 'str'>
         location: 
             country: <class 'str'>
             state: <class 'str'>
             city: <class 'str'>
             postal_code: <class 'str'>

         picture: 
             url: <class 'str'>
             copyright: <class 'str'>


     party: <class 'str'>
     fraction: <class 'str

In [18]:
print(deputies_json['meta']['license']['name'])
print(deputies_json['meta']['license']['url'])

Open Database License (ODbL) v1.0
https://opendatacommons.org/licenses/odbl/1.0/


In [19]:
deputies = deputies_json['profiles']

for deputy in deputies[:23]:
    print(deputy['meta']['username'], end=', ')

alexander-graf-lambsdorff, martin-schulz-1, michael-theurer, fabio-de-masi, sarah-ryglewski, anke-domscheit-berg, beatrix-von-storch, konstantin-kuhle, johannes-schraps, armin-paul-hampel, petr-bystron, waldemar-herdt, manfred-todtenhausen, norbert-muller-4, alexander-krauss, dr-juergen-martens, alexander-gauland, steffen-kotre, frauke-petry, lars-herrmann, christoph-neumann, siegbert-droese, detlev-spangenberg, 

### Example: "Ulrich Kelber" in the response about all deputies

In [20]:
# API URL for a deputy file based on the structure of parliament['datasets']['deputies']['by-name'] and 
# the example https://www.abgeordnetenwatch.de/api/parliament/bundestag/profile/angela-merkel/profile.json
# as given on https://www.abgeordnetenwatch.de/api

def deputy_api_url(deputies_url, deputy_name):
    return '/'.join(deputies_url.split('/')[:-1] + ['profile', deputy_name, 'profile.json'] )

In [21]:
search_first_name = 'Ulrich'
search_last_name  = 'Kelber'

deputy = next(a for a in deputies 
                  if a['personal']['first_name'] == search_first_name 
                      and a['personal']['last_name']  == search_last_name)

deputy_name = deputy['meta']['username']
deputy_url = deputy_api_url(deputies_url, deputy_name)

print('Deputy       :', deputy['personal']['first_name'], deputy['personal']['last_name'], '('+deputy['party']+')')
print('  in URLs    :', deputy_name)
print('Profile URL  :', deputy['meta']['url'])
print('  API URL    :', deputy_url)
print('Year of birth:', deputy['personal']['birthyear'])
print('Education    :', deputy['personal']['education'])
print('Election     :', deputy['constituency']['result'] + '%', 'in', deputy['constituency']['name'])
for i, committee in enumerate(deputy['committees']):
    print('Committee {}  :'.format(i), committee['name'])
print('Answers      :', deputy['meta']['answers'], 'regular,', deputy['meta']['standard_replies'], 
      'standard, for', deputy['meta']['questions'], 'questions')

Deputy       : Ulrich Kelber (SPD)
  in URLs    : ulrich-wolfgang-kelber
Profile URL  : https://www.abgeordnetenwatch.de/profile/ulrich-wolfgang-kelber
  API URL    : https://www.abgeordnetenwatch.de/api/parliament/bundestag/profile/ulrich-wolfgang-kelber/profile.json
Year of birth: 1968
Education    : Diplom-Informatiker
Election     : 34,9% in Bonn
Committee 0  : Ausschuss Digitale Agenda
Answers      : 18 regular, 0 standard, for 19 questions


### Example: Questions to "Ulrich Kelber" listed in his profile

In [22]:
print(deputy_url)

https://www.abgeordnetenwatch.de/api/parliament/bundestag/profile/ulrich-wolfgang-kelber/profile.json


In [23]:
deputy_json = requests.get(deputy_url, proxies=proxies).json() # Request to abgeordnetenwatch.de!

In [24]:
print_json_schema(deputy_json)


 profile: 
     meta: 
         status: <class 'str'>
         edited: <class 'str'>
         uuid: <class 'str'>
         username: <class 'str'>
         questions: <class 'int'>
         answers: <class 'int'>
         standard_replies: <class 'int'>
         url: <class 'str'>

     personal: 
         degree: <class 'NoneType'>
         first_name: <class 'str'>
         last_name: <class 'str'>
         gender: <class 'str'>
         birthyear: <class 'str'>
         education: <class 'str'>
         profession: <class 'str'>
         location: 
             country: <class 'str'>
             state: <class 'str'>
             city: <class 'str'>
             postal_code: <class 'str'>

         picture: 
             url: <class 'str'>
             copyright: <class 'str'>


     party: <class 'str'>
     fraction: <class 'str'>
     parliament: 
         name: <class 'str'>
         uuid: <class 'str'>
         retired: <class 'str'>

     roles: 0x ???
     constituency: 
   

In [25]:
questions = deputy_json['profile']['questions'] 

for question in questions[:11]:
    print(question['date'], question['category'], len(question['answers']), end = ', ')

2018-12-21 Land- und Forstwirtschaft 0, 2018-12-05 Demokratie und Bürgerrechte 1, 2018-11-02 Demokratie und Bürgerrechte 1, 2018-09-25 Umwelt 1, 2018-09-20 Umwelt 1, 2018-09-14 Umwelt 1, 2018-08-23 Gesundheit 1, 2018-07-24 Gesundheit 1, 2018-07-24 Demokratie und Bürgerrechte 1, 2018-06-26 Internationales 1, 2018-06-18 Land- und Forstwirtschaft 1, 

## Utility functions for naming files and extracting text from HTML


In [26]:
def deputy_file_name_part(deputy):
    return '_'.join([deputy['meta']['username'], deputy['party'].lower().replace(' ', '-')])
                     
def question_file_name_parts(q, question):
    question_nr = 'Q{:04}'.format(q + 1) # Maximum in 12/18: 344 questions (Andrea Nahles)
    question_id = '_'.join([question_nr, question['date']])
    category    = question['category'].lower().replace(' ', '-')
    return question_id, category
    
def answer_file_name_part(a, answer):
    answer_nr   = 'A{:02}'.format(a + 1) # Maximum in 12/18: 2 answers to one question (occurs often)
    return '_'.join([answer_nr, answer['date']])

In [27]:
print(deputy_file_name_part(deputy))

if questions:
    oldest = len(questions) - 1
    print(question_file_name_parts(oldest, questions[oldest]))
    
    answers = questions[oldest]['answers']
    if answers:
        print(answer_file_name_part(0, answers[0]))

ulrich-wolfgang-kelber_spd
('Q0037_2017-07-24', 'inneres-und-justiz')
A01_2017-07-25
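
Further down, these parts are joined with underscores into file names such as `<deputy>_<question>_<answer>_<category>.txt`. For later analysis it can be handy to split such a name again. The following is a minimal sketch, assuming that username, party, and category never contain an underscore; `parse_answer_file_name` is a hypothetical helper that is not used elsewhere in this notebook.

```python
def parse_answer_file_name(file_name):
    """Split an answer file name into its parts.

    Assumption: username, party and category contain no underscore.
    """
    stem = file_name.rsplit('.', 1)[0]   # drop the '.txt' extension
    (username, party, question_nr, question_date,
     answer_nr, answer_date, category) = stem.split('_')
    return {
        'username': username,
        'party': party,
        'question_nr': int(question_nr[1:]),   # 'Q0037' -> 37
        'question_date': question_date,
        'answer_nr': int(answer_nr[1:]),       # 'A01' -> 1
        'answer_date': answer_date,
        'category': category,
    }

parts = parse_answer_file_name(
    'ulrich-wolfgang-kelber_spd_Q0037_2017-07-24_A01_2017-07-25_inneres-und-justiz.txt')
print(parts['question_nr'], parts['answer_date'], parts['category'])
```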


In [28]:
def extract_answers_as_text(html_text):

    page = html.fromstring(html_text)
    for nocontent in page.find_class('robots-nocontent'): nocontent.clear()
    
    answers  = page.find_class('question__answer')

    # Names of citizens asking a question are "encrypted" but still unique.
    # To keep them even more private, we replace all these names by 'N.N.'.
    for answer in answers:
        for name in answer.find_class('crypto-font'): name.text = ' N.N. '
    
    for author in page.find_class('question__question__author'): author.clear()
    for author in page.find_class('question__answer__author'): author.clear()
    
    return [answer.text_content() for answer in answers]

In [29]:
encrypted_name = 'Hijklmn'

html_text = '''
<!DOCTYPE html>
<html lang="de" dir="ltr">
  <head></head>
  <body>
    <main id="content">

<div class="container-small">
  <div class="question__question__title">    
    <p>... Antrag der FDP gestimmt, ... dass Deutschland eine Abschaffung der Sommerzeit wünscht ... </p>
  </div>
    <p class="question__question__author">Von: 
      <span class="robots-nocontent">
        <span class="crypto-font">Abcdefg Hijklmn</span>
      </span>
    </p>
</div>

<div class="question__answer-wrapper">
  <div class="question__answer">
    <p class="question__answer__author">
      Antwort von <strong>Ulrich Kelber (SPD)</strong>
      <span>26. März. 2018 - 14:48<br>
        <small>Dauer bis zur Antwort: 1 Tag 6 Stunden</small>
      </span>
    </p>
    <p>Sehr geehrter Herr <span class="crypto-font">Hijklmn</span>,</p>
    <p>vielen Dank für Ihre Anfrage zur Sommerzeit.<br /> Ich denke, ...</p>
    <p>Mit freundlichem Gruß <br />Ulrich Kelber</p>      
  </div>
</div>

    </main>
  </body>
</html>
'''

print('Original HTML contains', 'a' if encrypted_name in html_text else 'no', 'reference to the encrypted name.')
for a, answer_text in enumerate(extract_answers_as_text(html_text)):
    print('Answer {}:'.format(a+1))
    for line in answer_text.split('\n'):
        print(line)
    print('Extracted answer contains', 'a' if encrypted_name in answer_text else 'no', 'reference to the encrypted name.')

Original HTML contains a reference to the encrypted name.
Answer 1:

    Sehr geehrter Herr  N.N. ,
    vielen Dank für Ihre Anfrage zur Sommerzeit. Ich denke, ...
    Mit freundlichem Gruß Ulrich Kelber      
  
Extracted answer contains no reference to the encrypted name.


## Create corpus of all answers of all deputies of the Bundestag

### Create or update deputy files (JSON) and question files (URL)

In [30]:
success = []
failure = []

for d, deputy in enumerate(deputies):

    deputy_prefix = deputy_file_name_part(deputy)
    deputy_file = corpus_dir / (deputy_prefix + '.json')

    try:
        if update_only_missing_deputies and deputy_file.exists(): continue

        deputy_url = deputy_api_url(deputies_url, deputy['meta']['username'])
        deputy_json = requests.get(deputy_url, proxies=proxies).json() # Request to abgeordnetenwatch.de!
        deputy_file.write_text(json.dumps(deputy_json))
        success.append(deputy_file.name)
        
        questions = deputy_json['profile']['questions']
        for q, question in enumerate(reversed(questions)): # Oldest question first

            question_infix, question_suffix = question_file_name_parts(q, question)
            url_filename = '_'.join([deputy_prefix, question_infix, question_suffix]) + '.url'
            url_file = corpus_dir / url_filename
            url_file.write_text(question['url'])
            success.append(url_file.name)   
        
    except Exception as exception:
        failure.append((deputy_file.name, exception))

    finally:
        print('\r{} of {}. {} files successfully created. {} files failed. Latest: {:30.30}'.format(
                 d+1, len(deputies), len(success), len(failure), deputy_file.stem), end='')

716 of 716. 73 files successfully created. 0 files failed. Latest: gyde-jensen_fdp               
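
The loop above calls `requests.get(...).json()` directly, so an HTTP error page or a stalled connection only surfaces as a generic exception or not at all. A more defensive fetch could look like the sketch below. `fetch_json` is a hypothetical helper, and the retry count and pause are arbitrary assumptions. The `get` callable is passed in so that the sketch runs without network access; in this notebook it would be `requests.get`.

```python
import time

def fetch_json(url, get, attempts=3, pause_seconds=1.0, **request_kwargs):
    """Fetch `url` with `get` and return the decoded JSON, retrying on errors."""
    last_error = None
    for attempt in range(attempts):
        try:
            response = get(url, **request_kwargs)
            response.raise_for_status()    # turn HTTP 4xx/5xx into an exception
            return response.json()
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(pause_seconds)  # be polite before trying again
    raise last_error

# Possible use in this notebook (not executed here):
# deputy_json = fetch_json(deputy_url, requests.get, proxies=proxies, timeout=30)
```

Passing `timeout` keeps a single unresponsive request from blocking the whole run, and the pause between attempts limits the load on the server.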

In [31]:
for deputy_filename, exception in failure:
    print('Exception while processing deputy {}:'.format(deputy_filename))
    print(exception)
    print()

if not(failure):
    print('No exception while updating the deputies and questions :-)')
    print()

print('{} files created or updated:'.format(len(success)))
print(', '.join(success))

No exception while updating the deputies and questions :-)

73 files created or updated:
frauke-petry_die-blauen.json, frauke-petry_die-blauen_Q0001_2017-07-20_verbraucherschutz.url, frauke-petry_die-blauen_Q0002_2017-07-31_demokratie-und-bürgerrechte.url, frauke-petry_die-blauen_Q0003_2017-08-27_demokratie-und-bürgerrechte.url, frauke-petry_die-blauen_Q0004_2017-08-29_bildung-und-forschung.url, frauke-petry_die-blauen_Q0005_2017-09-14_familie.url, frauke-petry_die-blauen_Q0006_2017-09-15_umwelt.url, frauke-petry_die-blauen_Q0007_2017-09-15_demokratie-und-bürgerrechte.url, frauke-petry_die-blauen_Q0008_2017-09-17_inneres-und-justiz.url, frauke-petry_die-blauen_Q0009_2017-09-20_finanzen.url, frauke-petry_die-blauen_Q0010_2017-10-12_demokratie-und-bürgerrechte.url, frauke-petry_die-blauen_Q0011_2017-10-27_finanzen.url, frauke-petry_die-blauen_Q0012_2017-11-16_demokratie-und-bürgerrechte.url, frauke-petry_die-blauen_Q0013_2017-12-06_demokratie-und-bürgerrechte.url, frauke-petry_die-blauen

### Create or update answer files (TXT)

In [32]:
success = []
failure = []

for d, deputy in enumerate(deputies):

    deputy_prefix = deputy_file_name_part(deputy)
    deputy_file = corpus_dir / (deputy_prefix + '.json')
    
    questions = json.loads(deputy_file.read_text())['profile']['questions']
    
    for q, question in enumerate(reversed(questions)): # Oldest question first

        a = -1
        try:
            question_infix, question_suffix = question_file_name_parts(q, question)
            answer_files = []

            answers = question['answers']
            for a, answer in enumerate(answers):
                answer_infix = answer_file_name_part(a, answer)
                answer_filename = '_'.join([deputy_prefix, question_infix, answer_infix, question_suffix]) + '.txt'
                answer_files.append(corpus_dir / answer_filename)
                
            # Even if there is just one new answer, we need to fetch the whole page again
            if update_only_missing_answers and all(file.exists() for file in answer_files): continue
                
            question_page = requests.get(question['url'], proxies=proxies).text # Request to abgeordnetenwatch.de!

            answer_texts = extract_answers_as_text(question_page)
                
            for file, text in zip(answer_files, answer_texts):
                file.write_text(text)
                success.append(file.name)
            
        except Exception as exception:
             failure.append((deputy_prefix, q, a, exception))

        finally:
             print('\rDeputy {} of {}. Question {} of {}. {} files created. {} files failed. Latest: {:30.30}'.format(
                 d+1, len(deputies), q+1, len(questions), len(success), len(failure), deputy_prefix), end='')

Deputy 716 of 716. Question 1 of 1. 48 files created. 0 files failed. Latest: gyde-jensen_fdp                   

In [33]:
for deputy_prefix, q, a, exception in failure:
    print('Exception while processing answer {} for question {} for deputy {}:'.format(a+1, q+1, deputy_prefix))
    print(exception)
    print()

if not(failure):
    print('No exception while updating the answers :-)')
    print()
    
print('{} files created or updated:'.format(len(success)))
print(', '.join(success))    

No exception while updating the answers :-)

48 files created or updated:
marco-bulow_parteilos_Q0001_2017-07-25_A01_2017-08-29_finanzen.txt, marco-bulow_parteilos_Q0002_2017-07-30_A01_2017-08-14_demokratie-und-bürgerrechte.txt, marco-bulow_parteilos_Q0003_2017-08-03_A01_2017-08-29_demokratie-und-bürgerrechte.txt, marco-bulow_parteilos_Q0004_2017-09-07_A01_2017-09-21_wirtschaft.txt, marco-bulow_parteilos_Q0005_2017-09-14_A01_2017-09-21_familie.txt, marco-bulow_parteilos_Q0006_2017-09-20_A01_2017-09-22_demokratie-und-bürgerrechte.txt, marco-bulow_parteilos_Q0007_2017-10-20_A01_2017-11-07_arbeit.txt, marco-bulow_parteilos_Q0008_2017-10-26_A01_2017-11-21_bildung-und-forschung.txt, marco-bulow_parteilos_Q0009_2017-11-24_A01_2017-12-12_frauen.txt, marco-bulow_parteilos_Q0010_2017-11-29_A01_2017-12-11_umwelt.txt, marco-bulow_parteilos_Q0011_2018-01-04_A01_2018-01-17_integration.txt, marco-bulow_parteilos_Q0012_2018-01-19_A01_2018-01-31_soziales.txt, marco-bulow_parteilos_Q0013_2018-02-08_A01

### Review of the corpus: counts, answers without questions, questions without answers

In [34]:
for statistic in statistics:
    value = len(list(corpus_dir.glob(statistic['pattern'])))
    statistic['after'] = value
    print('{:>42}: {:7}  (was {:7})'.format(statistic['name'], statistic['after'], statistic['before']))

                    number of deputy files:     718  (was     714)
                       number of questions:   10212  (was   10143)
                         number of answers:    7722  (was    7674)
 number of questions with multiple answers:      50  (was      50)


In [35]:
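def unique_filename_parts(pattern, name_slice):
    # Note: the result may contain duplicates. A question with several answers
    # has one .txt file per answer, and all of them share the first four name parts.
    files = list(corpus_dir.glob(pattern))
    parts = sorted(['_'.join(f.stem.split('_')[name_slice]) for f in files])
    return parts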

for d, deputy in enumerate(deputies):

    deputy_prefix = deputy_file_name_part(deputy)
    
    questions = unique_filename_parts(deputy_prefix + '*.url', slice(4))
    answered  = unique_filename_parts(deputy_prefix + '*.txt', slice(4))

    answer_without_question = [q for q in answered if q not in questions]

    print('{}: {}/{},'.format(deputy_prefix, len(answered), len(questions)), end=' ')

    if answer_without_question:
        print()
        print('The following questions are answered, but the question itself is not known:')
        print(', '.join(answer_without_question))
        print()      


alexander-graf-lambsdorff_fdp: 10/20, martin-schulz-1_spd: 0/62, michael-theurer_fdp: 8/9, fabio-de-masi_die-linke: 19/17, sarah-ryglewski_spd: 8/13, anke-domscheit-berg_die-linke: 17/18, beatrix-von-storch_afd: 8/23, konstantin-kuhle_fdp: 2/5, johannes-schraps_spd: 5/6, armin-paul-hampel_afd: 0/7, petr-bystron_afd: 0/12, waldemar-herdt_afd: 3/7, manfred-todtenhausen_fdp: 5/7, norbert-muller-4_die-linke: 12/12, alexander-krauss_cdu: 20/23, dr-juergen-martens_fdp: 4/4, alexander-gauland_afd: 0/25, steffen-kotre_afd: 1/1, frauke-petry_die-blauen: 0/18, lars-herrmann_afd: 2/3, christoph-neumann_afd: 0/4, siegbert-droese_afd: 0/3, detlev-spangenberg_afd: 4/5, torsten-herbst_fdp: 0/0, thomas-kemmerich_fdp: 1/2, stephan-brandner_afd: 132/130, christoph-de-vries_cdu: 19/16, christoph-plos_cdu: 35/36, dr-bernd-baumann_afd: 0/8, kay-gottschalk_afd: 1/7, zaklin-nastic_die-linke: 7/10, katja-suding_fdp: 14/15, dr-wieland-schinnenburg_fdp: 6/6, frank-magnitz_afd: 2/4, dr-kirsten-kappert-gonther_di

stephan-kuhn_die-grünen: 13/13, anette-kramme_spd: 10/11, jutta-krellmann_die-linke: 3/5, gunther-krichbaum_cdu: 14/15, gunter-krings_cdu: 4/6, oliver-krischer_die-grünen: 28/29, rudiger-kruse_cdu: 9/10, jens-koeppen_cdu: 14/14, dr-barbel-kofler_spd: 7/10, daniela-kolbe_spd: 8/9, markus-koob_cdu: 6/11, carsten-korber_cdu: 1/4, axel-knoerig_cdu: 7/8, maria-klein-schmeink_die-grünen: 8/9, lars-klingbeil_spd: 61/69, dr-georg-kippels_cdu: 12/15, katja-kipping_die-linke: 81/83, cansel-kiziltepe_spd: 33/38, arno-klare_spd: 9/10, roderich-kiesewetter_cdu: 61/61, katja-keul_die-grünen: 13/15, anja-karliczek_cdu: 16/17, kerstin-kassner_die-linke: 8/9, gabriele-katzmarek_spd: 19/19, volker-kauder_cdu: 16/25, stefan-kaufmann_cdu: 13/14, uwe-kekeritz_die-grünen: 9/9, ulrich-wolfgang-kelber_spd: 36/37, ralf-kapschack_spd: 9/9, alois-karl_csu: 0/5, johannes-kahrs_spd: 34/34, josip-juratovic_spd: 8/8, thomas-jurk_spd: 4/4, frank-junge_spd: 8/8, erich-irlstorfer_csu: 1/16, dieter-janecek_die-grünen: 9

heiko-hessenkemper_afd: 0/1, ulrich-oehme_afd: 1/5, jan-nolte_afd: 11/12, albrecht-glaser_afd: 8/18, uwe-schulz-2_afd: 3/6, mariana-harder-kuhnel_afd: 0/5, jurgen-pohl_afd: 2/6, robby-schlund_afd: 12/13, andreas-bleck_afd: 6/7, nicole-hochst_afd: 5/11, sebastian-munzenmaier_afd: 0/9, heiko-wildberg_afd: 5/7, johannes-huber_afd: 6/14, wolfgang-wiehle_afd: 4/9, gerold-otten_afd: 3/8, martin-hebner_afd: 0/4, hansjorg-muller_afd: 12/16, peter-boehringer_afd: 4/10, paul-podolay_afd: 4/5, martin-sichert_afd: 30/30, rainer-kraft_afd: 5/7, peter-felser_afd: 8/10, dirk-spaniel_afd: 3/6, lothar-maier-2_afd: 3/7, jurgen-braun_afd: 1/4, martin-hess_afd: 1/5, melanie-bernstein_cdu: 15/17, claudia-schmidtke_cdu: 3/5, philipp-amthor_cdu: 19/32, silvia-breher_cdu: 24/24, dietlind-tiemann_cdu: 4/7, eckhard-gnodtke_cdu: 2/3, sepp-muller_cdu: 7/8, christoph-bernstiel_cdu: 3/3, torsten-schweiger_cdu: 3/3, carsten-brodesser_cdu: 12/17, hermann-josef-tebroke_cdu: 5/11, stefan-rouenhoff_cdu: 5/7, marc-henric
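
Some deputies show more answers than questions (for example `fabio-de-masi_die-linke: 19/17`). One possible cause is that a question with several answers contributes one `.txt` file per answer, so its prefix is counted more than once. The sketch below counts answer files per question using made-up file names that follow the naming scheme of this notebook; `answers_per_question` is a hypothetical helper.

```python
from collections import Counter

def answers_per_question(answer_file_names):
    """Count answer files per question, keyed by the first four name parts."""
    return Counter('_'.join(name.split('_')[:4]) for name in answer_file_names)

# Made-up file names following the notebook's naming scheme.
names = [
    'some-deputy_spd_Q0001_2018-01-02_A01_2018-01-03_umwelt.txt',
    'some-deputy_spd_Q0001_2018-01-02_A02_2018-01-09_umwelt.txt',
    'some-deputy_spd_Q0002_2018-02-01_A01_2018-02-05_finanzen.txt',
]

counts = answers_per_question(names)
multiple = sorted(q for q, n in counts.items() if n > 1)
print(len(names), 'answer files for', len(counts), 'questions')
print('Questions with multiple answers:', multiple)
```

Counting distinct keys instead of files gives the number of answered questions, which can never exceed the number of known questions unless a question file is really missing.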

<table style="width:100%">
  <tr>
      <td colspan="1" style="text-align:left;background-color:#0071BD;color:white">
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">
            <img alt="Creative Commons License" style="border-width:0;float:left;padding-right:10pt"
                 src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" />
        </a>
        &copy; D. Speicher<br/>
        Licensed under a 
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/" style="color:white">
            CC BY-NC 4.0
        </a>.
      </td>
      <td colspan="2" style="text-align:left;background-color:#66A5D1">
          <b>Acknowledgments:</b>
          This material was prepared within the project
          <a href="http://www.b-it-center.de/b-it-programmes/teaching-material/p3ml/" style="color:black">
              P3ML
          </a> 
          which is funded by the Ministry of Education and Research of Germany (BMBF)
          under grant number 01/S17064. The authors gratefully acknowledge this support.
      </td>
  </tr>
</table>