## Generate KGTK Person Abbreviations file
This notebook relates to [KGTK Issue 260](https://github.com/usc-isi-i2/kgtk/issues/260)

Example command to run the notebook using papermill:

```
papermill abbreviate_human_labels.ipynb abbr_output.ipynb \
    -p data_folder /Users/rijulvohra/Documents/work/Novartis-ISI/global_data_folder/kgtk_edge_files \
    -p wikidata_item_filename claims.wikibase-item.tsv.gz \
    -p wikidata_label_filename labels.en.tsv.gz \
    -p wikidata_alias_filename aliases.en.tsv.gz
```

In [35]:
# Parameters
data_folder = '/data/amandeep/wikidata-20210215-dwd'
output_folder = 'derived_files_for_es'
wikidata_item_filename = 'claims.wikibase-item.tsv.gz'
wikidata_label_filename = 'labels.en.tsv.gz'
wikidata_alias_filename = 'aliases.en.tsv.gz'
output_file_name = 'derived.Q5.abbreviations.tsv.gz'

In [36]:
import os
import pandas as pd
import gzip
import shutil
from collections import defaultdict
import json
import ast
import time

In [37]:
os.environ['FILE'] = data_folder
os.environ['OUT'] = f'{data_folder}/{output_folder}'
os.environ['GRAPH_CACHE'] = '/data/amandeep/temp.wikidata-20210215-dwd/wikidata.sqlite3.db'
os.environ['CLAIMS'] = f'{data_folder}/parts/{wikidata_item_filename}'
os.environ['LABELS'] = f'{data_folder}/{wikidata_label_filename}'
os.environ['ALIASES'] = f'{data_folder}/{wikidata_alias_filename}'
os.environ['OUTPUT_FILE'] = output_file_name

### Filter Human Labels

In [12]:
!echo $CLAIMS
!echo $LABELS
!echo $OUT/human_label_edge.tsv.gz
!echo  $GRAPH_CACHE

/data/amandeep/wikidata-20210215-dwd/parts/claims.wikibase-item.tsv.gz
/data/amandeep/wikidata-20210215-dwd/labels.en.tsv.gz
/data/amandeep/wikidata-20210215-dwd/derived_files_for_es/human_label_edge.tsv.gz
/data/amandeep/temp.wikidata-20210215-dwd/wikidata.sqlite3.db


In [15]:
!kgtk --debug query --graph-cache "$GRAPH_CACHE" \
-i "$CLAIMS"  \
-i "$LABELS" \
-o "$OUT/human_label_edge.tsv.gz" \
--match 'claims: (n1)-[:P31]->(:Q5), labels: (n1)-[l]->(n2)' \
--return 'n1, l.label, n2'

[2021-03-10 15:24:48 sqlstore]: IMPORT graph directly into table graph_6 from /data/amandeep/wikidata-20210215-dwd/parts/claims.wikibase-item.tsv.gz ...
[2021-03-10 15:30:33 sqlstore]: IMPORT graph directly into table graph_7 from /data/amandeep/wikidata-20210215-dwd/labels.en.tsv.gz ...
[2021-03-10 15:31:22 query]: SQL Translation:
---------------------------------------------
  SELECT graph_6_c1."node1", graph_7_c2."label", graph_7_c2."node2"
     FROM graph_6 AS graph_6_c1, graph_7 AS graph_7_c2
     WHERE graph_6_c1."label"=?
     AND graph_6_c1."node2"=?
     AND graph_6_c1."node1"=graph_7_c2."node1"
  PARAS: ['P31', 'Q5']
---------------------------------------------
[2021-03-10 15:31:22 sqlstore]: CREATE INDEX on table graph_6 column node2 ...
[2021-03-10 15:34:57 sqlstore]: ANALYZE INDEX on table graph_6 column node2 ...
[2021-03-10 15:35:08 sqlstore]: CREATE INDEX on table graph_6 column node1 ...
[2021-03-10 15:37:09 sqlstore]: ANALYZE INDEX on table graph_6 column node1 ...
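
The logged SQL translation shows that the `--match` clause amounts to a filter on the claims table followed by a join on `node1`. The sketch below reproduces that shape with Python's built-in `sqlite3` module on a few toy edges. The table names `claims` and `labels` and all Q-ids and names in the data are made up for illustration; this is not how KGTK builds its graph cache internally.

```python
import sqlite3

# Toy edges; the Q-ids and names are invented for illustration only.
claims = [
    ("Q1", "P31", "Q5"),    # instance of human
    ("Q2", "P31", "Q515"),  # instance of something else
    ("Q3", "P31", "Q5"),    # human without an English label
]
labels = [
    ("Q1", "label", "'Ada Example'@en"),
    ("Q2", "label", "'Sample City'@en"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (node1 TEXT, label TEXT, node2 TEXT)")
conn.execute("CREATE TABLE labels (node1 TEXT, label TEXT, node2 TEXT)")
conn.executemany("INSERT INTO claims VALUES (?, ?, ?)", claims)
conn.executemany("INSERT INTO labels VALUES (?, ?, ?)", labels)

# Same shape as the logged query: filter claims on (P31, Q5), join on node1.
rows = conn.execute(
    """
    SELECT c.node1, l.label, l.node2
    FROM claims AS c, labels AS l
    WHERE c.label = ? AND c.node2 = ? AND c.node1 = l.node1
    """,
    ("P31", "Q5"),
).fetchall()
conn.close()

print(rows)
```

Only `Q1` survives: `Q2` is not a human, and `Q3` has no label row to join against.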


In [16]:
!zcat "$OUT/human_label_edge.tsv.gz" | head

node1	label	node2
Q10000001	label	'Tatyana Kolotilshchikova'@en
Q1000002	label	'Claus Hammel'@en
Q1000005	label	'Karel Matěj Čapek-Chod'@en
Q1000006	label	'Florian Eichinger'@en
Q100000811	label	'Paul Mohr'@en
Q100000814	label	'Walther Reichardt'@en
Q100000817	label	'Michael Flemisch'@en
Q100000831	label	'Otto Kramer'@en
Q100000832	label	'Maria Magdalena Lebreton'@en

gzip: stdout: Broken pipe


In [18]:
!kgtk --debug query --graph-cache "$GRAPH_CACHE" \
-i "$CLAIMS"  \
-i "$ALIASES" \
-o "$OUT/human_alias_edge.tsv.gz" \
--match 'claims: (n1)-[:P31]->(:Q5), aliases: (n1)-[l]->(n2)' \
--return 'n1, l.label, n2'

[2021-03-10 15:48:15 sqlstore]: IMPORT graph directly into table graph_8 from /data/amandeep/wikidata-20210215-dwd/aliases.en.tsv.gz ...
[2021-03-10 15:48:25 query]: SQL Translation:
---------------------------------------------
  SELECT graph_8_c2."node1", graph_8_c2."label", graph_8_c2."node2"
     FROM graph_6 AS graph_6_c1, graph_8 AS graph_8_c2
     WHERE graph_6_c1."label"=?
     AND graph_6_c1."node2"=?
     AND graph_6_c1."node1"=graph_8_c2."node1"
  PARAS: ['P31', 'Q5']
---------------------------------------------
[2021-03-10 15:48:25 sqlstore]: CREATE INDEX on table graph_8 column node1 ...
[2021-03-10 15:48:27 sqlstore]: ANALYZE INDEX on table graph_8 column node1 ...


In [19]:
!zcat "$OUT/human_alias_edge.tsv.gz" | head

node1	label	node2
Q10000001	alias	'Tatyana Serafimovna Kolotilshchikova'@en
Q1000005	alias	'Karel Matej Capek-Chod'@en
Q100000832	alias	'Marie Madeleine Lebreton'@en
Q100000832	alias	'M. M. Lebreton'@en
Q100001260	alias	'Hendrikus Johannes Rigters'@en
Q1000023	alias	'Emmi Agathe Karola Margarete Wiltraut Rupp-von Brünneck'@en
Q1000051	alias	'Joseph Christopher O\'Mahoney'@en
Q1000051	alias	'J. Christopher O\'Mahoney'@en
Q1000051	alias	'Joseph O\'Mahoney'@en

gzip: stdout: Broken pipe


### Concatenate the human label and alias files

In [20]:
!kgtk cat -i "$OUT/human_label_edge.tsv.gz" "$OUT/human_alias_edge.tsv.gz" \
/ sort -c node1 -o "$OUT/human_label_alias.tsv.gz"
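
Sorting by `node1` matters because the abbreviation step reads the file in one pass and expects all rows for a node to be adjacent. A minimal sketch of that idea in plain Python, using made-up rows in place of the real label and alias files:

```python
from itertools import groupby
from operator import itemgetter

# Toy rows standing in for the label and alias files (invented data).
label_rows = [
    ("Q2", "label", "'Bo Sample'@en"),
    ("Q1", "label", "'Ada Example'@en"),
]
alias_rows = [("Q1", "alias", "'A. Example'@en")]

# Concatenate, then sort by node1 so that all rows of a node are adjacent.
combined = sorted(label_rows + alias_rows, key=itemgetter(0))

# With sorted input, each node can be handled in a single streaming pass.
grouped = {
    node1: [(label, node2) for _, label, node2 in rows]
    for node1, rows in groupby(combined, key=itemgetter(0))
}

print(grouped)
```

Without the sort, `groupby` (like the hand-written loop in this notebook) would see the same node more than once and treat each run as a separate node.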


### Functions to generate Abbreviations

Algorithm for generating the person abbreviations:

Abbreviations for the label property:
 * If the label has two words, abbreviate the first word (e.g. C. Hammel for Claus Hammel).
 * If the label has more than two words (e.g. Michael Jeffrey Jordan), then: <br/>
    a) Abbreviate every word leading up to the last word (e.g. M. J. Jordan). <br/>
    b) Abbreviate each middle word in turn (e.g. Michael J. Jordan). <br/>
    c) If a generated abbreviation is already present as an alias, do not emit it again. <br/>

For labels with at least two words, the code also emits a surname-first form of the fully abbreviated name (e.g. Jordan, M. J.).

Abbreviations for the alias property:

An alias may contain words that do not appear in the label. New words at the start or end are left as they are; abbreviations are generated for new words in the middle.
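
As a quick illustration of the label rules above, here is a compact sketch. The function name `sketch_label_abbreviations` is hypothetical and not part of the notebook. It covers only rules a) and b), assumes single-word labels yield nothing, and leaves out the alias handling and the surname-first form that appears in the sample output.

```python
def sketch_label_abbreviations(label):
    """Return abbreviations of a label: all leading words, then each middle word."""
    words = label.split()
    if len(words) < 2:
        # Assumption of this sketch: nothing to abbreviate in a one-word label.
        return []
    initials = [word[0].upper() + "." for word in words[:-1]]
    # Rule a): abbreviate every word before the last one.
    results = [" ".join(initials + [words[-1]])]
    # Rule b): abbreviate one middle word at a time.
    for i in range(1, len(words) - 1):
        results.append(" ".join(words[:i] + [initials[i]] + words[i + 1:]))
    return results


print(sketch_label_abbreviations("Michael Jeffrey Jordan"))
print(sketch_label_abbreviations("Claus Hammel"))
```

The first call yields `M. J. Jordan` and `Michael J. Jordan`; the second yields `C. Hammel`.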

In [24]:
def generate_abbreviations(name_split,word_index):
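    '''
    Helper function to generate abbreviations of a name.
    Input: name_split: list of the words in a name
           word_index: index of the single word to abbreviate, or None to
                       abbreviate every word except the last
    Output: if word_index is None, a tuple (abbreviated name, "Last, F. M." form or None);
            otherwise the abbreviated name as a string
    '''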
    abbr_label = ''
    if word_index is None:
        for word in name_split[:-1]:
            abbr_label += word[0].upper() + '.' + ' '
        abbr_label += name_split[-1]
        if len(name_split) >= 2:
            abbr_label_end = name_split[-1] + ',' + ' '
            for word in name_split[:-1]:
                abbr_label_end += word[0].upper() + '.' + ' '
            
            return abbr_label, abbr_label_end.strip()
            
        return abbr_label, None
    else:
        for i in range(len(name_split) - 1):
            if i != word_index:
                abbr_label += name_split[i] + ' '
            else:
                abbr_label += name_split[i][0].upper() + '.' + ' '
        abbr_label += name_split[-1]
        return abbr_label

In [44]:
def abbreviate_human_labels(human_label_file,output_file):
    '''
    Traverses the concatenated human labels and aliases, creates the abbreviations for the labels and aliases
    '''
    with gzip.open(human_label_file,'rt') as file:
        prev = None
        lines_to_write = list()

        first_line = file.readline().replace('\n','').replace('\r','')
        columns = first_line.split('\t')
        prop_index = columns.index('label')
        node1_index = columns.index("node1")
#         id_index = columns.index("id")
        node2_index = columns.index("node2")
        flag = False
        st = time.time()
        for i,line in enumerate(file):
            if i%1000000 == 0:
                print(f"DONE {i} rows!")
#                 print("Time taken for {} is {}".format(i,time.time() - st))
#                 print("Previous Qnode is:",prev)
            vals = line.split('\t')
            prop_label = vals[prop_index]
            node1 = vals[node1_index]
            id_val = 'id'
            node2 = vals[node2_index]
            if node1.startswith('Q'):
                if prev is None:
                    prev = node1
                    abbr_dict = defaultdict(set)
                    alias_dict = defaultdict(set)
                    label_dict = defaultdict(list)
                    
                if not prev.strip() == node1.strip():
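                    if len(label_dict[prev]) == 0:
                        # The previous node has no label, so nothing is generated for it.
                        # Record the current row before skipping ahead; otherwise the
                        # first row of the new node would be lost.
                        prev = node1
                        value = ast.literal_eval(node2.split('@en')[0])
                        if prop_label == 'alias':
                            alias_dict[node1].add(value)
                        elif prop_label == 'label':
                            label_dict[node1].append(value)
                        continue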
                    node_label_list = label_dict[prev][0].split()
                    abbr_str, abbr_str_end = generate_abbreviations(node_label_list,None)
                    abbr_dict[prev].add(abbr_str)
                    if abbr_str_end is not None:
                        abbr_dict[prev].add(abbr_str_end)
                    if len(node_label_list) > 2:
                        for i in range(1,len(node_label_list) - 1):
                            abbr_str = generate_abbreviations(node_label_list,i)
                            abbr_dict[prev].add(abbr_str)

                    #alias
                    if prev in alias_dict:
                        for alias in alias_dict[prev]:
                            node_alias_split = alias.split()
                            # If the alias starts with the same word as the label, generate the
                            # full set of abbreviations for the alias
                            if node_alias_split[0] == node_label_list[0]:
                                abbr_str, abbr_str_end = generate_abbreviations(node_alias_split,None)
                                abbr_dict[prev].add(abbr_str)
                                if abbr_str_end is not None:
                                    abbr_dict[prev].add(abbr_str_end)
                                
                                if len(node_alias_split) > 2:
                                    for i in range(1,len(node_alias_split) - 1):
                                        abbr_str = generate_abbreviations(node_alias_split,i)
                                        abbr_dict[prev].add(abbr_str)
                                continue

                            if node_alias_split[0] != node_label_list[0]:
                                if len(node_alias_split) > 2:
                                    for i in range(1,len(node_alias_split) - 1):
                                        abbr_str = generate_abbreviations(node_alias_split,i)
                                        abbr_dict[prev].add(abbr_str)



                    #unique abbreviation edges to write
                    for lab in abbr_dict[prev]:
                        if prev in alias_dict:
                            if lab in alias_dict[prev]:
                                continue
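                        # Re-escape single quotes that ast.literal_eval unescaped on input
                        escaped_lab = lab.replace("'", "\\'")
                        lines_to_write.append(prev + '\t' + 'abbreviated_name' + '\t' + "'" + escaped_lab + "'" + '@en')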
                    prev = node1
                
                if prev.strip() == node1.strip():
                    if prop_label == 'alias':
                        alias_dict[node1].add(ast.literal_eval(node2.split('@en')[0]))
                    
                    if prop_label == 'label':
                        label_dict[node1].append(ast.literal_eval(node2.split('@en')[0]))
                        
                        
                if len(lines_to_write) > 100000:
                    with gzip.open(output_file,'a') as writer:
                        if flag == False:
                            header = first_line + '\n'
                            writer.write(header.encode('utf8'))
                            flag = True
                        
                        writer.write('\n'.join(lines_to_write).encode('utf8'))
                        writer.write('\n'.encode('utf8')) 
                        lines_to_write = list()
         
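        # NOTE: abbreviations are only generated when node1 changes, so abbr_dict[prev]
        # is still empty here for the final node in the file.
        for lab in abbr_dict[prev]:
            lines_to_write.append(prev + '\t' + 'abbreviated_name' + '\t' + "\'" + lab + "\'" +'@en')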
        #print(lines_to_write)                
        if len(lines_to_write) > 0:
            #print(lines_to_write)
            with gzip.open(output_file,'a') as writer:
                if flag == False:
                    header = first_line + '\n'
                    writer.write(header.encode('utf8'))
                    flag = True
                writer.write('\n'.join(lines_to_write).encode('utf8'))
                writer.write('\n'.encode('utf8'))               
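
The function above reads each `node2` value with `ast.literal_eval(node2.split('@en')[0])`. The sketch below shows a round trip for such language-qualified strings. It assumes single quotes inside the text are backslash-escaped, as in the alias sample (`'Joseph O\'Mahoney'@en`). The helper names `parse_lq_string` and `format_lq_string` are hypothetical and not part of KGTK.

```python
import ast


def parse_lq_string(value):
    """Split "'text'@lang" into (text, lang), undoing backslash escapes."""
    quoted, _, lang = value.strip().rpartition("@")
    return ast.literal_eval(quoted), lang


def format_lq_string(text, lang="en"):
    """Inverse of parse_lq_string: re-escape backslashes and single quotes."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'@{lang}"


# A raw field as it would be read from the file, trailing newline included.
raw = "'Joseph O\\'Mahoney'@en\n"
text, lang = parse_lq_string(raw)

print(text, lang)
print(format_lq_string(text, lang))
```

Splitting on the last `@` rather than on the literal `@en` keeps the helper usable for other language tags, and re-escaping on output keeps names that contain an apostrophe well-formed.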

In [45]:
abbreviate_human_labels(f'{data_folder}/{output_folder}/human_label_alias.tsv.gz', 
                        f'{data_folder}/{output_folder}/derived.Q5.abbreviations.1.tsv.gz')

DONE 0 rows!
DONE 1000000 rows!
DONE 2000000 rows!
DONE 3000000 rows!
DONE 4000000 rows!
DONE 5000000 rows!
DONE 6000000 rows!
DONE 7000000 rows!
DONE 8000000 rows!
DONE 9000000 rows!


In [47]:
!kgtk add-id -i "$OUT/derived.Q5.abbreviations.1.tsv.gz" \
--id-style wikidata \
-o "$OUT/$OUTPUT_FILE"

In [48]:
!zcat "$OUT/$OUTPUT_FILE" | head

node1	label	node2	id
Q10000001	abbreviated_name	'T. Kolotilshchikova'@en	Q10000001-abbreviated_name-5d05f1
Q10000001	abbreviated_name	'Kolotilshchikova, T.'@en	Q10000001-abbreviated_name-ee909d
Q10000001	abbreviated_name	'Tatyana S. Kolotilshchikova'@en	Q10000001-abbreviated_name-5106b9
Q10000001	abbreviated_name	'T. S. Kolotilshchikova'@en	Q10000001-abbreviated_name-021bee
Q10000001	abbreviated_name	'Kolotilshchikova, T. S.'@en	Q10000001-abbreviated_name-5d9384
Q1000002	abbreviated_name	'C. Hammel'@en	Q1000002-abbreviated_name-90b023
Q1000002	abbreviated_name	'Hammel, C.'@en	Q1000002-abbreviated_name-2ab003
Q1000005	abbreviated_name	'K. M. Čapek-Chod'@en	Q1000005-abbreviated_name-6123f0
Q1000005	abbreviated_name	'Čapek-Chod, K. M.'@en	Q1000005-abbreviated_name-9781fd

gzip: stdout: Broken pipe


### Clean Up Temporary Files

In [49]:
!rm "$OUT/human_label_edge.tsv.gz"
!rm "$OUT/human_alias_edge.tsv.gz"
!rm "$OUT/human_label_alias.tsv.gz"
!rm "$OUT/derived.Q5.abbreviations.1.tsv.gz"