## Step 0: Install KGTK

Run the following cell only if KGTK is not installed, for example when running in [Google Colab](https://colab.research.google.com/).

In [None]:
!pip install kgtk

**Run the following cell as well, because `gensim` is not installed together with KGTK.**

In [None]:
!pip install gensim

In [None]:
import os
from kgtk.configure_kgtk_notebooks import ConfigureKGTK
from kgtk.functions import kgtk, kypher
from gensim.models import KeyedVectors
import tempfile
import h5py, torch
from torchbiggraph.model import ComplexDiagonalDynamicOperator, DotComparator, CosComparator
import json
import pandas as pd

In [None]:
# Parameters

# Folder on local machine where to create the output and temporary folders
input_path = None
output_path = "/tmp/projects"
project_name = "tutorial-graph-embeddings"

In [None]:
files = [
 "all",
 "label",
 "alias",
 "description",
 "item",
 "qualifiers",
 "p31",
 "p279star"
]
additional_files = {
 'p31x': 'derived.P31x.tsv'
}
input_files_url = "https://github.com/usc-isi-i2/kgtk-tutorial-files/raw/main/datasets/arnold-profiled"
ck = ConfigureKGTK(files,
 input_files_url=input_files_url)

ck.configure_kgtk(input_graph_path=input_path,
 output_path=output_path,
 project_name=project_name,
 additional_files=additional_files)

In [None]:
ck.print_env_variables()

In [None]:
ck.load_files_into_cache()

In [None]:
vector_dimension = 30
vector_output_path = f"{os.environ['OUT']}/arnold.embeddings.augmented.{vector_dimension}.tsv"
vector_output_w2v_path = f"{os.environ['OUT']}/arnold.embeddings.augmented.{vector_dimension}.w2v.tsv"
os.environ['VECTOR_DIMENSION'] = str(vector_dimension)

## Compute ComplEx Graph Embeddings

In this notebook we compute graph embeddings for the `arnold` subgraph using the `kgtk graph-embeddings` command and demonstrate a few applications.

The first step is to augment the `claims.wikibase-item.tsv.gz` file with the `derived.P31x.tsv` file, which adds the occupations of humans as `instance of (P31)` values.

- `claims.wikibase-item.tsv.gz`: KGTK claims file containing only the non-literal edges
- `derived.P31x.tsv`: file with additional (computed) P31x links that add occupation as `instance of`

In [None]:
!kgtk cat -i $item \
-i $GRAPH/derived.P31x.tsv \
-o $GRAPH/claims.wikibase-item.augmented.tsv.gz
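Conceptually, the concatenation above stacks the rows of two edge files that share the same header. The following is a minimal sketch of that idea in plain Python, not the actual `kgtk cat` implementation: the helper `cat_edge_files` is hypothetical, the two toy tables are invented for illustration, and compression and column reconciliation are ignored.

```python
def cat_edge_files(*tsv_texts):
    """Concatenate TSV edge tables that share one header, keeping the header once."""
    header = None
    rows = []
    for text in tsv_texts:
        lines = text.splitlines()
        if not lines:
            continue  # skip empty inputs
        if header is None:
            header = lines[0]
        elif lines[0] != header:
            raise ValueError("headers differ; this sketch does not reconcile columns")
        rows.extend(lines[1:])
    if header is None:
        return ""
    return "\n".join([header] + rows) + "\n"


# Toy rows, invented for illustration (not taken from the arnold dataset files)
claims = "node1\tlabel\tnode2\nQ2685\tP106\tQ33999\n"
extra = "node1\tlabel\tnode2\nQ2685\tP31x\tQ33999\n"
augmented = cat_edge_files(claims, extra)
print(augmented)
```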

### Run `kgtk graph-embeddings`

The `kgtk graph-embeddings` command takes a KGTK edge file as input and computes graph embeddings of a user-specified type, producing vectors of a user-specified dimension.

The following parameters are used in this instance:

- `-op ComplEx`: compute ComplEx graph embeddings
- `--dimension 30`: desired dimension of the vectors
- `-ot kgtk`: output format - kgtk
- `--retain_temporary_data True`: retain the byproduct files, which we will use in subsequent steps
- `-T`: temporary folder where the temporary files will be stored
- `-i`: input file
- `-o`: output file
- `--log`: log file

In [None]:
kgtk(f""" --debug graph-embeddings
 -op ComplEx 
 --dimension $VECTOR_DIMENSION
 -ot kgtk
 --retain_temporary_data True
 -T $TEMP
 -w 1
 -i $GRAPH/claims.wikibase-item.augmented.tsv.gz
 -o {vector_output_path}
 --log $TEMP/ge.log.txt
 """)

#### Take a peek at the embeddings file.

In [None]:
kgtk(f"""head -i {vector_output_path}""")

### The output is in `kgtk` format. Convert it to `word2vec` format for `gensim` similarity computation


For reference: 
- [gensim](https://radimrehurek.com/gensim/)
- [word2vec](https://en.wikipedia.org/wiki/Word2vec)
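As a concrete picture of the target layout: the textual word2vec format starts with a header line holding the number of vectors and their dimension, followed by one line per key with space-separated values. The sketch below builds that text in memory from toy rows. The helper name `kgtk_rows_to_w2v`, the placeholder label `emb`, and the toy vectors are all assumptions made for illustration.

```python
def kgtk_rows_to_w2v(rows, dimension):
    """Turn (node1, label, 'v1,v2,...') rows into word2vec text format."""
    lines = ["{} {}".format(len(rows), dimension)]  # header: "<count> <dimension>"
    for qnode, _label, vector in rows:
        values = vector.strip().split(",")
        if len(values) != dimension:
            raise ValueError("{} has {} values, expected {}".format(qnode, len(values), dimension))
        lines.append(qnode + " " + " ".join(values))
    return "\n".join(lines) + "\n"


# Toy rows, invented for illustration; "emb" is a placeholder edge label
toy_rows = [
    ("Q1", "emb", "0.1,0.2,0.3"),
    ("Q2", "emb", "0.4,0.5,0.6"),
]
w2v_text = kgtk_rows_to_w2v(toy_rows, 3)
print(w2v_text)
```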

In [None]:
def convert_kgtk_to_w2v(input_path, output_path):
    """
    Convert a KGTK file (node1/label/node2) that contains embeddings to the w2v format
    """
    vector_count = 0

    # Read the file once to count the vectors: the w2v format needs the count on its first line
    with open(input_path, "r") as kgtk_file:
        next(kgtk_file)  # skip the KGTK header
        for line in kgtk_file:
            vector_count += 1

    with open(output_path, "w") as w2v_file:
        # w2v header: "<number of vectors> <dimension>" (vector_dimension is the notebook-level setting)
        w2v_file.write("{} {}\n".format(vector_count, vector_dimension))
        with open(input_path, "r") as kgtk_file:
            next(kgtk_file)  # skip the KGTK header
            for line in kgtk_file:
                items = line.split("\t")
                qnode = items[0]
                # node2 holds the comma-separated vector
                vector = items[2].replace(",", " ")
                w2v_file.write(qnode + " " + vector)

In [None]:
convert_kgtk_to_w2v(f"{vector_output_path}", f"{vector_output_w2v_path}")

### Load the vectors into `gensim`

We load the vectors into `gensim` to find similar vectors based on cosine similarity.

In [None]:
ge_vectors = KeyedVectors.load_word2vec_format(f"{vector_output_w2v_path}", binary=False)
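To make the ranking criterion concrete, here is a minimal pure-Python sketch of cosine similarity and a top-n ranking over a small dictionary of vectors. It illustrates the idea only and is not how `gensim` implements it; the function names and the toy vectors are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, query, topn=2):
    """Rank every key by cosine similarity to the query vector, best first."""
    scored = [(key, cosine(query, vec)) for key, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy 2-dimensional vectors, invented for illustration
toy_vectors = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
ranking = most_similar(toy_vectors, [1.0, 0.0], topn=2)
print(ranking)
```

Unlike this sketch, a real index over thousands of Qnodes would normalize the vectors once and use matrix operations instead of a Python loop.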

Define a function to compute the `topn` similar vectors, and get the labels and descriptions of the matching Qnodes.

In [None]:
def kgtk_most_similar(
    vectors,
    positive,
    relation_label="similarity_score",
    add_label_description=True,
    output_path=None,
    topn=25,
):
    """
    Find the topn most similar Qnodes and add the label and description for each of them

    :param vectors: vector space loaded into a gensim KeyedVectors model
    :param positive: vector(s) or Qnode(s) to find similar entities for
    :param relation_label: name of the property to be used for the output file
    :param add_label_description: boolean parameter to add label and description for matched entities
    :param output_path: path to store the output file
    :param topn: desired number of similar entities
    """
    result = []
    if add_label_description:
        fp = tempfile.NamedTemporaryFile(
            mode="w", suffix=".tsv", delete=False, encoding="utf-8"
        )
        fp.write("node1\tlabel\tnode2\n")
        for (qnode, similarity) in vectors.most_similar(positive=positive, topn=topn):
            fp.write("{}\t{}\t{}\n".format(qnode, relation_label, similarity))
        filename = fp.name
        fp.close()

        os.environ["_temp_file"] = filename

        result = !$kypher -i label -i description -i "$_temp_file" --as sim \
--match 'sim: (n1)-[]->(similarity), label: (n1)-[]->(lab), description: (n1)-[]->(des)' \
--return 'distinct n1 as node1, similarity as node2, "similarity" as label, lab as `node1;label`, des as `node1;description`' \
--order-by 'cast(similarity, float) desc'

        os.remove(filename)

    else:
        # No trailing newlines here, so the lines match those returned by the shell call above
        result.append("node1\tlabel\tnode2")
        for (qnode, similarity) in vectors.most_similar(positive=positive, topn=topn):
            result.append("{}\t{}\t{}".format(qnode, relation_label, similarity))

    if output_path:
        with open(output_path, "w") as handle:
            for line in result:
                handle.write(line)
                handle.write("\n")
    else:
        columns = result[0].split("\t")
        data = [line.split("\t") for line in result[1:]]
        return pd.DataFrame(data, columns=columns)

### Link Prediction

The following code loads the vectors for Qnodes (used as `head`) and Properties (used as `relation`).

The files it reads are produced as a byproduct of `kgtk graph-embeddings` and are stored in the folder specified by the `-T` option.

In [None]:
with open(f"{os.environ['TEMP']}/output/dynamic_rel_names.json") as f:
    relation_names_list = json.load(f)
with open(f"{os.environ['TEMP']}/output/entity_names_all_0.json") as f:
    entity_names_list = json.load(f)
prop_count = len(relation_names_list)

# operators
operator_lhs = ComplexDiagonalDynamicOperator(vector_dimension, prop_count)
operator_rhs = ComplexDiagonalDynamicOperator(vector_dimension, prop_count)
comparator = DotComparator()
cos_comparator = CosComparator()
with h5py.File(f"{os.environ['TEMP']}/output/model/model.v100.h5", "r") as hf:
    operator_state_dict_lhs = {
        "real": torch.from_numpy(hf["model/relations/0/operator/lhs/real"][...]),
        "imag": torch.from_numpy(hf["model/relations/0/operator/lhs/imag"][...]),
    }
    operator_state_dict_rhs = {
        "real": torch.from_numpy(hf["model/relations/0/operator/rhs/real"][...]),
        "imag": torch.from_numpy(hf["model/relations/0/operator/rhs/imag"][...]),
    }

operator_lhs.load_state_dict(operator_state_dict_lhs)
operator_rhs.load_state_dict(operator_state_dict_rhs)

# Load the embeddings
with h5py.File(f"{os.environ['TEMP']}/output/model/embeddings_all_0.v100.h5", "r") as hf:
    arnold_embedding = torch.from_numpy(hf["embeddings"][...])

# Map each entity and relation name to its row index in the model
entity_to_index = {entity: i for i, entity in enumerate(entity_names_list)}
rel_index = {rel: i for i, rel in enumerate(relation_names_list)}

The following function takes a `Qnode` and, optionally, a `Property` as input, and outputs a vector that should be similar to the vectors of the values of that property for the Qnode.

For example, given the Qnode `Q37079` (Tom Cruise) and the Property `P166` (award received), it outputs a vector that should be similar to the vectors of awards. We will see this in action in the subsequent examples.

In [None]:
def get_embed(head, relation=None):
    ''' Generate the embedding used to search for tail entities:
    Head entity only: obtained directly from the model
    Head + relation: obtained by applying the relation operator with torch
    :param head: subject Qnode
    :param relation: optional property
    '''
    if relation is None:
        return arnold_embedding[entity_to_index[head], :].detach().numpy()
    return operator_lhs(
        arnold_embedding[entity_to_index[head], :].view(1, vector_dimension),
        torch.tensor([rel_index[relation]])
    ).detach().numpy()[0]
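The relation operator applied in `get_embed` can be pictured as an element-wise complex multiplication between the head embedding and the relation's parameters. The sketch below shows that arithmetic in plain Python. It assumes the layout I understand PyTorch-BigGraph to use, where the first half of a vector holds the real parts and the second half the imaginary parts (so a 30-dimensional vector would hold 15 complex components); the function name and the toy numbers are invented for illustration.

```python
def complex_diagonal(embedding, rel_real, rel_imag):
    """Multiply each complex component of the embedding by the relation's component.

    Assumed layout: embedding = [real parts..., imaginary parts...].
    """
    half = len(embedding) // 2
    real, imag = embedding[:half], embedding[half:]
    # (a + bi) * (c + di) = (ac - bd) + (ad + bc)i
    out_real = [a * c - b * d for a, b, c, d in zip(real, imag, rel_real, rel_imag)]
    out_imag = [a * d + b * c for a, b, c, d in zip(real, imag, rel_real, rel_imag)]
    return out_real + out_imag


# Toy 4-dimensional head embedding: complex components (1+3j) and (2+4j)
head = [1.0, 2.0, 3.0, 4.0]
rel_real = [0.5, 1.0]   # toy relation parameters: (0.5+2j) and (1-1j)
rel_imag = [2.0, -1.0]
transformed = complex_diagonal(head, rel_real, rel_imag)
print(transformed)  # [-5.5, 6.0, 3.5, 2.0]
```

The result can be checked against Python's built-in complex numbers: `(1+3j) * (0.5+2j)` is `-5.5+3.5j` and `(2+4j) * (1-1j)` is `6+2j`.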

#### Get the vector for `Q37079` (Tom Cruise) + `P166` (award received), then find most similar entities

In [None]:
_vector = get_embed('Q37079', 'P166')
kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)

#### Get the vector for `Q170564` (Terminator 2: Judgment Day) + `P161` (cast member), then find most similar entities

In [None]:
_vector = get_embed('Q170564', 'P161')
kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)

#### Get the vector for `Q104123` (Pulp Fiction) + `P161` (cast member), then find most similar entities

In [None]:
_vector = get_embed('Q104123', 'P161')
kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)

#### Get the vector for `Q2685` (Arnold Schwarzenegger), then find most similar entities

In [None]:
_vector = get_embed('Q2685')
kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)

#### Get the vector for `Q103148` (Lahn River), then find most similar entities

In [None]:
_vector = get_embed('Q103148')
kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)

## Prepare files for Google Projector

In this section, we prepare the `vectors` and `metadata` files for the Google Embedding Projector.

We are focusing on the following types:

- `Q11424` (film)
- `Q33999` (actor)
- `Q4022` (river)
- `Q82955` (politician)

The first step is to create a file with the following columns:

1. node1 :- Qnode
2. label :- name of the property
3. node2 :- embedding vector for node1
4. node1;label :- label for node1
5. type :- `instance of` for node1
6. type;label :- label for type

In [None]:
%%time
kgtk(f""" query -i $GRAPH/claims.wikibase-item.augmented.tsv.gz 
 -i p279star 
 -i label 
 -i {vector_output_path} 
 -i $GRAPH/derived.P31x.tsv 
 --match 'item: (n1)-[]->(), 
 P31x: (n1)-[]->(c), 
 p279star: (c)-[]->(class), 
 label: (n1)-[]->(n1_label), 
 label: (class)-[]->(class_label), embeddings: (n1)-[l]->(embedding)'
 --where 'class in ["Q11424", "Q33999", "Q4022", "Q82955"]' 
 --return 'distinct n1, 
 l.label as label,
 embedding as node2,
 kgtk_lqstring_text(n1_label) as `node1;label`, 
 group_concat(distinct class) as type, 
 group_concat(distinct kgtk_lqstring_text(class_label)) as `type;label`'
 -o $TEMP/arnold.embeddings.google.projector.tsv
""")

#### Take a peek at the file

In [None]:
kgtk("""head -i $TEMP/arnold.embeddings.google.projector.tsv""")

#### Define a function to build the files required by the Google Embedding Projector

In [None]:
def build_embedding_projector_metadata(gp_embeddings_path, metadata_path, vectors_path):
    """
    Build the vectors and metadata files required by the Google Embedding Projector

    :param gp_embeddings_path: file path which has the embeddings and metadata in kgtk format
    :param metadata_path: output file path for metadata
    :param vectors_path: output file path for vectors
    """
    with open(metadata_path, "w") as metadata_file, \
            open(vectors_path, "w") as vectors_file, \
            open(gp_embeddings_path) as qnodes_file:
        metadata_file.write("tag\tqnode\ttype\ttype_label\n")
        next(qnodes_file)  # skip the KGTK header
        for line in qnodes_file:
            vals = line.split('\t')
            qnode = vals[0]
            qnode_label = vals[3]
            _type = vals[4]
            ftype_label = vals[5]
            # The projector expects tab-separated vector values
            embeddings = "\t".join(vals[2].strip().split(","))

            if qnode.startswith("Q"):
                metadata_file.write("{}\t{}\t{}\t{}\n".format(qnode_label, qnode, _type, ftype_label.strip()))
                vectors_file.write(embeddings)
                vectors_file.write('\n')

In [None]:
build_embedding_projector_metadata(f"{os.environ['TEMP']}/arnold.embeddings.google.projector.tsv",
 f"{os.environ['OUT']}/arnold.metadata.{vector_dimension}.tsv",
 f"{os.environ['OUT']}/arnold.vectors.{vector_dimension}.tsv")
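Because the metadata file carries a header row while the vectors file does not, the two files only line up if the metadata has exactly one more line than the vectors. The following is a small sanity check along those lines. The helper `check_projector_files` is hypothetical and the toy files are invented for illustration, so treat it as a sketch and not as part of KGTK.

```python
import os
import tempfile


def check_projector_files(metadata_path, vectors_path, dimension):
    """Return the number of vectors, raising ValueError if the two files disagree."""
    with open(metadata_path, encoding="utf-8") as metadata_file:
        metadata_rows = metadata_file.read().splitlines()[1:]  # drop the header row
    with open(vectors_path, encoding="utf-8") as vectors_file:
        vector_rows = vectors_file.read().splitlines()
    if len(metadata_rows) != len(vector_rows):
        raise ValueError(
            "{} metadata rows but {} vectors".format(len(metadata_rows), len(vector_rows))
        )
    for number, row in enumerate(vector_rows, start=1):
        if len(row.split("\t")) != dimension:
            raise ValueError("vector on line {} does not have {} values".format(number, dimension))
    return len(vector_rows)


# Toy files, invented for illustration
with tempfile.TemporaryDirectory() as folder:
    toy_metadata = os.path.join(folder, "metadata.tsv")
    toy_vectors = os.path.join(folder, "vectors.tsv")
    with open(toy_metadata, "w", encoding="utf-8") as f:
        f.write("tag\tqnode\ttype\ttype_label\n")
        f.write("entity one\tQ1\tQ11424\tfilm\n")
        f.write("entity two\tQ2\tQ4022\triver\n")
    with open(toy_vectors, "w", encoding="utf-8") as f:
        f.write("0.1\t0.2\t0.3\n0.4\t0.5\t0.6\n")
    checked = check_projector_files(toy_metadata, toy_vectors, dimension=3)

print(checked)
```

In this notebook, the same check could be pointed at the two files written to `$OUT`, with `dimension=vector_dimension`.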

#### Peek at the metadata file

In [None]:
kgtk(f"""head -i $OUT/arnold.metadata.{vector_dimension}.tsv""")

#### Peek at the vectors file

In [None]:
!head -2 $OUT/arnold.vectors.$VECTOR_DIMENSION.tsv

## Google embedding projector
- Open https://projector.tensorflow.org
- Load the vectors and metadata files using the Load button
- Configure the visualization

Here we searched for arnold in the panel on the right, and we see the closest vectors as well as the cluster it belongs to:
![Google embedding projector](https://raw.githubusercontent.com/usc-isi-i2/kgtk-notebooks/main/tutorial/assets/gp-arnold.png "Google embedding projector")

#### PCA visualization of the embeddings, colored by `instance of`

![PCA Color by Type](https://raw.githubusercontent.com/usc-isi-i2/kgtk-notebooks/main/tutorial/assets/gp-color-map-types.png "PCA Color by Type")