{ "cells": [ { "cell_type": "markdown", "id": "6112aef3-3eee-4046-b39b-d469d114f092", "metadata": {}, "source": [ "## Step 0: Install KGTK" ] }, { "cell_type": "markdown", "id": "683a914d-cd73-4902-b1fc-aaf79f94fa6c", "metadata": {}, "source": [ "Only run the following cell if KGTK is not installed,\n", "for example when running in [Google Colab](https://colab.research.google.com/)." ] }, { "cell_type": "code", "execution_count": null, "id": "60ed48b8-1a35-4ae3-a17c-97174b35c04c", "metadata": {}, "outputs": [], "source": [ "!pip install kgtk" ] }, { "cell_type": "markdown", "id": "e7ce4502-5bfe-4343-bb59-06b82a01a8d2", "metadata": {}, "source": [ "**Run the following cell: `gensim` is not installed with KGTK.**" ] }, { "cell_type": "code", "execution_count": null, "id": "f6d3ed7d-00de-4520-b7f5-2719cb5aa18a", "metadata": {}, "outputs": [], "source": [ "!pip install gensim" ] }, { "cell_type": "code", "execution_count": null, "id": "stuffed-forge", "metadata": {}, "outputs": [], "source": [ "import os\n", "from kgtk.configure_kgtk_notebooks import ConfigureKGTK\n", "from kgtk.functions import kgtk, kypher\n", "from gensim.models import KeyedVectors\n", "import tempfile\n", "import h5py, torch\n", "from torchbiggraph.model import ComplexDiagonalDynamicOperator, DotComparator, CosComparator\n", "import json\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "id": "valued-space", "metadata": {}, "outputs": [], "source": [ "# Parameters\n", "\n", "# Folder on the local machine in which to create the output and temporary folders\n", "input_path = None\n", "output_path = \"/tmp/projects\"\n", "project_name = \"tutorial-graph-embeddings\"" ] }, { "cell_type": "code", "execution_count": null, "id": "dramatic-marker", "metadata": {}, "outputs": [], "source": [ "files = [\n", "    \"all\",\n", "    \"label\",\n", "    \"alias\",\n", "    \"description\",\n", "    \"item\",\n", "    \"qualifiers\",\n", "    \"p31\",\n", "    \"p279star\"\n", "]\n", "additional_files = {\n", " 
'p31x': 'derived.P31x.tsv'\n", "}\n", "input_files_url = \"https://github.com/usc-isi-i2/kgtk-tutorial-files/raw/main/datasets/arnold-profiled\"\n", "ck = ConfigureKGTK(files,\n", "                   input_files_url=input_files_url)\n", "\n", "ck.configure_kgtk(input_graph_path=input_path,\n", "                  output_path=output_path,\n", "                  project_name=project_name,\n", "                  additional_files=additional_files)" ] }, { "cell_type": "code", "execution_count": null, "id": "operational-boost", "metadata": {}, "outputs": [], "source": [ "ck.print_env_variables()" ] }, { "cell_type": "code", "execution_count": null, "id": "typical-mustang", "metadata": {}, "outputs": [], "source": [ "ck.load_files_into_cache()" ] }, { "cell_type": "code", "execution_count": null, "id": "appointed-holly", "metadata": {}, "outputs": [], "source": [ "vector_dimension = 30\n", "vector_output_path = f\"{os.environ['OUT']}/arnold.embeddings.augmented.{vector_dimension}.tsv\"\n", "vector_output_w2v_path = f\"{os.environ['OUT']}/arnold.embeddings.augmented.{vector_dimension}.w2v.tsv\"\n", "os.environ['VECTOR_DIMENSION'] = str(vector_dimension)" ] }, { "cell_type": "markdown", "id": "sensitive-shade", "metadata": {}, "source": [ "## Compute ComplEx Graph Embeddings" ] }, { "cell_type": "markdown", "id": "specified-garbage", "metadata": {}, "source": [ "In this notebook, we compute graph embeddings for the `arnold` subgraph using the `kgtk graph-embeddings` command and demonstrate a few applications.\n", "\n", "The first step is to augment the `claims.wikibase-item.tsv.gz` file with the `derived.P31x.tsv` file, which adds the occupations of humans as `instance of (P31)` edges.\n", "\n", "- `claims.wikibase-item.tsv.gz`: KGTK claims file containing only non-literal edges\n", "- `derived.P31x.tsv`: file with additional, computed P31x links that add occupation as `instance of`" ] }, { "cell_type": "code", "execution_count": null, "id": "confused-motorcycle", "metadata": {}, "outputs": [], "source": [ "!kgtk cat -i $item \\\n", "-i $GRAPH/derived.P31x.tsv \\\n", 
"-o $GRAPH/claims.wikibase-item.augmented.tsv.gz" ] }, { "cell_type": "markdown", "id": "associate-italic", "metadata": {}, "source": [ "### Run `kgtk graph-embeddings`" ] }, { "cell_type": "markdown", "id": "ranging-guide", "metadata": {}, "source": [ "The `kgtk graph-embeddings` command takes as input a KGTK edge file and computes graph embeddings of user specified type, producing vectors of user specified dimensions.\n", "\n", "The following parameters are used in this instance:\n", "\n", "- `-op ComplEx`: compute ComplEx graph embeddings\n", "- `--dimension 30`: desired dimension of the vectors\n", "- `-ot kgtk`: output format - kgtk\n", "- `--retain_temporary_data True`: retain the byproduct files, which we will use in subsequent steps\n", "- `-T `: temporary folder where the temporary files will be stored\n", "- `-i `: input file\n", "- `-o `: output file\n", "- `--log `: log file" ] }, { "cell_type": "code", "execution_count": null, "id": "stone-monaco", "metadata": {}, "outputs": [], "source": [ "kgtk(f\"\"\" --debug graph-embeddings\n", " -op ComplEx \n", " --dimension $VECTOR_DIMENSION\n", " -ot kgtk\n", " --retain_temporary_data True\n", " -T $TEMP\n", " -w 1\n", " -i $GRAPH/claims.wikibase-item.augmented.tsv.gz\n", " -o {vector_output_path}\n", " --log $TEMP/ge.log.txt\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "accepted-merchandise", "metadata": {}, "source": [ "#### Take a peek at the embeddings file." ] }, { "cell_type": "code", "execution_count": null, "id": "extended-preference", "metadata": {}, "outputs": [], "source": [ "kgtk(f\"\"\"head -i {vector_output_path}\"\"\")" ] }, { "cell_type": "markdown", "id": "pretty-summary", "metadata": {}, "source": [ "### The output is in `kgtk` format. 
Convert it to `word2vec` format for `gensim` similarity computation\n", "\n", "\n", "For reference:\n", "- [gensim](https://radimrehurek.com/gensim/)\n", "- [word2vec](https://en.wikipedia.org/wiki/Word2vec)" ] }, { "cell_type": "code", "execution_count": null, "id": "packed-minority", "metadata": {}, "outputs": [], "source": [ "def convert_kgtk_to_w2v(input_path, output_path):\n", "    \"\"\"\n", "    Convert a KGTK file (node1/label/node2) that contains embeddings to the w2v format.\n", "    The w2v header uses the notebook-level `vector_dimension` variable.\n", "    \"\"\"\n", "    vector_count = 0\n", "\n", "    # Read the file once to count the vectors, as the count goes at the top of the w2v file\n", "    with open(input_path, \"r\") as kgtk_file:\n", "        next(kgtk_file)  # skip the KGTK header line\n", "        for line in kgtk_file:\n", "            vector_count += 1\n", "\n", "    with open(output_path, \"w\") as w2v_file:\n", "        w2v_file.write(\"{} {}\\n\".format(vector_count, vector_dimension))\n", "        with open(input_path, \"r\") as kgtk_file:\n", "            next(kgtk_file)  # skip the KGTK header line\n", "            for line in kgtk_file:\n", "                items = line.split(\"\\t\")\n", "                qnode = items[0]\n", "                # node2 is the last column, so it still carries the line's trailing newline\n", "                vector = items[2].replace(\",\", \" \")\n", "                w2v_file.write(qnode + \" \" + vector)" ] }, { "cell_type": "code", "execution_count": null, "id": "pursuant-beach", "metadata": {}, "outputs": [], "source": [ "convert_kgtk_to_w2v(f\"{vector_output_path}\", f\"{vector_output_w2v_path}\")" ] }, { "cell_type": "markdown", "id": "optimum-anchor", "metadata": {}, "source": [ "### Load the vectors into `gensim`" ] }, { "cell_type": "markdown", "id": "fossil-cinema", "metadata": {}, "source": [ "We load the vectors into `gensim` so that we can find similar vectors based on cosine similarity." ] }, { "cell_type": "code", "execution_count": null, "id": "suffering-myrtle", "metadata": {}, "outputs": [], "source": [ "ge_vectors = KeyedVectors.load_word2vec_format(f\"{vector_output_w2v_path}\", binary=False)" ] }, { "cell_type": "markdown", "id": "hollywood-likelihood", "metadata": {}, "source": [ "Define a function to compute the `topn` 
most similar vectors and get the labels and descriptions of the matching Qnodes." ] }, { "cell_type": "code", "execution_count": null, "id": "photographic-pearl", "metadata": {}, "outputs": [], "source": [ "def kgtk_most_similar(\n", "    vectors,\n", "    positive,\n", "    relation_label=\"similarity_score\",\n", "    add_label_description=True,\n", "    output_path=None,\n", "    topn=25,\n", "):\n", "    \"\"\"\n", "    Find the topn most similar Qnodes and add the label and description for each of them.\n", "\n", "    :param vectors: vector space loaded into a gensim KeyedVectors model\n", "    :param positive: vector(s) or Qnode(s) to find similar entities for\n", "    :param relation_label: name of the property to be used for the output file\n", "    :param add_label_description: boolean parameter to add label and description for matched entities\n", "    :param output_path: path to store the output file\n", "    :param topn: desired number of similar entities\n", "    \"\"\"\n", "    result = []\n", "    if add_label_description:\n", "        fp = tempfile.NamedTemporaryFile(\n", "            mode=\"w\", suffix=\".tsv\", delete=False, encoding=\"utf-8\"\n", "        )\n", "        fp.write(\"node1\\tlabel\\tnode2\\n\")\n", "        for (qnode, similarity) in vectors.most_similar(positive=positive, topn=topn):\n", "            fp.write(\"{}\\t{}\\t{}\\n\".format(qnode, relation_label, similarity))\n", "        filename = fp.name\n", "        fp.close()\n", "\n", "        os.environ[\"_temp_file\"] = filename\n", "\n", "        result = !$kypher -i label -i description -i \"$_temp_file\" --as sim \\\n", "--match 'sim: (n1)-[]->(similarity), label: (n1)-[]->(lab), description: (n1)-[]->(des)' \\\n", "--return 'distinct n1 as node1, similarity as node2, \"similarity\" as label, lab as `node1;label`, des as `node1;description`' \\\n", "--order-by 'cast(similarity, float) desc' \n", "\n", "        os.remove(filename)\n", "\n", "    else:\n", "        # No trailing newlines here, so that both branches produce newline-free lines\n", "        result.append(\"node1\\tlabel\\tnode2\")\n", "        for (qnode, similarity) in vectors.most_similar(positive=positive, topn=topn):\n", "            result.append(\"{}\\t{}\\t{}\".format(qnode, relation_label, similarity))\n", "\n", "    if output_path:\n", "        with open(output_path, \"w\") as handle:\n", "            for line in result:\n", "                handle.write(line)\n", "                handle.write(\"\\n\")\n", "    else:\n", "        columns = result[0].split(\"\\t\")\n", "        data = []\n", "        for line in result[1:]:\n", "            data.append(line.split(\"\\t\"))\n", "        return pd.DataFrame(data, columns=columns)" ] }, { "cell_type": "markdown", "id": "improving-candle", "metadata": {}, "source": [ "### Link Prediction" ] }, { "cell_type": "markdown", "id": "assisted-british", "metadata": {}, "source": [ "The following code reads the vectors for Qnodes as `head` and Properties as `relation`.\n", "\n", "The files used in the code are produced as a byproduct of the `kgtk graph-embeddings` command, in the folder specified by the `-T` option." ] }, { "cell_type": "code", "execution_count": null, "id": "spoken-association", "metadata": {}, "outputs": [], "source": [ "relation_names_list = json.load(open(f\"{os.environ['TEMP']}/output/dynamic_rel_names.json\"))\n", "entity_names_list = json.load(open(f\"{os.environ['TEMP']}/output/entity_names_all_0.json\"))\n", "prop_count = len(relation_names_list)\n", "\n", "# operators\n", "operator_lhs = ComplexDiagonalDynamicOperator(vector_dimension, prop_count)\n", "operator_rhs = ComplexDiagonalDynamicOperator(vector_dimension, prop_count)\n", "comparator = DotComparator()\n", "cos_comparator = CosComparator()\n", "with h5py.File(f\"{os.environ['TEMP']}/output/model/model.v100.h5\", \"r\") as hf:\n", "    operator_state_dict_lhs = {\n", "        \"real\": torch.from_numpy(hf[\"model/relations/0/operator/lhs/real\"][...]),\n", "        \"imag\": torch.from_numpy(hf[\"model/relations/0/operator/lhs/imag\"][...]),\n", "    }\n", "    operator_state_dict_rhs = {\n", "        \"real\": torch.from_numpy(hf[\"model/relations/0/operator/rhs/real\"][...]),\n", "        \"imag\": torch.from_numpy(hf[\"model/relations/0/operator/rhs/imag\"][...]),\n", "    }\n", "\n", 
"operator_lhs.load_state_dict(operator_state_dict_lhs)\n", "operator_rhs.load_state_dict(operator_state_dict_rhs)\n", "\n", "# Load the embeddings\n", "with h5py.File(f\"{os.environ['TEMP']}/output/model/embeddings_all_0.v100.h5\", \"r\") as hf:\n", "    arnold_embedding = torch.from_numpy(hf[\"embeddings\"][...])\n", "\n", "\n", "# Map each entity name to its row in the embedding matrix\n", "entity_to_index = {}\n", "for i, entity in enumerate(entity_names_list):\n", "    entity_to_index[entity] = i\n", "\n", "\n", "# Map each relation name to its index in the operators\n", "rel_index = {}\n", "for i, rel in enumerate(relation_names_list):\n", "    rel_index[rel] = i" ] }, { "cell_type": "markdown", "id": "seeing-attitude", "metadata": {}, "source": [ "The following function takes as input a `Qnode` and a `Property`, and outputs a vector which should be similar to the value of the relation.\n", "\n", "For example, given the Qnode `Q37079` (Tom Cruise) and the Property `P166` (award received), the function outputs a vector that should be similar to the vectors of awards. We will see this in action in the subsequent examples." ] }, { "cell_type": "code", "execution_count": null, "id": "retired-attraction", "metadata": {}, "outputs": [], "source": [ "def get_embed(head, relation=None):\n", "    ''' Generate the embedding used to find the tail entities:\n", "        Head entity only: obtained directly from the model\n", "        Head + relation: obtained by applying the relation operator using torch\n", "    :param head: subject Qnode\n", "    :param relation: optional property\n", "    '''\n", "    if relation is None:\n", "        return arnold_embedding[entity_to_index[head], :].detach().numpy()\n", "    return operator_lhs(\n", "        arnold_embedding[entity_to_index[head], :].view(1, vector_dimension),\n", "        torch.tensor([rel_index[relation]])\n", "    ).detach().numpy()[0]" ] }, { "cell_type": "markdown", "id": "retired-corrections", "metadata": {}, "source": [ "#### Get the vector for `Q37079` (Tom Cruise) + `P166` (award received), then find most similar entities" ] }, { "cell_type": "code", "execution_count": null, "id": "molecular-supplement", "metadata": {}, "outputs": [], "source": [ "_vector = get_embed('Q37079', 
'P166')\n", "kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)" ] }, { "cell_type": "markdown", "id": "colonial-bolivia", "metadata": {}, "source": [ "#### Get the vector for `Q170564` (Terminator 2: Judgment Day) + `P161` (cast member), then find most similar entities" ] }, { "cell_type": "code", "execution_count": null, "id": "armed-federation", "metadata": {}, "outputs": [], "source": [ "_vector = get_embed('Q170564', 'P161')\n", "kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)" ] }, { "cell_type": "markdown", "id": "driving-insight", "metadata": {}, "source": [ "#### Get the vector for `Q104123` (Pulp Fiction) + `P161` (cast member), then find most similar entities" ] }, { "cell_type": "code", "execution_count": null, "id": "helpful-secret", "metadata": {}, "outputs": [], "source": [ "_vector = get_embed('Q104123', 'P161')\n", "kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)" ] }, { "cell_type": "markdown", "id": "indian-joseph", "metadata": {}, "source": [ "#### Get the vector for `Q2685` (Arnold Schwarzenegger), then find most similar entities" ] }, { "cell_type": "code", "execution_count": null, "id": "burning-green", "metadata": {}, "outputs": [], "source": [ "_vector = get_embed('Q2685')\n", "kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)" ] }, { "cell_type": "markdown", "id": "romance-adrian", "metadata": {}, "source": [ "#### Get the vector for `Q103148` (Lahn River), then find most similar entities" ] }, { "cell_type": "code", "execution_count": null, "id": "diverse-museum", "metadata": {}, "outputs": [], "source": [ "_vector = get_embed('Q103148')\n", "kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)" ] }, { "cell_type": "markdown", "id": "beautiful-branch", "metadata": {}, "source": [ "## Prepare files for Google Projector" ] }, { "cell_type": "markdown", "id": "together-paper", "metadata": {}, "source": [ "In this section, we will prepare the `vectors` and `metadata` files for the Google Embedding Projector.\n", 
"\n", "We are focusing on the following types:\n", "\n", "- `Q11424` (film)\n", "- `Q33999` (actor)\n", "- `Q4022` (river)\n", "- `Q82955` (politician)\n", "\n", "The first step is to create a file with the following columns:\n", "\n", "1. `node1`: Qnode\n", "2. `label`: name of the property\n", "3. `node2`: embedding vector for node1\n", "4. `node1;label`: label for node1\n", "5. `type`: `instance of` for node1\n", "6. `type;label`: label for type" ] }, { "cell_type": "code", "execution_count": null, "id": "simplified-implementation", "metadata": {}, "outputs": [], "source": [ "%%time\n", "kgtk(f\"\"\" query -i $GRAPH/claims.wikibase-item.augmented.tsv.gz \n", "    -i p279star \n", "    -i label \n", "    -i {vector_output_path} \n", "    -i $GRAPH/derived.P31x.tsv \n", "    --match 'item: (n1)-[]->(), \n", "        P31x: (n1)-[]->(c), \n", "        p279star: (c)-[]->(class), \n", "        label: (n1)-[]->(n1_label), \n", "        label: (class)-[]->(class_label), embeddings: (n1)-[l]->(embedding)'\n", "    --where 'class in [\"Q11424\", \"Q33999\", \"Q4022\", \"Q82955\"]' \n", "    --return 'distinct n1, \n", "        l.label as label,\n", "        embedding as node2,\n", "        kgtk_lqstring_text(n1_label) as `node1;label`, \n", "        group_concat(distinct class) as type, \n", "        group_concat(distinct kgtk_lqstring_text(class_label)) as `type;label`'\n", "    -o $TEMP/arnold.embeddings.google.projector.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "id": "civilian-metro", "metadata": {}, "source": [ "#### Take a peek at the file" ] }, { "cell_type": "code", "execution_count": null, "id": "raising-drink", "metadata": {}, "outputs": [], "source": [ "kgtk(\"\"\"head -i $TEMP/arnold.embeddings.google.projector.tsv\"\"\")" ] }, { "cell_type": "markdown", "id": "respective-employer", "metadata": {}, "source": [ "#### Define a function to build the required files for the Google Embedding Projector" ] }, { "cell_type": "code", "execution_count": null, "id": "formed-metabolism", "metadata": {}, "outputs": [], "source": [ "def 
build_embedding_projector_metadata(gp_embeddings_path, metadata_path, vectors_path):\n", "    \"\"\"\n", "    Build the vector and metadata files required for the Google Embedding Projector.\n", "\n", "    :param gp_embeddings_path: file path which has the embeddings and metadata in kgtk format\n", "    :param metadata_path: output file path for metadata\n", "    :param vectors_path: output file path for vectors\n", "    \"\"\"\n", "    metadata_file = open(metadata_path, \"w\")\n", "    metadata_file.write(\"tag\\tqnode\\ttype\\ttype_label\\n\")\n", "\n", "    vectors_file = open(vectors_path, \"w\")\n", "\n", "    with open(gp_embeddings_path) as qnodes_file:\n", "        next(qnodes_file)  # skip the KGTK header line\n", "        for line in qnodes_file:\n", "            vals = line.split('\\t')\n", "            qnode = vals[0]\n", "            qnode_label = vals[3]\n", "            _type = vals[4]\n", "            ftype_label = vals[5]\n", "            # The projector expects tab-separated vector components\n", "            embeddings = \"\\t\".join(vals[2].strip().split(\",\"))\n", "\n", "            if qnode.startswith(\"Q\"):\n", "                metadata_file.write(\"{}\\t{}\\t{}\\t{}\\n\".format(qnode_label, qnode, _type, ftype_label.strip()))\n", "                vectors_file.write(embeddings)\n", "                vectors_file.write('\\n')\n", "\n", "    metadata_file.close()\n", "    vectors_file.close()" ] }, { "cell_type": "code", "execution_count": null, "id": "pointed-lunch", "metadata": {}, "outputs": [], "source": [ "build_embedding_projector_metadata(f\"{os.environ['TEMP']}/arnold.embeddings.google.projector.tsv\",\n", "                                   f\"{os.environ['OUT']}/arnold.metadata.{vector_dimension}.tsv\",\n", "                                   f\"{os.environ['OUT']}/arnold.vectors.{vector_dimension}.tsv\")" ] }, { "cell_type": "markdown", "id": "moved-mercy", "metadata": {}, "source": [ "#### Peek at the metadata file" ] }, { "cell_type": "code", "execution_count": null, "id": "acoustic-oliver", "metadata": {}, "outputs": [], "source": [ "kgtk(f\"\"\"head -i $OUT/arnold.metadata.{vector_dimension}.tsv\"\"\")" ] }, { "cell_type": "markdown", "id": "alpine-estonia", "metadata": {}, "source": [ "#### Peek at the vectors file" ] }, { "cell_type": "code", "execution_count": null, "id": 
"patient-mozambique", "metadata": {}, "outputs": [], "source": [ "!head -2 $OUT/arnold.vectors.$VECTOR_DIMENSION.tsv" ] }, { "cell_type": "markdown", "id": "under-angola", "metadata": {}, "source": [ "## Google Embedding Projector\n", "- Open https://projector.tensorflow.org\n", "- Load the vectors and metadata files using the Load button\n", "- Configure the visualization\n", "\n", "Here we searched on the right for `arnold`; we see the closest vectors as well as the cluster to which it belongs:\n", "![Google embedding projector](https://raw.githubusercontent.com/usc-isi-i2/kgtk-notebooks/main/tutorial/assets/gp-arnold.png \"Google embedding projector\")" ] }, { "cell_type": "markdown", "id": "rolled-falls", "metadata": {}, "source": [ "#### PCA visualization of the embeddings, colored by `instance of`" ] }, { "cell_type": "markdown", "id": "respective-buying", "metadata": {}, "source": [ "![PCA Color by Type](https://raw.githubusercontent.com/usc-isi-i2/kgtk-notebooks/main/tutorial/assets/gp-color-map-types.png \"PCA Color by Type\")" ] }, { "cell_type": "code", "execution_count": null, "id": "9c0fc0bb-5dc4-4bdd-baaf-8a617f71348c", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "kgtk-env", "language": "python", "name": "kgtk-env" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }