{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6112aef3-3eee-4046-b39b-d469d114f092",
   "metadata": {},
   "source": [
    "## Step 0: Install KGTK"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "683a914d-cd73-4902-b1fc-aaf79f94fa6c",
   "metadata": {},
   "source": [
    "Only run the following cell if KGTK is not installed.\n",
    " For example, if running in [Google Colab](https://colab.research.google.com/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60ed48b8-1a35-4ae3-a17c-97174b35c04c",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install kgtk"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7ce4502-5bfe-4343-bb59-06b82a01a8d2",
   "metadata": {},
   "source": [
    "**Run the following cell, `gensim` is not installed with kgtk**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6d3ed7d-00de-4520-b7f5-2719cb5aa18a",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install gensim"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "stuffed-forge",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from kgtk.configure_kgtk_notebooks import ConfigureKGTK\n",
    "from kgtk.functions import kgtk, kypher\n",
    "from gensim.models import KeyedVectors\n",
    "import tempfile\n",
    "import h5py, torch\n",
    "from torchbiggraph.model import ComplexDiagonalDynamicOperator, DotComparator, CosComparator\n",
    "import json\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "valued-space",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parameters\n",
    "\n",
    "# Folder on local machine where to create the output and temporary folders\n",
    "input_path = None\n",
    "output_path = \"/tmp/projects\"\n",
    "project_name = \"tutorial-graph-embeddings\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dramatic-marker",
   "metadata": {},
   "outputs": [],
   "source": [
    "files = [\n",
    "    \"all\",\n",
    "    \"label\",\n",
    "    \"alias\",\n",
    "    \"description\",\n",
    "    \"item\",\n",
    "    \"qualifiers\",\n",
    "    \"p31\",\n",
    "    \"p279star\"\n",
    "]\n",
    "additional_files = {\n",
    "    'p31x': 'derived.P31x.tsv'\n",
    "}\n",
    "input_files_url = \"https://github.com/usc-isi-i2/kgtk-tutorial-files/raw/main/datasets/arnold-profiled\"\n",
    "ck = ConfigureKGTK(files,\n",
    "                  input_files_url=input_files_url)\n",
    "\n",
    "ck.configure_kgtk(input_graph_path=input_path,\n",
    "                  output_path=output_path,\n",
    "                  project_name=project_name,\n",
    "                 additional_files=additional_files)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "operational-boost",
   "metadata": {},
   "outputs": [],
   "source": [
    "ck.print_env_variables()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "typical-mustang",
   "metadata": {},
   "outputs": [],
   "source": [
    "ck.load_files_into_cache()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "appointed-holly",
   "metadata": {},
   "outputs": [],
   "source": [
    "vector_dimension = 30\n",
    "vector_output_path = f\"{os.environ['OUT']}/arnold.embeddings.augmented.{vector_dimension}.tsv\"\n",
    "vector_output_w2v_path = f\"{os.environ['OUT']}/arnold.embeddings.augmented.{vector_dimension}.w2v.tsv\"\n",
    "os.environ['VECTOR_DIMENSION'] = str(vector_dimension)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "sensitive-shade",
   "metadata": {},
   "source": [
    "## Compute ComplEx Graph Embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "specified-garbage",
   "metadata": {},
   "source": [
    "In this notebook we will compute graph embeddings using `kgtk graph-embeddings` command for the `arnold` subgraph and demonstrate a few applications.\n",
    "\n",
    "First step is to augment the `claims.wikibase-item.tsv.gz` file with `derived.P31x.tsv` file which contains occupations for humans as `instance of (P31)`\n",
    "\n",
    "- `claims.wikibase-item.tsv.gz`: KGTK claims file non literal edges only\n",
    "- `derived.P31x.tsv`: file with additional P31x links, adding occupation as `instance of` (computed)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "confused-motorcycle",
   "metadata": {},
   "outputs": [],
   "source": [
    "!kgtk cat -i $item \\\n",
    "-i $GRAPH/derived.P31x.tsv \\\n",
    "-o $GRAPH/claims.wikibase-item.augmented.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "associate-italic",
   "metadata": {},
   "source": [
    "### Run `kgtk graph-embeddings`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ranging-guide",
   "metadata": {},
   "source": [
    "The `kgtk graph-embeddings` command takes as input a KGTK edge file and computes graph embeddings of user specified type, producing vectors of user specified dimensions.\n",
    "\n",
    "The following parameters are used in this instance:\n",
    "\n",
    "- `-op ComplEx`: compute ComplEx graph embeddings\n",
    "- `--dimension 30`: desired dimension of the vectors\n",
    "- `-ot kgtk`: output format - kgtk\n",
    "- `--retain_temporary_data True`: retain the byproduct files, which we will use in subsequent steps\n",
    "- `-T <folder path>`: temporary folder where the temporary files will be stored\n",
    "- `-i <file>`: input file\n",
    "- `-o <file>`: output file\n",
    "- `--log <file>`: log file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "stone-monaco",
   "metadata": {},
   "outputs": [],
   "source": [
    "kgtk(f\"\"\" --debug graph-embeddings\n",
    "            -op ComplEx \n",
    "            --dimension $VECTOR_DIMENSION\n",
    "            -ot kgtk\n",
    "            --retain_temporary_data True\n",
    "            -T $TEMP\n",
    "            -w 1\n",
    "            -i $GRAPH/claims.wikibase-item.augmented.tsv.gz\n",
    "            -o {vector_output_path}\n",
    "            --log $TEMP/ge.log.txt\n",
    "    \"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "accepted-merchandise",
   "metadata": {},
   "source": [
    "#### Take a peek at the embeddings file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "extended-preference",
   "metadata": {},
   "outputs": [],
   "source": [
    "kgtk(f\"\"\"head -i {vector_output_path}\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "pretty-summary",
   "metadata": {},
   "source": [
    "### The output is in `kgtk` format. Convert it to `word2vec` format for `gensim` similarity computation\n",
    "\n",
    "\n",
    "For reference: \n",
    "- [gensim](https://radimrehurek.com/gensim/)\n",
    "- [word2vec](https://en.wikipedia.org/wiki/Word2vec)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "packed-minority",
   "metadata": {},
   "outputs": [],
   "source": [
    "def convert_kgtk_to_w2v(input_path, output_path):\n",
    "    \"\"\"\n",
    "    Convert a KGTK file (node1/label/node2) that contains embeddings to the w2v format\n",
    "    \"\"\"\n",
    "    vector_count = 0\n",
    "\n",
    "    # Read the file once to count the lines as we need to put them at the top of the w2v file\n",
    "    with open(input_path, \"r\") as kgtk_file:\n",
    "        next(kgtk_file)\n",
    "        for line in kgtk_file:\n",
    "            vector_count += 1\n",
    "        kgtk_file.close()\n",
    "\n",
    "    with open(output_path, \"w\") as w2v_file:\n",
    "        w2v_file.write(\"{} {}\\n\".format(vector_count, vector_dimension))\n",
    "        with open(input_path, \"r\") as kgtk_file:\n",
    "            next(kgtk_file)\n",
    "            for line in kgtk_file:\n",
    "                items = line.split(\"\\t\")\n",
    "                qnode = items[0]\n",
    "                vector = items[2].replace(\",\", \" \")\n",
    "                w2v_file.write(qnode + \" \" + vector)\n",
    "            kgtk_file.close()\n",
    "        w2v_file.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "pursuant-beach",
   "metadata": {},
   "outputs": [],
   "source": [
    "convert_kgtk_to_w2v(f\"{vector_output_path}\", f\"{vector_output_w2v_path}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "optimum-anchor",
   "metadata": {},
   "source": [
    "### Load the vectors into `gensim`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fossil-cinema",
   "metadata": {},
   "source": [
    "To find similar vectors based on cosine similarity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "suffering-myrtle",
   "metadata": {},
   "outputs": [],
   "source": [
    "ge_vectors = KeyedVectors.load_word2vec_format(f\"{vector_output_w2v_path}\", binary=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "hollywood-likelihood",
   "metadata": {},
   "source": [
    "Define a function to compute the `topn` similar vectors, and get the labels and descriptions of the matching Qnodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "photographic-pearl",
   "metadata": {},
   "outputs": [],
   "source": [
    "def kgtk_most_similar(\n",
    "    vectors,\n",
    "    positive,\n",
    "    relation_label=\"similarity_score\",\n",
    "    add_label_description=True,\n",
    "    output_path=None,\n",
    "    topn=25,\n",
    "):\n",
    "    \"\"\"\n",
    "    find topn similar Qnodes, add label and decription for the Qnodes\n",
    "    \n",
    "    :param vectors: vector space loaded into gensim KeyedVectors model\n",
    "    :param positive: vector(s) or Qnode(s) to find similar entities for\n",
    "    :param relation_label: name of the property to be used for the output file\n",
    "    :param add_label_description: boolean parameter to add label and description for matched entities\n",
    "    :param output_path: path to store the output file\n",
    "    :param topn: desirednumber of similar entities\n",
    "    \"\"\"\n",
    "    result = []\n",
    "    if add_label_description:\n",
    "        fp = tempfile.NamedTemporaryFile(\n",
    "            mode=\"w\", suffix=\".tsv\", delete=False, encoding=\"utf-8\"\n",
    "        )\n",
    "        fp.write(\"node1\\tlabel\\tnode2\\n\")\n",
    "        for (qnode, similarity) in vectors.most_similar(positive=positive, topn=topn):\n",
    "            fp.write(\"{}\\t{}\\t{}\\n\".format(qnode, relation_label, similarity))\n",
    "        filename = fp.name\n",
    "        fp.close()\n",
    "\n",
    "        os.environ[\"_temp_file\"] = filename\n",
    "\n",
    "        result = !$kypher -i label -i description -i \"$_temp_file\" --as sim \\\n",
    "--match 'sim: (n1)-[]->(similarity), label: (n1)-[]->(lab), description: (n1)-[]->(des)' \\\n",
    "--return 'distinct n1 as node1, similarity as node2, \"similarity\" as label, lab as `node1;label`, des as `node1;description`' \\\n",
    "--order-by 'cast(similarity, float) desc' \n",
    "        \n",
    "        os.remove(filename)\n",
    "        \n",
    "    else:\n",
    "        result.append(\"node1\\tlabel\\tnode2\\n\")\n",
    "        for (qnode, similarity) in vectors.most_similar(positive=positive, topn=topn):\n",
    "            result.append(\"{}\\t{}\\t{}\\n\".format(qnode, relation_label, similarity))\n",
    "\n",
    "    if output_path:\n",
    "        handle = open(output_path, \"w\")\n",
    "        for line in result:\n",
    "            handle.write(line)\n",
    "            handle.write(\"\\n\")\n",
    "        handle.close()\n",
    "    else:\n",
    "        columns = result[0].split(\"\\t\")\n",
    "        data = []\n",
    "        for line in result[1:]:\n",
    "            data.append(line.split(\"\\t\"))\n",
    "        return pd.DataFrame(data, columns=columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "improving-candle",
   "metadata": {},
   "source": [
    "### Link Prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "assisted-british",
   "metadata": {},
   "source": [
    "The following code reads the vectors for Qnodes as `head` and Properties as `relation`.\n",
    "\n",
    "The files used in the code are produced by `kgtk graph-embeddings` code as a byproduct, in the folder specified by the `-T` option"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "spoken-association",
   "metadata": {},
   "outputs": [],
   "source": [
    "relation_names_list = json.load(open(f\"{os.environ['TEMP']}/output/dynamic_rel_names.json\"))\n",
    "entity_names_list = json.load(open(f\"{os.environ['TEMP']}/output/entity_names_all_0.json\"))\n",
    "prop_count = len(relation_names_list)\n",
    "\n",
    "# operators\n",
    "operator_lhs = ComplexDiagonalDynamicOperator(vector_dimension, prop_count)\n",
    "operator_rhs = ComplexDiagonalDynamicOperator(vector_dimension, prop_count)\n",
    "comparator = DotComparator()\n",
    "cos_comparator = CosComparator()\n",
    "with h5py.File(f\"{os.environ['TEMP']}/output/model/model.v100.h5\", \"r\") as hf:\n",
    "    operator_state_dict_lhs = {\n",
    "        \"real\": torch.from_numpy(hf[\"model/relations/0/operator/lhs/real\"][...]),\n",
    "        \"imag\": torch.from_numpy(hf[\"model/relations/0/operator/lhs/imag\"][...]),\n",
    "    }\n",
    "    operator_state_dict_rhs = {\n",
    "        \"real\": torch.from_numpy(hf[\"model/relations/0/operator/rhs/real\"][...]),\n",
    "        \"imag\": torch.from_numpy(hf[\"model/relations/0/operator/rhs/imag\"][...]),\n",
    "    }\n",
    "    \n",
    "operator_lhs.load_state_dict(operator_state_dict_lhs)\n",
    "operator_rhs.load_state_dict(operator_state_dict_rhs)\n",
    "\n",
    "# Load the embeddings\n",
    "with h5py.File(f\"{os.environ['TEMP']}/output/model/embeddings_all_0.v100.h5\", \"r\") as hf:\n",
    "    arnold_embedding = torch.from_numpy(hf[\"embeddings\"][...])\n",
    "\n",
    "\n",
    "entity_to_index = {}\n",
    "for i, entity in enumerate(entity_names_list):\n",
    "    entity_to_index[entity] = i\n",
    "    \n",
    "\n",
    "rel_index = {}\n",
    "for i, rel in enumerate(relation_names_list):\n",
    "    rel_index[rel] = i"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "seeing-attitude",
   "metadata": {},
   "source": [
    "The following function takes as input a `Qnode` and a `Property`, and outputs a vector which should be similar to the value of the relation.\n",
    "\n",
    "For example, Qnode: `Q37079` = Tom Cruise, Property: `P166` = awards received and output a vector similar to awards. We will see this equation in action in the subsequent examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "retired-attraction",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_embed(head, relation=None):\n",
    "    ''' This function generate the embeddings for the tail entities:\n",
    "            Head entities: Obtained from the model\n",
    "            Head + relation: Obtained using torch\n",
    "        :param head: subject Qnode\n",
    "        :param relation: optional property\n",
    "    '''\n",
    "    if relation is None:\n",
    "        return arnold_embedding[entity_to_index[head], :].detach().numpy()\n",
    "    return  operator_lhs(\n",
    "                arnold_embedding[entity_to_index[head], :].view(1, vector_dimension),\n",
    "                torch.tensor([rel_index[relation]])\n",
    "            ).detach().numpy()[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "retired-corrections",
   "metadata": {},
   "source": [
    "#### Get the vector for `Q37079` (Tom Cruise) + `P166` (award received), then find most similar entities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "molecular-supplement",
   "metadata": {},
   "outputs": [],
   "source": [
    "_vector = get_embed('Q37079', 'P166')\n",
    "kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "colonial-bolivia",
   "metadata": {},
   "source": [
    "#### Get the vector for `Q170564` (Terminator 2: Judgement Day) + `P161` (cast member), then find most similar entities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "armed-federation",
   "metadata": {},
   "outputs": [],
   "source": [
    "_vector = get_embed('Q170564', 'P161')\n",
    "kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "driving-insight",
   "metadata": {},
   "source": [
    "#### Get the vector for `Q104123` (Pulp Fiction) + `P161` (cast member), then find most similar entities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "helpful-secret",
   "metadata": {},
   "outputs": [],
   "source": [
    "_vector = get_embed('Q104123', 'P161')\n",
    "kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "indian-joseph",
   "metadata": {},
   "source": [
    "#### Get the vector for `Q2685` (Arnold Schwarzenegger), then find most similar entities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "burning-green",
   "metadata": {},
   "outputs": [],
   "source": [
    "_vector = get_embed('Q2685')\n",
    "kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "romance-adrian",
   "metadata": {},
   "source": [
    "#### Get the vector for `Q103148` (Lahn River), then find most similar entities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "diverse-museum",
   "metadata": {},
   "outputs": [],
   "source": [
    "_vector = get_embed('Q103148')\n",
    "kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "beautiful-branch",
   "metadata": {},
   "source": [
    "## Prepare files for Google Projector"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "together-paper",
   "metadata": {},
   "source": [
    "In this section, we will prepare `vectors` and `metadata` files for google projector.\n",
    "\n",
    "We are focusing on the following types:\n",
    "\n",
    "- `Q11424` (film)\n",
    "- `Q33999` (actor)\n",
    "- `Q4022` (river)\n",
    "- `Q82955` (politician)\n",
    "\n",
    "First step is to create a file with the following information ,\n",
    "\n",
    "1. node1 :- Qnode\n",
    "2. label :- name of the property\n",
    "3. node2 :- embedding vector for node1\n",
    "4. node1;label :- label for node1\n",
    "5. type :- `instance of` for node1\n",
    "6. type;label :- label for type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "simplified-implementation",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "kgtk(f\"\"\" query -i $GRAPH/claims.wikibase-item.augmented.tsv.gz \n",
    "         -i p279star \n",
    "         -i label \n",
    "         -i {vector_output_path} \n",
    "         -i $GRAPH/derived.P31x.tsv \n",
    "         --match 'item: (n1)-[]->(), \n",
    "             P31x: (n1)-[]->(c), \n",
    "             p279star: (c)-[]->(class), \n",
    "             label: (n1)-[]->(n1_label), \n",
    "             label: (class)-[]->(class_label), embeddings: (n1)-[l]->(embedding)'\n",
    "        --where 'class in [\"Q11424\", \"Q33999\", \"Q4022\", \"Q82955\"]' \n",
    "        --return 'distinct n1, \n",
    "                  l.label as label,\n",
    "                  embedding as node2,\n",
    "                  kgtk_lqstring_text(n1_label) as `node1;label`, \n",
    "                  group_concat(distinct class) as type, \n",
    "                  group_concat(distinct kgtk_lqstring_text(class_label)) as `type;label`'\n",
    "        -o $TEMP/arnold.embeddings.google.projector.tsv\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "civilian-metro",
   "metadata": {},
   "source": [
    "#### Take a peek at the file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "raising-drink",
   "metadata": {},
   "outputs": [],
   "source": [
    "kgtk(\"\"\"head -i $TEMP/arnold.embeddings.google.projector.tsv\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "respective-employer",
   "metadata": {},
   "source": [
    "#### Define a function to build the required files for google projector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "formed-metabolism",
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_embedding_projector_metadata(gp_embeddings_path, metadata_path, vectors_path):\n",
    "    \"\"\"\n",
    "    build the vector and metadata files required for google projector\n",
    "    \n",
    "    :param gp_embeddings_path: file path which has the embeddings and metadata in kgtk format\n",
    "    :param metadata_path: output file path for metadata\n",
    "    :param vectors_path: output file path for vectors\n",
    "    \"\"\"\n",
    "    metadata_file = open(metadata_path, \"w\")\n",
    "    metadata_file.write(\"tag\\tqnode\\ttype\\ttype_label\\n\")\n",
    "\n",
    "    vectors_file = open(vectors_path, \"w\")\n",
    "\n",
    "    with open(gp_embeddings_path) as qnodes_file:\n",
    "        next(qnodes_file)\n",
    "        for line in qnodes_file:\n",
    "            vals = line.split('\\t')\n",
    "            qnode = vals[0]\n",
    "            qnode_label = vals[3]\n",
    "            _type = vals[4] \n",
    "            ftype_label = vals[5]\n",
    "            embeddings = \"\\t\".join(vals[2].strip().split(\",\"))\n",
    "\n",
    "            if qnode.startswith(\"Q\"):\n",
    "                metadata_file.write(\"{}\\t{}\\t{}\\t{}\\n\".format(qnode_label, qnode, _type, ftype_label.strip()))\n",
    "                vectors_file.write(embeddings)\n",
    "                vectors_file.write('\\n')\n",
    "\n",
    "    metadata_file.close()\n",
    "    vectors_file.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "pointed-lunch",
   "metadata": {},
   "outputs": [],
   "source": [
    "build_embedding_projector_metadata(f\"{os.environ['TEMP']}/arnold.embeddings.google.projector.tsv\",\n",
    "                                  f\"{os.environ['OUT']}/arnold.metadata.{vector_dimension}.tsv\",\n",
    "                                  f\"{os.environ['OUT']}/arnold.vectors.{vector_dimension}.tsv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "moved-mercy",
   "metadata": {},
   "source": [
    "#### Peek at the metadata file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "acoustic-oliver",
   "metadata": {},
   "outputs": [],
   "source": [
    "kgtk(f\"\"\"head -i $OUT/arnold.metadata.{vector_dimension}.tsv\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "alpine-estonia",
   "metadata": {},
   "source": [
    "#### Peek at the vectors file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "patient-mozambique",
   "metadata": {},
   "outputs": [],
   "source": [
    "!head -2 $OUT/arnold.vectors.$VECTOR_DIMENSION.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "under-angola",
   "metadata": {},
   "source": [
    "## Google embedding projector\n",
    "- open https://projector.tensorflow.org\n",
    "- Load the vect files using the load button\n",
    "- configure the visualization\n",
    "\n",
    "Here we searched on the right for arnold, and we see the closest vecotrs as well as the cluster where it belongs:\n",
    "![Google embedding projector](https://raw.githubusercontent.com/usc-isi-i2/kgtk-notebooks/main/tutorial/assets/gp-arnold.png \"Google embedding projector\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "rolled-falls",
   "metadata": {},
   "source": [
    "#### PCA visualization of the embeddings, colored by `instance of`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "respective-buying",
   "metadata": {},
   "source": [
    "![PCA Color by Type](https://raw.githubusercontent.com/usc-isi-i2/kgtk-notebooks/main/tutorial/assets/gp-color-map-types.png \"PCA Color by Type\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9c0fc0bb-5dc4-4bdd-baaf-8a617f71348c",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "kgtk-env",
   "language": "python",
   "name": "kgtk-env"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}