# Example Scenario 1: Generating embeddings for ConceptNet nodes

*Alice wishes to import the English subset of ConceptNet in KGTK format. She would then extract the subset of ConceptNet in which two concepts are connected by a precise semantic relation, such as `/r/Causes` or `/r/UsedFor` (as opposed to weaker relations like `/r/RelatedTo`). Text embeddings would be computed for all nodes in this subset and saved in a file called `emb.txt`.*

**Note on the expected running time:** Running this notebook takes around half an hour on a MacBook Pro laptop with macOS Catalina 10.15, a 2.3 GHz 8-core Intel Core i9 processor, a 2 TB SSD, and 64 GB of 2667 MHz DDR4 memory.

## Preparation

To run this notebook, Alice needs the ConceptNet graph file. We will work with the latest ConceptNet, v5.7.0. Presumably, this file is not present on Alice's laptop, so we need to download and unpack it first (note: Mac users might need to install `wget` first with `brew install wget`):

In [None]:
%%bash
wget https://s3.amazonaws.com/conceptnet/downloads/2019/edges/conceptnet-assertions-5.7.0.csv.gz

In [None]:
%%bash
gunzip conceptnet-assertions-5.7.0.csv.gz

## Implementation in KGTK

We will select the relevant edges from ConceptNet and sort them. Besides `/r/Causes` and `/r/UsedFor`, we keep three more relations (`/r/Synonym`, `/r/DefinedAs`, and `/r/IsA`), which the embedding generator below uses for labels, descriptions, and inheritance, respectively.

Then we compute text embeddings. For demonstration purposes, we will compute embeddings based on the first 30k edges.

In [1]:
%%bash
kgtk import-conceptnet --english_only -i conceptnet-assertions-5.7.0.csv / \
 filter -p " ; /r/Causes,/r/UsedFor,/r/Synonym,/r/DefinedAs,/r/IsA ; " / sort -c 1,2,3 \
 | head -30000 |
 kgtk text-embedding --debug --embedding-projector-metadata-path none \
 --label-properties "/r/Synonym" \
 --isa-properties "/r/IsA" \
 --description-properties "/r/DefinedAs" \
 --property-value "/r/Causes" "/r/UsedFor" \
 --has-properties "" \
 -f kgtk_format \
 --output-data-format kgtk_format \
 --use-cache \
 --model bert-large-nli-cls-token \
 > emb.txt 

100%|██████████| 19571/19571 [29:14<00:00, 11.16it/s]
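
To make the selection step more concrete, here is a rough Python sketch of what the filter pattern `" ; /r/Causes,... ; "` is understood to do: the subject and object slots are left empty (unconstrained), and an edge is kept when its relation appears in the comma-separated list. The function `filter_edges` and the toy edges are hypothetical and only for illustration; the sketch is not how KGTK implements the command.

```python
# Relations named in the filter pattern of the command above.
KEPT_RELATIONS = {"/r/Causes", "/r/UsedFor", "/r/Synonym", "/r/DefinedAs", "/r/IsA"}


def filter_edges(edges, relations=KEPT_RELATIONS):
    """Keep (node1, relation, node2) triples whose relation is in `relations`."""
    return [edge for edge in edges if edge[1] in relations]


# Hypothetical toy edges, for illustration only (not taken from ConceptNet).
toy_edges = [
    ("/c/en/rain", "/r/Causes", "/c/en/wet_ground"),
    ("/c/en/rain", "/r/RelatedTo", "/c/en/cloud"),
    ("/c/en/knife", "/r/UsedFor", "/c/en/cutting"),
]

# Sort by (node1, relation, node2), then truncate, roughly like `sort` + `head`.
kept = sorted(filter_edges(toy_edges))[:30000]
print(kept)
```

The `/r/RelatedTo` edge is dropped, while the two edges with precise relations survive.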


Let's inspect the result by printing the first 500 characters of the file:

In [2]:
!head -c500 emb.txt

node	property	value
/c/en/astragalus_glycyphyllos/n/wn/plant	text_embedding	-0.38157165,-0.021805033,0.7940887,-1.5922968,0.52496123,-0.16233969,-0.19431037,1.0408834,0.8114325,0.3559178,0.61059636,-0.24603112,0.5337883,0.4534494,-0.29937816,0.090129025,-0.30235052,-0.6983496,-1.171757,0.9471463,0.9576315,0.6795303,-1.1980538,0.65520096,-0.59407276,0.28939876,-0.6164435,-0.2264376,1.5879735,0.31625852,-0.42459768,-0.43198207,0.22300366,-0.2425214,-0.5070722,-0.08494526,-0.6393699,0.18749073,0.48

## Playing with the embeddings

What can we do with the embeddings now that we have computed them? For applications like query answering or entity resolution, we need a representation where similar concepts have similar embeddings. Let's perform a small trial on whether this is the case for our ConceptNet embeddings. 

We will use the customary metric *cosine similarity* to measure vector similarities. We invoke an existing function from the `sklearn` package in Python:

In [6]:
from sklearn.metrics.pairwise import cosine_similarity
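
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. The following is a minimal pure-Python sketch of that formula; the helper name `cosine` is ours and is not part of `sklearn`.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [2.0, 0.0]))   # same direction -> 1.0
print(cosine([1.0, 0.0], [0.0, 3.0]))   # orthogonal -> 0.0
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction -> -1.0
```

Values close to 1 indicate vectors pointing in nearly the same direction, which is what we hope to see for similar concepts.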

Let's first load all embeddings into a key-value dictionary:

In [9]:
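embeddings = {}
with open('emb.txt', 'r') as f:
    header = next(f)  # skip the "node property value" header row
    for line in f:
        # Columns are tab-separated: node, property, comma-separated vector
        node1, label, embedding = line.rstrip('\n').split('\t')
        # Convert the vector components from strings to floats
        embeddings[node1] = [float(x) for x in embedding.split(',')]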
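
Before comparing vectors, it can be worth checking that the file parsed as expected, for instance that every node received a vector of the same length. Below is a small sketch of such a check on an in-memory sample; the function `load_embeddings` and the two sample rows are hypothetical and merely mimic the `node`, `property`, `value` layout shown above.

```python
import io


def load_embeddings(lines):
    """Parse KGTK-style embedding rows into {node: [float, ...]}."""
    rows = iter(lines)
    next(rows)  # skip the header row
    result = {}
    for line in rows:
        node, _prop, value = line.rstrip("\n").split("\t")
        result[node] = [float(x) for x in value.split(",")]
    return result


# Hypothetical two-row sample in the same layout as emb.txt.
sample = io.StringIO(
    "node\tproperty\tvalue\n"
    "/c/en/a\ttext_embedding\t0.1,0.2,0.3\n"
    "/c/en/b\ttext_embedding\t-0.5,0.0,0.5\n"
)
loaded = load_embeddings(sample)

# All vectors should share one dimensionality.
dims = {len(vec) for vec in loaded.values()}
print(len(loaded), dims)
```

To run the same check on the real data, pass an open file handle for `emb.txt` instead of the sample.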

We would expect the embeddings for zero (node: `/c/en/0`) and for a plant (node: `/c/en/astragalus_alpinus/n/wn/plant`) to be fairly different, leading to a low cosine similarity:

In [11]:
emb_0 = embeddings['/c/en/0']
emb_alpinus = embeddings['/c/en/astragalus_alpinus/n/wn/plant']
# cosine_similarity returns a 1x1 matrix here; take its single entry
sim = cosine_similarity([emb_0], [emb_alpinus])[0][0]

print("Similarity between zero and the astragalus alpinus plant: %f" % sim)

Similarity between zero and the astragalus alpinus plant: 0.437055


On the other hand, we would expect the similarity between the embeddings of two countries to be much higher:

In [76]:
emb_america = embeddings['/c/en/america']
emb_argentina = embeddings['/c/en/argentina']
# cosine_similarity returns a 1x1 matrix here; take its single entry
sim = cosine_similarity([emb_america], [emb_argentina])[0][0]

print("Similarity between America and Argentina: %f" % sim)

Similarity between America and Argentina: 0.772047


Feel free to experiment further with the embedding similarity. Keep in mind that in this notebook we computed embeddings only for the first 30k edges of the filtered ConceptNet subset; to investigate the entire ConceptNet, you need to compute the embeddings on the full KG first.
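
As one possible starting point for such experiments, the sketch below ranks all other nodes by their cosine similarity to a chosen node. The helper `most_similar` and the toy two-dimensional vectors are hypothetical; the sketch uses plain Python rather than `sklearn` so that it runs on its own.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(embeddings, node, k=3):
    """Return the k nodes most similar to `node`, best match first."""
    target = embeddings[node]
    scored = [
        (other, cosine(target, vec))
        for other, vec in embeddings.items()
        if other != node
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy vectors, for illustration only.
toy_embeddings = {
    "/c/en/x": [1.0, 0.0],
    "/c/en/y": [0.9, 0.1],
    "/c/en/z": [0.0, 1.0],
}
neighbours = most_similar(toy_embeddings, "/c/en/x", k=2)
print(neighbours)
```

The same function can be applied to the dictionary loaded from `emb.txt`, for example `most_similar(embeddings, '/c/en/argentina')`, provided its vectors hold floats rather than strings.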