{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"] = \"PCI_BUS_ID\"\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Link Prediction With Graph Neural Networks\n", "\n", "In this example, we will use the Cora citation network [available for download here](https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz). Nodes are papers and links represent citations among papers. The objective is to use a small sample of positive (i.e., existing) links and negative (i.e., non-existing) links to build a model that can predict whether two nodes have a citation relationship.\n", "\n", "## STEP 1: Load and Preprocess Dataset\n", "Let's begin by loading and preprocessing the dataset. By default, *ktrain* will hold out *10%* (i.e., `val_pct=0.1`) of the links for validation (along with an equal number of negative links). An additional `train_pct` of links will be taken as the training set. Here, we set `train_pct=0.1`, which is also the default." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: \n", "Type: Graph\n", "Number of nodes: 2708\n", "Number of edges: 5278\n", "Average degree: 3.8981\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "I0408 21:56:28.759857 140224628672320 utils.py:129] Note: NumExpr detected 32 cores but \"NUMEXPR_MAX_THREADS\" not set, so enforcing safe limit of 8.\n", "I0408 21:56:28.761173 140224628672320 utils.py:141] NumExpr defaulting to 8 threads.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "** Sampled 527 positive and 527 negative edges. **\n", "** Sampled 475 positive and 475 negative edges. 
**\n" ] } ], "source": [ "import ktrain\n", "from ktrain import graph as gr\n", "\n", "# load data with supervision ratio of 10%\n", "(trn, val, preproc) = gr.graph_links_from_csv(\n", " 'data/cora/cora.content', # node attributes/labels\n", " 'data/cora/cora.cites', # edge list\n", " train_pct=0.1, sep='\\t')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "original graph: 2708 nodes and 5278 edges\n" ] } ], "source": [ "print('original graph: %s nodes and %s edges' % (preproc.G.number_of_nodes(), preproc.G.number_of_edges()))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "validation graph: nodes: 2708, links:4751\n" ] } ], "source": [ "print('validation graph: nodes: %s, links:%s' % (val.graph.number_of_nodes(), val.graph.number_of_edges()))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "training graph: nodes: 2708, links:4276\n" ] } ], "source": [ "print('training graph: nodes: %s, links:%s' % (trn.graph.number_of_nodes(), trn.graph.number_of_edges()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 2: Build a Graph Neural Network for Link Prediction\n", "Next, we build a graph neural network model. *ktrain* currently supports [GraphSAGE models](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) for link prediction." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "graphsage: GraphSAGE: https://arxiv.org/pdf/1706.02216.pdf\n" ] } ], "source": [ "gr.print_link_predictors()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "link_classification: using 'ip' method to combine node embeddings into edge embeddings\n" ] } ], "source": [ "model = gr.graph_link_predictor('graphsage', trn, preproc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will wrap the model and data in a `Learner` object to facilitate training. We then set the global weight decay to 0.01." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "learner = ktrain.get_learner(model, train_data=trn, val_data=val)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/amaiya/projects/ghub/ktrain/ktrain/core.py:266: UserWarning: recompiling model to use AdamWeightDecay as opimizer with weight decay of 0.01\n", " warnings.warn('recompiling model to use AdamWeightDecay as opimizer with weight decay of %s' % (wd) )\n" ] } ], "source": [ "learner.set_weight_decay(wd=0.01)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 3: Estimate Learning Rate Using Learning-Rate-Finder" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "simulating training for different learning rates... 
this may take a few moments...\n", "Train for 29 steps\n", "Epoch 1/10\n", "29/29 [==============================] - 4s 122ms/step - loss: 1.1456 - accuracy: 0.6089\n", "Epoch 2/10\n", "29/29 [==============================] - 4s 121ms/step - loss: 1.0372 - accuracy: 0.5828\n", "Epoch 3/10\n", "29/29 [==============================] - 4s 130ms/step - loss: 1.0973 - accuracy: 0.5763\n", "Epoch 4/10\n", "29/29 [==============================] - 4s 124ms/step - loss: 1.1659 - accuracy: 0.6100\n", "Epoch 5/10\n", "29/29 [==============================] - 4s 126ms/step - loss: 0.9463 - accuracy: 0.5686\n", "Epoch 6/10\n", "29/29 [==============================] - 4s 125ms/step - loss: 0.5792 - accuracy: 0.7059\n", "Epoch 7/10\n", "29/29 [==============================] - 4s 127ms/step - loss: 0.4030 - accuracy: 0.8170\n", "Epoch 8/10\n", "29/29 [==============================] - 4s 124ms/step - loss: 0.5284 - accuracy: 0.8344\n", "Epoch 9/10\n", "29/29 [==============================] - 4s 123ms/step - loss: 0.4732 - accuracy: 0.8214\n", "Epoch 10/10\n", "29/29 [==============================] - 4s 121ms/step - loss: 0.5277 - accuracy: 0.7832\n", "\n", "\n", "done.\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learner.lr_find(show_plot=True, max_epochs=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 4: Train Model With [1Cycle](https://arxiv.org/pdf/1803.09820.pdf) Learning Rate Schedule]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "begin training using onecycle policy with max lr of 0.01...\n", "Train for 30 steps, validate for 33 steps\n", "Epoch 1/5\n", "30/30 [==============================] - 10s 334ms/step - loss: 0.8107 - accuracy: 0.6168 - val_loss: 0.5360 - val_accuracy: 0.7543\n", "Epoch 2/5\n", "30/30 [==============================] - 8s 275ms/step - loss: 0.5096 - accuracy: 0.8011 - val_loss: 0.4129 - val_accuracy: 0.8406\n", "Epoch 3/5\n", "30/30 [==============================] - 8s 282ms/step - loss: 0.3617 - accuracy: 0.8789 - val_loss: 0.4621 - val_accuracy: 0.8254\n", "Epoch 4/5\n", "30/30 [==============================] - 8s 278ms/step - loss: 0.3156 - accuracy: 0.9032 - val_loss: 0.4169 - val_accuracy: 0.8273\n", "Epoch 5/5\n", "30/30 [==============================] - 8s 282ms/step - loss: 0.2214 - accuracy: 0.9442 - val_loss: 0.4421 - val_accuracy: 0.8302\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.fit_onecycle(0.01, 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Make Predictions\n", "\n", "We will create a `Predictor` object and make predictions. The predict method accepts a `networkx` graph with node features stored as node attributes and a list of edges (tuples of node IDs into the graph). The model will predict whether each edge should exist or not. Since we are making predictions existing edges in the graph, we expect to return a 1 (i.e., a string label of 'positive') for each input." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "predictor = ktrain.get_predictor(learner.model, preproc)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['positive', 'positive', 'positive', 'positive', 'positive'],\n", " dtype='