{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Link prediction with GCN" ] }, { "cell_type": "markdown", "metadata": { "nbsphinx": "hidden", "tags": [ "CloudRunner" ] }, "source": [ "
Run the latest release of this notebook:
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we use our implementation of the [GCN](https://arxiv.org/abs/1609.02907) algorithm to build a model that predicts citation links in the Cora dataset (see below). The problem is treated as a supervised link prediction problem on a homogeneous citation network with nodes representing papers (with attributes such as binary keyword indicators and categorical subject) and links corresponding to paper-paper citations. \n", "\n", "To address this problem, we build a model with the following architecture. First we build a two-layer GCN model that takes labeled node pairs (`citing-paper` -> `cited-paper`) corresponding to possible citation links, and outputs a pair of node embeddings for the `citing-paper` and `cited-paper` nodes of the pair. These embeddings are then fed into a link classification layer, which first applies a binary operator to those node embeddings (e.g., concatenating them) to construct the embedding of the potential link. Thus obtained link embeddings are passed through the dense link classification layer to obtain link predictions - probability for these candidate links to actually exist in the network. The entire model is trained end-to-end by minimizing the loss function of choice (e.g., binary cross-entropy between predicted link probabilities and true link labels, with true/false citation links having labels 1/0) using stochastic gradient descent (SGD) updates of the model parameters, with minibatches of 'training' links fed into the model." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "nbsphinx": "hidden", "tags": [ "CloudRunner" ] }, "outputs": [], "source": [ "# install StellarGraph if running on Google Colab\n", "import sys\n", "if 'google.colab' in sys.modules:\n", " %pip install -q stellargraph[demos]==1.2.1" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "nbsphinx": "hidden", "tags": [ "VersionCheck" ] }, "outputs": [], "source": [ "# verify that we're using the correct version of StellarGraph for this notebook\n", "import stellargraph as sg\n", "\n", "try:\n", " sg.utils.validate_notebook_version(\"1.2.1\")\n", "except AttributeError:\n", " raise ValueError(\n", " f\"This notebook requires StellarGraph version 1.2.1, but a different version {sg.__version__} is installed. Please see .\"\n", " ) from None" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import stellargraph as sg\n", "from stellargraph.data import EdgeSplitter\n", "from stellargraph.mapper import FullBatchLinkGenerator\n", "from stellargraph.layer import GCN, LinkEmbedding\n", "\n", "\n", "from tensorflow import keras\n", "from sklearn import preprocessing, feature_extraction, model_selection\n", "\n", "from stellargraph import globalvar\n", "from stellargraph import datasets\n", "from IPython.display import display, HTML\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the CORA network data" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "DataLoadingLinks" ] }, "source": [ "(See [the \"Loading from Pandas\" demo](../basics/loading-pandas.ipynb) for details on how data can be loaded.)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [ "DataLoading" ] }, "outputs": [ { "data": { "text/html": [ "The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. 
Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dataset = datasets.Cora()\n", "display(HTML(dataset.description))\n", "G, _ = dataset.load(subject_as_feature=True)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 2708, Edges: 5429\n", "\n", " Node types:\n", " paper: [2708]\n", " Features: float32 vector, length 1440\n", " Edge types: paper-cites->paper\n", "\n", " Edge types:\n", " paper-cites->paper: [5429]\n" ] } ], "source": [ "print(G.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We aim to train a link prediction model, so we need to prepare train and test sets of links, together with the corresponding graphs from which those links have been removed.\n", "\n", "We split the input graph into train and test graphs using the `EdgeSplitter` class in `stellargraph.data`. We use the train graph for training the model (a binary classifier that, given two nodes, predicts whether a link between them should exist) and the test graph for evaluating the model's performance on held-out data.\n", "Each of these graphs has the same number of nodes as the input graph, but fewer links, because some links are removed during each split and used as the positive samples for training/testing the link prediction classifier."
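, "\n", "\n", "As a quick sanity check on the sizes involved (our own arithmetic, assuming the number of sampled positive links is the fraction $p$ of the current edge count rounded down, which is an assumption about `EdgeSplitter` rather than something stated in this notebook): with $p = 0.1$ in both splits used below,\n", "\n", "$$\\lfloor 0.1 \\times 5429 \\rfloor = 542, \\qquad 5429 - 542 = 4887, \\qquad \\lfloor 0.1 \\times 4887 \\rfloor = 488, \\qquad 4887 - 488 = 4399.$$\n", "\n", "Under that assumption we would expect 542 positive test links, 488 positive train links, and 4399 links left in the graph used for training, with each positive set paired with an equal number of sampled negative links."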
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the original graph `G`, extract a randomly sampled subset of test edges (true and false citation links) and the reduced graph `G_test` with the positive test edges removed:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "** Sampled 542 positive and 542 negative edges. **\n" ] } ], "source": [ "# Define an edge splitter on the original graph G:\n", "edge_splitter_test = EdgeSplitter(G)\n", "\n", "# Randomly sample a fraction p=0.1 of all positive links, and the same number of negative links, from G, and obtain the\n", "# reduced graph G_test with the sampled links removed:\n", "G_test, edge_ids_test, edge_labels_test = edge_splitter_test.train_test_split(\n", "    p=0.1, method=\"global\", keep_connected=True\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reduced graph `G_test`, together with the test ground truth set of links (`edge_ids_test`, `edge_labels_test`), will be used for testing the model.\n", "\n", "Now repeat this procedure to obtain the training data for the model. From the reduced graph `G_test`, extract a randomly sampled subset of train edges (true and false citation links) and the reduced graph `G_train` with the positive train edges removed:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "** Sampled 488 positive and 488 negative edges. 
**\n" ] } ], "source": [ "# Define an edge splitter on the reduced graph G_test:\n", "edge_splitter_train = EdgeSplitter(G_test)\n", "\n", "# Randomly sample a fraction p=0.1 of all positive links, and the same number of negative links, from G_test, and obtain the\n", "# reduced graph G_train with the sampled links removed:\n", "G_train, edge_ids_train, edge_labels_train = edge_splitter_train.train_test_split(\n", "    p=0.1, method=\"global\", keep_connected=True\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`G_train`, together with the train ground truth set of links (`edge_ids_train`, `edge_labels_train`), will be used for training the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating the GCN link model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we create the link generators that feed the train and test link examples to the model. Each generator's `flow` method takes the pairs of nodes (`citing-paper`, `cited-paper`), together with the corresponding binary labels indicating whether those pairs represent true or false links, and supplies them to the Keras model.\n", "\n", "The number of epochs for training the model:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "epochs = 50" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For training, we create a generator on the `G_train` graph and make an iterator over the training links using the generator's `flow()` method:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using GCN (local pooling) filters...\n" ] } ], "source": [ "train_gen = FullBatchLinkGenerator(G_train, method=\"gcn\")\n", "train_flow = train_gen.flow(edge_ids_train, edge_labels_train)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using GCN (local 
pooling) filters...\n" ] } ], "source": [ "test_gen = FullBatchLinkGenerator(G_test, method=\"gcn\")\n", "# Use test_gen (built on G_test), not train_gen, so that test links are scored against the test graph\n", "test_flow = test_gen.flow(edge_ids_test, edge_labels_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can specify our machine learning model. We need a few more parameters for this:\n", "\n", " * `layer_sizes` is a list of hidden feature sizes, one per layer in the model. In this example we use two GCN layers with 16-dimensional hidden node features at each layer.\n", " * `activations` is a list of activations, one applied to each layer's output.\n", " * `dropout=0.3` specifies a 30% dropout at each layer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a GCN model as follows:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "gcn = GCN(\n", "    layer_sizes=[16, 16], activations=[\"relu\", \"relu\"], generator=train_gen, dropout=0.3\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create a Keras model, we now expose the input and output tensors of the GCN model for link prediction via the `GCN.in_out_tensors` method:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "x_inp, x_out = gcn.in_out_tensors()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The final link classification layer takes a pair of node embeddings produced by the GCN model, applies a binary operator to them to produce the corresponding link embedding (`ip` for inner product; other options for the binary operator can be seen by running a cell with `?LinkEmbedding` in it), and applies the given activation to the result:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "prediction = LinkEmbedding(activation=\"relu\", method=\"ip\")(x_out)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The predictions need to be reshaped from `(X, 1)` to `(X,)` to match the shape of the 
targets we have supplied above." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "prediction = keras.layers.Reshape((-1,))(prediction)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Stack the GCN and prediction layers into a Keras model, and specify the optimizer, loss, and metric:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "model = keras.Model(inputs=x_inp, outputs=prediction)\n", "\n", "model.compile(\n", "    optimizer=keras.optimizers.Adam(learning_rate=0.01),\n", "    loss=keras.losses.binary_crossentropy,\n", "    metrics=[\"acc\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluate the initial (untrained) model on the train and test sets:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " ['...']\n", "1/1 [==============================] - 0s 121ms/step - loss: 1.8927 - acc: 0.5000\n", " ['...']\n", "1/1 [==============================] - 0s 9ms/step - loss: 1.8621 - acc: 0.5000\n", "\n", "Train Set Metrics of the initial (untrained) model:\n", "\tloss: 1.8927\n", "\tacc: 0.5000\n", "\n", "Test Set Metrics of the initial (untrained) model:\n", "\tloss: 1.8621\n", "\tacc: 0.5000\n" ] } ], "source": [ "init_train_metrics = model.evaluate(train_flow)\n", "init_test_metrics = model.evaluate(test_flow)\n", "\n", "print(\"\\nTrain Set Metrics of the initial (untrained) model:\")\n", "for name, val in zip(model.metrics_names, init_train_metrics):\n", "    print(\"\\t{}: {:0.4f}\".format(name, val))\n", "\n", "print(\"\\nTest Set Metrics of the initial (untrained) model:\")\n", "for name, val in zip(model.metrics_names, init_test_metrics):\n", "    print(\"\\t{}: {:0.4f}\".format(name, val))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Train the model:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", 
"text": [ " ['...']\n", " ['...']\n", "Train for 1 steps, validate for 1 steps\n", "Epoch 1/50\n", "1/1 - 1s - loss: 1.7886 - acc: 0.5000 - val_loss: 1.5024 - val_acc: 0.5387\n", "Epoch 2/50\n", "1/1 - 0s - loss: 1.7260 - acc: 0.5400 - val_loss: 0.6822 - val_acc: 0.6070\n", "Epoch 3/50\n", "1/1 - 0s - loss: 0.8526 - acc: 0.5953 - val_loss: 0.7401 - val_acc: 0.5729\n", "Epoch 4/50\n", "1/1 - 0s - loss: 0.7397 - acc: 0.5666 - val_loss: 0.7479 - val_acc: 0.5849\n", "Epoch 5/50\n", "1/1 - 0s - loss: 0.7334 - acc: 0.5799 - val_loss: 0.6777 - val_acc: 0.6172\n", "Epoch 6/50\n", "1/1 - 0s - loss: 0.6413 - acc: 0.6404 - val_loss: 0.6981 - val_acc: 0.6375\n", "Epoch 7/50\n", "1/1 - 0s - loss: 0.7289 - acc: 0.6568 - val_loss: 0.6576 - val_acc: 0.6448\n", "Epoch 8/50\n", "1/1 - 0s - loss: 0.6367 - acc: 0.6568 - val_loss: 0.6846 - val_acc: 0.6338\n", "Epoch 9/50\n", "1/1 - 0s - loss: 0.6111 - acc: 0.6639 - val_loss: 0.6850 - val_acc: 0.6384\n", "Epoch 10/50\n", "1/1 - 0s - loss: 0.5818 - acc: 0.6855 - val_loss: 0.6667 - val_acc: 0.6513\n", "Epoch 11/50\n", "1/1 - 0s - loss: 0.5721 - acc: 0.6916 - val_loss: 0.6304 - val_acc: 0.6688\n", "Epoch 12/50\n", "1/1 - 0s - loss: 0.5422 - acc: 0.7551 - val_loss: 0.6461 - val_acc: 0.7048\n", "Epoch 13/50\n", "1/1 - 0s - loss: 0.5791 - acc: 0.7695 - val_loss: 0.6710 - val_acc: 0.7002\n", "Epoch 14/50\n", "1/1 - 0s - loss: 0.4987 - acc: 0.7838 - val_loss: 0.6632 - val_acc: 0.7131\n", "Epoch 15/50\n", "1/1 - 0s - loss: 0.5537 - acc: 0.7920 - val_loss: 0.7022 - val_acc: 0.7168\n", "Epoch 16/50\n", "1/1 - 0s - loss: 0.5463 - acc: 0.7807 - val_loss: 0.7353 - val_acc: 0.7251\n", "Epoch 17/50\n", "1/1 - 0s - loss: 0.5315 - acc: 0.7910 - val_loss: 0.7022 - val_acc: 0.7223\n", "Epoch 18/50\n", "1/1 - 0s - loss: 0.4832 - acc: 0.7930 - val_loss: 0.6777 - val_acc: 0.7251\n", "Epoch 19/50\n", "1/1 - 0s - loss: 0.4477 - acc: 0.8105 - val_loss: 0.6668 - val_acc: 0.7242\n", "Epoch 20/50\n", "1/1 - 0s - loss: 0.4439 - acc: 0.7971 - val_loss: 0.6176 - 
val_acc: 0.7196\n", "Epoch 21/50\n", "1/1 - 0s - loss: 0.3993 - acc: 0.8309 - val_loss: 0.6136 - val_acc: 0.7196\n", "Epoch 22/50\n", "1/1 - 0s - loss: 0.3830 - acc: 0.8248 - val_loss: 0.6248 - val_acc: 0.7196\n", "Epoch 23/50\n", "1/1 - 0s - loss: 0.4062 - acc: 0.8473 - val_loss: 0.6505 - val_acc: 0.7205\n", "Epoch 24/50\n", "1/1 - 0s - loss: 0.4259 - acc: 0.8504 - val_loss: 0.6313 - val_acc: 0.7232\n", "Epoch 25/50\n", "1/1 - 0s - loss: 0.3858 - acc: 0.8504 - val_loss: 0.6221 - val_acc: 0.7232\n", "Epoch 26/50\n", "1/1 - 0s - loss: 0.3439 - acc: 0.8596 - val_loss: 0.6356 - val_acc: 0.7196\n", "Epoch 27/50\n", "1/1 - 0s - loss: 0.3333 - acc: 0.8709 - val_loss: 0.6512 - val_acc: 0.7205\n", "Epoch 28/50\n", "1/1 - 0s - loss: 0.3255 - acc: 0.8760 - val_loss: 0.6791 - val_acc: 0.7232\n", "Epoch 29/50\n", "1/1 - 0s - loss: 0.3593 - acc: 0.8791 - val_loss: 0.7117 - val_acc: 0.7214\n", "Epoch 30/50\n", "1/1 - 0s - loss: 0.3251 - acc: 0.8873 - val_loss: 0.7323 - val_acc: 0.7242\n", "Epoch 31/50\n", "1/1 - 0s - loss: 0.3256 - acc: 0.8770 - val_loss: 0.7427 - val_acc: 0.7288\n", "Epoch 32/50\n", "1/1 - 0s - loss: 0.3088 - acc: 0.9037 - val_loss: 0.7509 - val_acc: 0.7297\n", "Epoch 33/50\n", "1/1 - 0s - loss: 0.3048 - acc: 0.8934 - val_loss: 0.7523 - val_acc: 0.7371\n", "Epoch 34/50\n", "1/1 - 0s - loss: 0.2989 - acc: 0.8996 - val_loss: 0.7425 - val_acc: 0.7380\n", "Epoch 35/50\n", "1/1 - 0s - loss: 0.2847 - acc: 0.9047 - val_loss: 0.7396 - val_acc: 0.7362\n", "Epoch 36/50\n", "1/1 - 0s - loss: 0.2645 - acc: 0.9016 - val_loss: 0.7313 - val_acc: 0.7380\n", "Epoch 37/50\n", "1/1 - 0s - loss: 0.2811 - acc: 0.8975 - val_loss: 0.7350 - val_acc: 0.7362\n", "Epoch 38/50\n", "1/1 - 0s - loss: 0.2720 - acc: 0.9078 - val_loss: 0.6788 - val_acc: 0.7389\n", "Epoch 39/50\n", "1/1 - 0s - loss: 0.2603 - acc: 0.8986 - val_loss: 0.6679 - val_acc: 0.7371\n", "Epoch 40/50\n", "1/1 - 0s - loss: 0.2580 - acc: 0.9047 - val_loss: 0.6692 - val_acc: 0.7408\n", "Epoch 41/50\n", "1/1 - 0s - loss: 
0.2809 - acc: 0.8955 - val_loss: 0.6916 - val_acc: 0.7408\n", "Epoch 42/50\n", "1/1 - 0s - loss: 0.2540 - acc: 0.9016 - val_loss: 0.7552 - val_acc: 0.7435\n", "Epoch 43/50\n", "1/1 - 0s - loss: 0.2629 - acc: 0.9139 - val_loss: 0.8007 - val_acc: 0.7445\n", "Epoch 44/50\n", "1/1 - 0s - loss: 0.2614 - acc: 0.9273 - val_loss: 0.8633 - val_acc: 0.7445\n", "Epoch 45/50\n", "1/1 - 0s - loss: 0.2316 - acc: 0.9057 - val_loss: 0.8980 - val_acc: 0.7500\n", "Epoch 46/50\n", "1/1 - 0s - loss: 0.2204 - acc: 0.9242 - val_loss: 0.9062 - val_acc: 0.7472\n", "Epoch 47/50\n", "1/1 - 0s - loss: 0.2326 - acc: 0.9160 - val_loss: 0.9067 - val_acc: 0.7537\n", "Epoch 48/50\n", "1/1 - 0s - loss: 0.2358 - acc: 0.9334 - val_loss: 0.8805 - val_acc: 0.7601\n", "Epoch 49/50\n", "1/1 - 0s - loss: 0.2196 - acc: 0.9211 - val_loss: 0.8471 - val_acc: 0.7592\n", "Epoch 50/50\n", "1/1 - 0s - loss: 0.2102 - acc: 0.9221 - val_loss: 0.8198 - val_acc: 0.7620\n" ] } ], "source": [ "history = model.fit(\n", " train_flow, epochs=epochs, validation_data=test_flow, verbose=2, shuffle=False\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the training history:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "sg.utils.plot_history(history)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluate the trained model on test citation links:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " ['...']\n", "1/1 [==============================] - 0s 9ms/step - loss: 0.1409 - acc: 0.9641\n", " ['...']\n", "1/1 [==============================] - 0s 9ms/step - loss: 0.8198 - acc: 0.7620\n", "\n", "Train Set Metrics of the trained model:\n", "\tloss: 0.1409\n", "\tacc: 0.9641\n", "\n", "Test Set Metrics of the trained model:\n", "\tloss: 0.8198\n", "\tacc: 0.7620\n" ] } ], "source": [ "train_metrics = model.evaluate(train_flow)\n", "test_metrics = model.evaluate(test_flow)\n", "\n", "print(\"\\nTrain Set Metrics of the trained model:\")\n", "for name, val in zip(model.metrics_names, train_metrics):\n", " print(\"\\t{}: {:0.4f}\".format(name, val))\n", "\n", "print(\"\\nTest Set Metrics of the trained model:\")\n", "for name, val in zip(model.metrics_names, test_metrics):\n", " print(\"\\t{}: {:0.4f}\".format(name, val))" ] }, { "cell_type": "markdown", "metadata": { "nbsphinx": "hidden", "tags": [ "CloudRunner" ] }, "source": [ "
" ] } ], "metadata": { "file_extension": ".py", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" }, "mimetype": "text/x-python", "name": "python", "npconvert_exporter": "python", "pygments_lexer": "ipython3", "version": 3 }, "nbformat": 4, "nbformat_minor": 4 }