{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Node representation learning with GraphSAGE and UnsupervisedSampler\n" ] }, { "cell_type": "markdown", "metadata": { "nbsphinx": "hidden", "tags": [ "CloudRunner" ] }, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Stellargraph Unsupervised GraphSAGE is the implementation of GraphSAGE method outlined in the paper: [Inductive Representation Learning on Large Graphs.](http://snap.stanford.edu/graphsage/) W.L. Hamilton, R. Ying, and J. Leskovec arXiv:1706.02216\n", "[cs.SI], 2017. \n", "\n", "This notebook is a short demo of how Stellargraph Unsupervised GraphSAGE can be used to learn embeddings of the nodes representing papers in the [CORA citation network](https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz). Furthermore, this notebook demonstrates the use of the learnt embeddings in a downstream node classification task (classifying papers by subject). Note that the node embeddings can also be used in other graph machine learning tasks, such as link prediction, community detection, etc.\n", "\n", "## Unsupervised GraphSAGE:\n", "\n", "A high-level explanation of the unsupervised GraphSAGE method of graph representation learning is as follows.\n", "\n", "Objective: *Given a graph, learn embeddings of the nodes using only the graph structure and the node features, without using any known node class labels* (hence \"unsupervised\"; for semi-supervised learning of node embeddings, see this [demo](../node-classification/graphsage-node-classification.ipynb))\n", "\n", "**Unsupervised GraphSAGE model:** In the Unsupervised GraphSAGE model, node embeddings are learnt by solving a simple classification task: given a large set of \"positive\" `(target, context)` node pairs generated from random walks performed on the graph (i.e., node pairs that co-occur within a certain context window in random walks), and an equally large set of \"negative\" node pairs that are randomly selected from the graph according to a certain distribution, learn a binary classifier that predicts whether arbitrary node pairs are likely to co-occur in a random walk performed on the graph. 
Through learning this simple binary node-pair-classification task, the model automatically learns an inductive mapping from attributes of nodes and their neighbors to node embeddings in a high-dimensional vector space, which preserves structural and feature similarities of the nodes. Unlike embeddings obtained by algorithms such as [Node2Vec](https://snap.stanford.edu/node2vec), this mapping is inductive: given a new node (with attributes) and its links to other nodes in the graph (which was unseen during model training), we can evaluate its embeddings without having to re-train the model. \n", "\n", "In our implementation of Unsupervised GraphSAGE, the training set of node pairs is composed of an equal number of positive and negative `(target, context)` pairs from the graph. The positive `(target, context)` pairs are the node pairs co-occurring on random walks over the graph whereas the negative node pairs are sampled randomly from a global node degree distribution of the graph.\n", "\n", "The architecture of the node pair classifier is the following. Input node pairs (with node features) are fed, together with the graph structure, into a pair of identical GraphSAGE encoders, producing a pair of node embeddings. These embeddings are then fed into a node pair classification layer, which applies a binary operator to those node embeddings (e.g., concatenating them), and passes the resulting node pair embeddings through a linear transform followed by a binary activation (e.g., sigmoid), thus predicting a binary label for the node pair. 
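
As a rough illustration of the classification head described above, here is a minimal pure-Python sketch. It is not StellarGraph's implementation: the helper names (`combine`, `predict_pair`, `binary_cross_entropy`), the toy embeddings, and the weights are all made up for this example, and the two encoder outputs are simply assumed to be given.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def combine(u, v, method='ip'):
    # Binary operator that turns two node embeddings into one pair embedding.
    # 'ip' is the inner product (one number); 'concat' joins the two vectors.
    if method == 'ip':
        return [sum(a * b for a, b in zip(u, v))]
    if method == 'concat':
        return list(u) + list(v)
    raise ValueError('unknown method: ' + method)


def predict_pair(u, v, weights, bias, method='ip'):
    # Linear transform of the pair embedding followed by a sigmoid,
    # giving a score in (0, 1) for the pair being a positive one.
    pair_embedding = combine(u, v, method)
    z = sum(w * x for w, x in zip(weights, pair_embedding)) + bias
    return sigmoid(z)


def binary_cross_entropy(label, p, eps=1e-12):
    # Loss for one pair; p is clipped to keep the logarithms finite.
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Hypothetical encoder outputs (unit-length toy vectors, not real embeddings).
target = [0.6, 0.8]
context = [0.6, 0.8]   # same direction as target: plays the positive pair
other = [-0.8, 0.6]    # orthogonal to target: plays the negative pair

p_pos = predict_pair(target, context, weights=[4.0], bias=0.0)
p_neg = predict_pair(target, other, weights=[4.0], bias=0.0)
loss = binary_cross_entropy(1, p_pos) + binary_cross_entropy(0, p_neg)
print(p_pos, p_neg, loss)
```

With the inner product, the pair embedding is a single number, so the linear transform reduces to one weight and a bias; with concatenation it would need one weight per concatenated dimension.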
\n", "\n", "The entire model is trained end-to-end by minimizing the loss function of choice (e.g., binary cross-entropy between predicted node pair labels and true link labels) using stochastic gradient descent (SGD) updates of the model parameters, with minibatches of 'training' links generated on demand and fed into the model.\n", "\n", "Node embeddings obtained from the encoder part of the trained classifier can be used in various downstream tasks. In this demo, we show how these can be used for predicting node labels." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "nbsphinx": "hidden", "tags": [ "CloudRunner" ] }, "outputs": [], "source": [ "# install StellarGraph if running on Google Colab\n", "import sys\n", "if 'google.colab' in sys.modules:\n", " %pip install -q stellargraph[demos]==1.2.1" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "nbsphinx": "hidden", "tags": [ "VersionCheck" ] }, "outputs": [], "source": [ "# verify that we're using the correct version of StellarGraph for this notebook\n", "import stellargraph as sg\n", "\n", "try:\n", " sg.utils.validate_notebook_version(\"1.2.1\")\n", "except AttributeError:\n", " raise ValueError(\n", " f\"This notebook requires StellarGraph version 1.2.1, but a different version {sg.__version__} is installed. 
Please see .\"\n", " ) from None" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import networkx as nx\n", "import pandas as pd\n", "import numpy as np\n", "import os\n", "import random\n", "\n", "import stellargraph as sg\n", "from stellargraph.data import EdgeSplitter\n", "from stellargraph.mapper import GraphSAGELinkGenerator\n", "from stellargraph.layer import GraphSAGE, link_classification\n", "from stellargraph.data import UniformRandomWalk\n", "from stellargraph.data import UnsupervisedSampler\n", "from sklearn.model_selection import train_test_split\n", "\n", "from tensorflow import keras\n", "from sklearn import preprocessing, feature_extraction, model_selection\n", "from sklearn.linear_model import LogisticRegressionCV, LogisticRegression\n", "from sklearn.metrics import accuracy_score\n", "\n", "from stellargraph import globalvar\n", "\n", "from stellargraph import datasets\n", "from IPython.display import display, HTML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the CORA network data" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "DataLoadingLinks" ] }, "source": [ "(See [the \"Loading from Pandas\" demo](../basics/loading-pandas.ipynb) for details on how data can be loaded.)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [ "DataLoading" ] }, "outputs": [ { "data": { "text/html": [ "The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words." 
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dataset = datasets.Cora()\n", "display(HTML(dataset.description))\n", "G, node_subjects = dataset.load()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 2708, Edges: 5429\n", "\n", " Node types:\n", " paper: [2708]\n", " Edge types: paper-cites->paper\n", "\n", " Edge types:\n", " paper-cites->paper: [5429]\n" ] } ], "source": [ "print(G.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Unsupervised GraphSAGE with on-demand sampling\n", "Unsupervised GraphSAGE requires training samples, which can be provided either as a list of `(target, context)` node pairs or through an `UnsupervisedSampler` instance that generates positive and negative node pairs on demand. In this demo we use the latter technique. \n", "\n", "### UnsupervisedSampler:\n", "The `UnsupervisedSampler` class takes in a `StellarGraph` graph instance. The `generator` method in the `UnsupervisedSampler` is responsible for generating an equal number of positive and negative node pair samples from the graph for training. The samples are generated by performing uniform random walks over the graph, using a `UniformRandomWalk` object. Positive `(target, context)` node pairs are extracted from the walks, and for each \n", "positive pair a corresponding negative pair `(target, node)` is generated by randomly sampling `node` from the degree distribution of the graph. Once `batch_size` samples have accumulated, the generator yields a list of positive and negative node pairs along with their respective 1/0 labels. \n", "\n", "In the current implementation, we use uniform random walks to explore the graph structure. 
The length and number of walks, as well as the root nodes for starting the walks, can be user-specified. The default list of root nodes is all nodes of the graph, the default `number_of_walks` is 1 (at least one walk per root node), and the default `length` of walks is 2 (at least one node beyond the root node is needed on the walk as a potential positive context)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**1. Specify the optional parameter values: root nodes, the number of walks to take per node, and the length of each walk. (A random seed can also be specified; none is set here.)**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "nodes = list(G.nodes())\n", "number_of_walks = 1\n", "length = 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2. Create the UnsupervisedSampler instance with the relevant parameters passed to it.**" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "unsupervised_samples = UnsupervisedSampler(\n", " G, nodes=nodes, length=length, number_of_walks=number_of_walks\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The graph G together with the unsupervised sampler will be used to generate samples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3. Create a node pair generator:**\n", "\n", "Next, create the node pair generator for sampling and streaming the training data to the model. The node pair generator essentially \"maps\" pairs of nodes `(target, context)` to the input of GraphSAGE: it either takes minibatches of node pairs, or an `UnsupervisedSampler` instance which generates the minibatches of node pairs on demand. 
The generator samples 2-hop subgraphs with `(target, context)` head nodes extracted from those pairs, and feeds them, together with the corresponding binary labels indicating whether each pair is a positive or a negative sample, to the input layer of the node pair classifier with GraphSAGE node encoder, for SGD updates of the model parameters.\n", "\n", "Specify:\n", "1. The minibatch size (number of node pairs per minibatch).\n", "2. The number of epochs for training the model.\n", "3. The sizes of 1- and 2-hop neighbor samples for GraphSAGE.\n", "\n", "Note that the length of the `num_samples` list defines the number of layers/iterations in the GraphSAGE encoder. In this example, we are defining a 2-layer GraphSAGE encoder." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "batch_size = 50\n", "epochs = 4\n", "num_samples = [10, 5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, we create the node pair generator and pass it the `UnsupervisedSampler`, which will generate samples on demand." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "generator = GraphSAGELinkGenerator(G, batch_size, num_samples)\n", "train_gen = generator.flow(unsupervised_samples)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build the model: a 2-layer GraphSAGE encoder acting as node representation learner, with a link classification layer that combines the (`citing-paper`, `cited-paper`) node embeddings using an inner product.\n", "\n", "The GraphSAGE part of the model has hidden layer sizes of 50 for both GraphSAGE layers, a bias term, and no dropout. (Dropout can be switched on by specifying a positive dropout rate, 0 < dropout < 1.)\n", "Note that the length of the `layer_sizes` list must be equal to the length of `num_samples`, as `len(num_samples)` defines the number of hops (layers) in the GraphSAGE encoder." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "layer_sizes = [50, 50]\n", "graphsage = GraphSAGE(\n", " layer_sizes=layer_sizes, generator=generator, bias=True, dropout=0.0, normalize=\"l2\"\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Build the model and expose input and output sockets of graphsage, for node pair inputs:\n", "x_inp, x_out = graphsage.in_out_tensors()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The final node pair classification layer takes a pair of node embeddings produced by the `graphsage` encoder, applies a binary operator to them to produce the corresponding node pair embedding (`ip` for inner product; other options for the binary operator can be seen by running a cell with `?link_classification` in it), and passes it through a dense layer:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "link_classification: using 'ip' method to combine node embeddings into edge embeddings\n" ] } ], "source": [ "prediction = link_classification(\n", " output_dim=1, output_act=\"sigmoid\", edge_embedding_method=\"ip\"\n", ")(x_out)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Stack the GraphSAGE encoder and prediction layer into a Keras model, and specify the optimizer, loss, and metrics." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "model = keras.Model(inputs=x_inp, outputs=prediction)\n", "\n", "model.compile(\n", " optimizer=keras.optimizers.Adam(learning_rate=1e-3),\n", " loss=keras.losses.binary_crossentropy,\n", " metrics=[keras.metrics.binary_accuracy],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**4. 
Train the model.**" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/4\n", "434/434 [==============================] - 35s 80ms/step - loss: 0.5668 - binary_accuracy: 0.7413\n", "Epoch 2/4\n", "434/434 [==============================] - 33s 77ms/step - loss: 0.5404 - binary_accuracy: 0.7739\n", "Epoch 3/4\n", "434/434 [==============================] - 34s 78ms/step - loss: 0.5378 - binary_accuracy: 0.7823\n", "Epoch 4/4\n", "434/434 [==============================] - 34s 78ms/step - loss: 0.5383 - binary_accuracy: 0.7815\n" ] } ], "source": [ "history = model.fit(\n", " train_gen,\n", " epochs=epochs,\n", " verbose=1,\n", " use_multiprocessing=False,\n", " workers=4,\n", " shuffle=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that multiprocessing is switched off: with a large training set of node pairs, transferring the data between processes can slow down training considerably. \n", "\n", "Multiple workers can be used with `Keras version 2.2.4` and above, which speeds up training considerably through multi-threading." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting node embeddings\n", "Now that the node pair classifier is trained, we can use its node encoder part as a node embedding evaluator. Below we evaluate node embeddings as activations of the output of the GraphSAGE layer stack, and visualise them, coloring nodes by their subject label." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "from sklearn.manifold import TSNE\n", "from stellargraph.mapper import GraphSAGENodeGenerator\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Building a new node-based model**\n", "\n", "The `(src, dst)` node pair classifier `model` has two identical node encoders: one for source nodes in the node pairs, the other for destination nodes in the node pairs passed to the model. We can use either of the two identical encoders to evaluate node embeddings. Below we create an embedding model by defining a new Keras model with `x_inp_src` (every other element of `x_inp`, starting from the first, i.e. `x_inp[0::2]`) and `x_out_src` (the 1st element in `x_out`) as input and output, respectively. Note that this model's weights are the same as those of the corresponding node encoder in the previously trained node pair classifier." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "x_inp_src = x_inp[0::2]\n", "x_out_src = x_out[0]\n", "embedding_model = keras.Model(inputs=x_inp_src, outputs=x_out_src)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also need a node generator to feed graph nodes to `embedding_model`. 
We want to evaluate node embeddings for all nodes in the graph:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "node_ids = node_subjects.index\n", "node_gen = GraphSAGENodeGenerator(G, batch_size, num_samples).flow(node_ids)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now use `node_gen` to feed all nodes into the embedding model and extract their embeddings:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "55/55 [==============================] - 1s 19ms/step\n" ] } ], "source": [ "node_embeddings = embedding_model.predict(node_gen, workers=4, verbose=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize the node embeddings \n", "Next we visualize the node embeddings in 2D using t-SNE. Node colors depict the true classes of the nodes (subjects, in the case of the Cora dataset). " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "node_subject = node_subjects.astype(\"category\").cat.codes\n", "\n", "X = node_embeddings\n", "if X.shape[1] > 2:\n", " transform = TSNE # PCA\n", "\n", " trans = transform(n_components=2)\n", " emb_transformed = pd.DataFrame(trans.fit_transform(X), index=node_ids)\n", " emb_transformed[\"label\"] = node_subject\n", "else:\n", " emb_transformed = pd.DataFrame(X, index=node_ids)\n", " emb_transformed = emb_transformed.rename(columns={\"0\": 0, \"1\": 1})\n", " emb_transformed[\"label\"] = node_subject" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "alpha = 0.7\n", "\n", "fig, ax = plt.subplots(figsize=(7, 7))\n", "ax.scatter(\n", " emb_transformed[0],\n", " emb_transformed[1],\n", " c=emb_transformed[\"label\"].astype(\"category\"),\n", " cmap=\"jet\",\n", " alpha=alpha,\n", ")\n", "ax.set(aspect=\"equal\", xlabel=\"$X_1$\", ylabel=\"$X_2$\")\n", "plt.title(\n", " \"{} visualization of GraphSAGE embeddings for cora dataset\".format(transform.__name__)\n", ")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The observation that same-colored nodes in the embedding space are concentrated together is indicative of similarity of embeddings of papers on the same topics. We would emphasize here again that the node embeddings are learnt in unsupervised way, without using true class labels. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Downstream task\n", "\n", "The node embeddings calculated using the unsupervised GraphSAGE can be used as node feature vectors in a downstream task such as node classification. \n", "\n", "In this example, we will use the node embeddings to train a simple Logistic Regression classifier to predict paper subjects in Cora dataset." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# X will hold the 50 input features (node embeddings)\n", "X = node_embeddings\n", "# y holds the corresponding target values\n", "y = np.array(node_subject)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Splitting\n", "\n", "We split the data into train and test sets. \n", "\n", "We use 5% of the data for training and the remaining 95% for testing as a hold out test set." 
] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, train_size=0.05, test_size=None, stratify=y\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classifier Training\n", "\n", "We train a Logistic Regression classifier on the training data. " ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, l1_ratio=None, max_iter=100,\n", " multi_class='auto', n_jobs=None, penalty='l2',\n", " random_state=None, solver='lbfgs', tol=0.0001, verbose=0,\n", " warm_start=False)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf = LogisticRegression(verbose=0, solver=\"lbfgs\", multi_class=\"auto\")\n", "clf.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predict the classes of the held-out test set." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "y_pred = clf.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculate the accuracy of the classifier on the test set." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7427127866303925" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accuracy_score(y_test, y_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The obtained accuracy (about 0.74 on the held-out test set) is better than that obtained using `node2vec` embeddings, which ignore node attributes and only take the graph structure into account (see this [demo](node2vec-embeddings.ipynb)). 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Predicted classes**" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2 831\n", "1 428\n", "6 406\n", "3 356\n", "0 334\n", "4 195\n", "5 23\n", "dtype: int64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(y_pred).value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**True classes**" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2 818\n", "3 426\n", "1 418\n", "6 351\n", "0 298\n", "4 217\n", "5 180\n", "dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(y).value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Uses for unsupervised graph representation learning\n", "1. Unsupervised GraphSAGE learns embeddings of unlabeled graph nodes. This is highly useful as most of the real-world data is typically either unlabeled, or have noisy, unreliable, or sparse labels. In such scenarios unsupervised techniques that learn low-dimensional meaningful representation of nodes in a graph by leveraging the graph structure and features of the nodes is useful.\n", "2. Moreover, GraphSAGE is an inductive technique that allows us to obtain embeddings of unseen nodes, without the need to re-train the embedding model. That is, instead of training individual embeddings for each node (as in algorithms such as `node2vec` that learn a look-up table of node embeddings), GraphSAGE learns a function that generates embeddings by sampling and aggregating attributes from each node's local neighborhood, and combining those with the node's own attributes." ] }, { "cell_type": "markdown", "metadata": { "nbsphinx": "hidden", "tags": [ "CloudRunner" ] }, "source": [ "
" ] } ], "metadata": { "file_extension": ".py", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" }, "mimetype": "text/x-python", "name": "python", "npconvert_exporter": "python", "pygments_lexer": "ipython3", "version": 3 }, "nbformat": 4, "nbformat_minor": 4 }