{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Node classification via node representations with attri2vec" ] }, { "cell_type": "markdown", "metadata": { "nbsphinx": "hidden", "tags": [ "CloudRunner" ] }, "source": [ "
Run the latest release of this notebook:
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is a Python implementation of node classification with the attri2vec algorithm described in [1]. The implementation uses StellarGraph components.\n", "\n", "\n", "**References:** \n", "\n", "[1] [Attributed Network Embedding via Subspace Discovery](https://link.springer.com/article/10.1007/s10618-019-00650-2). D. Zhang, J. Yin, X. Zhu and C. Zhang, Data Mining and Knowledge Discovery, 2019. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## attri2vec\n", "\n", "attri2vec learns node representations by performing a linear/non-linear mapping on node content attributes. To make the learned node representations respect structural similarity, the [DeepWalk](https://dl.acm.org/citation.cfm?id=2623732)/[Node2Vec](https://snap.stanford.edu/node2vec) learning mechanism is used so that nodes sharing similar random-walk context nodes are represented close together in the subspace. \n", "\n", "For each (``target``,``context``) node pair $(v_i,v_j)$ collected from random walks, attri2vec learns the representation for the target node $v_i$ by using it to predict the existence of context node $v_j$, with the following three-layer neural network." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](attri2vec-illustration.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Node $v_i$'s representation in the hidden layer is obtained by multiplying $v_i$'s raw content feature vector in the input layer with the input-to-hidden weight matrix $W_{in}$ followed by an activation function. The output layer gives the existence probability of each node conditioned on node $v_i$, obtained by multiplying $v_i$'s hidden-layer representation with the hidden-to-output weight matrix $W_{out}$ followed by a softmax activation. To capture the ``target-context`` relation between $v_i$ and $v_j$, we need to maximize the probability $\\mathrm{P}(v_j|v_i)$. 
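
The forward pass described above can be sketched in NumPy as follows. This is only an illustration, not the StellarGraph implementation: the toy sizes, the random weights and the helper `context_probabilities` are assumptions made for the example, and a sigmoid is assumed for the hidden-layer activation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes, chosen only for illustration
num_nodes, num_features, hidden_dim = 6, 10, 4

# 0/1-valued content features, one row per node
X = rng.integers(0, 2, size=(num_nodes, num_features)).astype(float)
# Input-to-hidden and hidden-to-output weight matrices (random stand-ins)
W_in = rng.normal(scale=0.1, size=(num_features, hidden_dim))
W_out = rng.normal(scale=0.1, size=(hidden_dim, num_nodes))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def context_probabilities(i):
    # Returns P(v_j | v_i) for every node j, given the target node index i
    h = sigmoid(X[i] @ W_in)        # hidden-layer representation of v_i
    scores = h @ W_out              # one score per candidate context node
    scores = scores - scores.max()  # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # softmax over all nodes


probs = context_probabilities(0)
```
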
However, computing $\\mathrm{P}(v_j|v_i)$ is time-consuming, because it involves the matrix multiplication between $v_i$'s hidden-layer representation and the full hidden-to-output weight matrix $W_{out}$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To speed up the computation, we adopt the negative sampling strategy [1]. For each (``target``, ``context``) node pair, we sample a negative node $v_k$, which is not $v_i$'s context. To obtain the output, instead of multiplying $v_i$'s hidden-layer representation with the hidden-to-output weight matrix $W_{out}$ followed by a softmax activation, we only calculate the dot products between $v_i$'s hidden-layer representation and the $j$th and $k$th columns of the hidden-to-output weight matrix $W_{out}$, each followed by a sigmoid activation. According to [1], the original objective to maximize $\\mathrm{P}(v_j|v_i)$ can be approximated by minimizing the cross entropy between $v_j$ and $v_k$'s outputs and their ground-truth labels (1 for $v_j$ and 0 for $v_k$)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The entire model is trained end-to-end by minimizing the binary cross-entropy loss between the predicted and true node pair labels, using stochastic gradient descent (SGD) updates of the model parameters, with minibatches of 'training' node pairs generated on demand and fed into the model."
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "nbsphinx": "hidden", "tags": [ "CloudRunner" ] }, "outputs": [], "source": [ "# install StellarGraph if running on Google Colab\n", "import sys\n", "if 'google.colab' in sys.modules:\n", " %pip install -q stellargraph[demos]==1.2.1" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "nbsphinx": "hidden", "tags": [ "VersionCheck" ] }, "outputs": [], "source": [ "# verify that we're using the correct version of StellarGraph for this notebook\n", "import stellargraph as sg\n", "\n", "try:\n", " sg.utils.validate_notebook_version(\"1.2.1\")\n", "except AttributeError:\n", " raise ValueError(\n", " f\"This notebook requires StellarGraph version 1.2.1, but a different version {sg.__version__} is installed. Please see .\"\n", " ) from None" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import networkx as nx\n", "import pandas as pd\n", "import numpy as np\n", "import os\n", "import random\n", "\n", "import stellargraph as sg\n", "from stellargraph.data import UnsupervisedSampler\n", "from stellargraph.mapper import Attri2VecLinkGenerator, Attri2VecNodeGenerator\n", "from stellargraph.layer import Attri2Vec, link_classification\n", "\n", "from tensorflow import keras\n", "\n", "from pandas.core.indexes.base import Index\n", "\n", "import matplotlib.pyplot as plt\n", "from sklearn.manifold import TSNE\n", "from sklearn.decomposition import PCA\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegressionCV\n", "from sklearn.metrics import accuracy_score\n", "from stellargraph import datasets\n", "from IPython.display import display, HTML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset\n", "\n", "For clarity we ignore isolated nodes and subgraphs and use only the largest connected component." 
] }, { "cell_type": "markdown", "metadata": { "tags": [ "DataLoadingLinks" ] }, "source": [ "(See [the \"Loading from Pandas\" demo](../basics/loading-pandas.ipynb) for details on how data can be loaded.)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [ "DataLoading" ] }, "outputs": [ { "data": { "text/html": [ "The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links, although 17 of these have a source or target publication that isn't in the dataset and only 4715 are included in the graph. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dataset = datasets.CiteSeer()\n", "display(HTML(dataset.description))\n", "G, subjects = dataset.load(largest_connected_component_only=True)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 2110, Edges: 3757\n", "\n", " Node types:\n", " paper: [2110]\n", " Features: float32 vector, length 3703\n", " Edge types: paper-cites->paper\n", "\n", " Edge types:\n", " paper-cites->paper: [3757]\n", " Weights: all 1 (default)\n", " Features: none\n" ] } ], "source": [ "print(G.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train attri2vec on CiteSeer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Specify the random walk parameters: the root nodes, the number of walks to take per node, and the length of each walk." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "nodes = list(G.nodes())\n", "number_of_walks = 4\n", "length = 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create the UnsupervisedSampler instance with the relevant parameters passed to it." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "unsupervised_samples = UnsupervisedSampler(\n", " G, nodes=nodes, length=length, number_of_walks=number_of_walks\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the batch size and the number of epochs. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "batch_size = 50\n", "epochs = 4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define an attri2vec generator, which generates batches of (target, context) nodes and labels for the node pair." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "generator = Attri2VecLinkGenerator(G, batch_size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build the model: a 1-hidden-layer node representation ('input embedding') of the `target` node and a parameter vector ('output embedding') for predicting the existence of the `context` node for each `(target, context)` pair, with a link classification layer applied to the dot product of the 'input embedding' of the `target` node and the 'output embedding' of the `context` node.\n", "\n", "The Attri2Vec part of the model has a 128-dimensional hidden layer, no bias term and no normalization. (Normalization can be set to 'l2'). 
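
As a rough sketch of what this model computes for one node pair, the NumPy example below scores a positive and a negative pair and evaluates the binary cross-entropy. It is not the StellarGraph code: the toy sizes, random weights, node indices and the helpers `pair_score` and `binary_cross_entropy` are hypothetical, and a sigmoid hidden-layer activation is assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy sizes, chosen only for illustration
num_nodes, num_features, hidden_dim = 6, 10, 4

X = rng.integers(0, 2, size=(num_nodes, num_features)).astype(float)
W_in = rng.normal(scale=0.1, size=(num_features, hidden_dim))
# One output embedding (parameter vector) per candidate context node
out_embeddings = rng.normal(scale=0.1, size=(num_nodes, hidden_dim))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def pair_score(target, context):
    # Input embedding of the target node: one hidden layer, no bias term
    input_embedding = sigmoid(X[target] @ W_in)
    # Dot product with the context node's output embedding, then sigmoid
    return sigmoid(input_embedding @ out_embeddings[context])


def binary_cross_entropy(label, score):
    return -(label * np.log(score) + (1 - label) * np.log(1 - score))


# Pretend node 2 is a true context of node 0 and node 5 is a sampled negative
positive = pair_score(0, 2)
negative = pair_score(0, 5)
loss = binary_cross_entropy(1, positive) + binary_cross_entropy(0, negative)
```
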
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "layer_sizes = [128]\n", "attri2vec = Attri2Vec(\n", " layer_sizes=layer_sizes, generator=generator, bias=False, normalize=None\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Build the model and expose input and output sockets of attri2vec, for node pair inputs:\n", "x_inp, x_out = attri2vec.in_out_tensors()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the `link_classification` function to generate the prediction with the `ip` (inner product) edge embedding method and the `sigmoid` activation; this computes the dot product of the 'input embedding' of the target node and the 'output embedding' of the context node, followed by a sigmoid activation. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "link_classification: using 'ip' method to combine node embeddings into edge embeddings\n" ] } ], "source": [ "prediction = link_classification(\n", " output_dim=1, output_act=\"sigmoid\", edge_embedding_method=\"ip\"\n", ")(x_out)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Stack the Attri2Vec encoder and prediction layer into a Keras model, and specify the loss." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "model = keras.Model(inputs=x_inp, outputs=prediction)\n", "\n", "model.compile(\n", " optimizer=keras.optimizers.Adam(learning_rate=1e-3),\n", " loss=keras.losses.binary_crossentropy,\n", " metrics=[keras.metrics.binary_accuracy],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Train the model." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train for 1351 steps\n", "Epoch 1/4\n", "1351/1351 - 6s - loss: 0.6822 - binary_accuracy: 0.5551\n", "Epoch 2/4\n", "1351/1351 - 5s - loss: 0.5173 - binary_accuracy: 0.7547\n", "Epoch 3/4\n", "1351/1351 - 5s - loss: 0.3163 - binary_accuracy: 0.8961\n", "Epoch 4/4\n", "1351/1351 - 5s - loss: 0.2059 - binary_accuracy: 0.9439\n" ] } ], "source": [ "history = model.fit(\n", " generator.flow(unsupervised_samples),\n", " epochs=epochs,\n", " verbose=2,\n", " use_multiprocessing=False,\n", " workers=1,\n", " shuffle=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualise Node Embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build the node based model for predicting node representations from node content attributes with the learned parameters. Below a Keras model is constructed, with `x_inp[0]` as input and `x_out[0]` as output. Note that this model's weights are the same as those of the corresponding node encoder in the previously trained node pair classifier." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "x_inp_src = x_inp[0]\n", "x_out_src = x_out[0]\n", "embedding_model = keras.Model(inputs=x_inp_src, outputs=x_out_src)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the node embeddings by applying the learned mapping function to node content features." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "43/43 [==============================] - 0s 2ms/step\n" ] } ], "source": [ "node_gen = Attri2VecNodeGenerator(G, batch_size).flow(subjects.index)\n", "node_embeddings = embedding_model.predict(node_gen, workers=1, verbose=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transform the embeddings to 2d space for visualisation." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "transform = TSNE # PCA\n", "\n", "trans = transform(n_components=2)\n", "node_embeddings_2d = trans.fit_transform(node_embeddings)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# draw the embedding points, coloring them by the target label (paper subject)\n", "alpha = 0.7\n", "label_map = {l: i for i, l in enumerate(np.unique(subjects))}\n", "node_colours = [label_map[target] for target in subjects]\n", "\n", "plt.figure(figsize=(7, 7))\n", "plt.axes().set(aspect=\"equal\")\n", "plt.scatter(\n", " node_embeddings_2d[:, 0],\n", " node_embeddings_2d[:, 1],\n", " c=node_colours,\n", " cmap=\"jet\",\n", " alpha=alpha,\n", ")\n", "plt.title(\"{} visualization of node embeddings\".format(transform.__name__))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Node Classification Task" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The embeddings learned by `attri2vec` can be used as feature vectors in downstream tasks, such as node classification and link prediction.\n", "\n", "In this example, we will use the `attri2vec` node embeddings to train a classifier to predict the subject of a paper in CiteSeer." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# X will hold the 128-dimensional input features\n", "X = node_embeddings\n", "# y holds the corresponding target values\n", "y = np.array(subjects)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Splitting\n", "\n", "We split the data into train and test sets. \n", "\n", "We use 20% of the data for training and keep the remaining 80% as a held-out test set." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Array shapes:\n", " X_train = (422, 128)\n", " y_train = (422,)\n", " X_test = (1688, 128)\n", " y_test = (1688,)\n" ] } ], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, test_size=None)\n", "print(\n", " \"Array shapes:\\n X_train = {}\\n y_train = {}\\n X_test = {}\\n y_test = {}\".format(\n", " X_train.shape, y_train.shape, X_test.shape, y_test.shape\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classifier Training\n", "\n", "We train a Logistic Regression classifier on the training data. " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegressionCV(Cs=10, class_weight=None, cv=10, dual=False,\n", " fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,\n", " max_iter=1000, multi_class='ovr', n_jobs=None,\n", " penalty='l2', random_state=None, refit=True,\n", " scoring='accuracy', solver='lbfgs', tol=0.0001,\n", " verbose=False)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf = LogisticRegressionCV(\n", " Cs=10, cv=10, scoring=\"accuracy\", verbose=False, multi_class=\"ovr\", max_iter=1000\n", ")\n", "clf.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predict the hold-out test set." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "y_pred = clf.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculate the accuracy of the classifier on the test set." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7535545023696683" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accuracy_score(y_test, y_pred)" ] }, { "cell_type": "markdown", "metadata": { "nbsphinx": "hidden", "tags": [ "CloudRunner" ] }, "source": [ "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }