{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Research Notes:\n", "\n", "Here, we implement empirical measurements of graph embedding algorithms. We implement multiple methods:\n", "\n", "1) Link Prediction. This is done for all graphs. We remove a fixed % of edges in the graph and predict missing links in the graph. We train a logistic regression and LightGBM gradient boosted decision tree to predict the edges and report their AUC, accuracy and F1 score.\n", "\n", "2) Clustering (on graphs with community/cluster labels). We use hierarchical agglomerative clustering on the graph embeddings and measure overlap of the embedding with the real-world communities in the network. Agglomerative Hierarchical clustering is chosen because it is deterministic and not sensitive to cluster shape or scaling or embedding metric space. We measure overlap with RAND index, mutual information score and Fowlkes-Mallows score.\n", "\n", "3) Label prediction. This can be multilabel classifications (for graphs where a node can be in multiple groups) or regular classification performance. We train a logistic regression and LightGBM gradient boosted decision tree to predict the labels and report their AUC, accuracy and F1 score.\n", "\n", "\n", "### First vs Second Order:\n", "\n", "We see a sharp divide in empirical performance along first-order and higher-order graph embedding methods.\n", "\n", "**First order** methods directly minimize the distance between nodes and their neighbours. Most graph embedding methods that are based around adjacency matrix factorization (laplacian eigenmaps, SVD, etc.) are first order.\n", "\n", "\n", "**Higher order** methods take in account deeper graph structure. For instance, a *second order* method would account for neighbors-of-neighbors in each node's embeddings. 3rd and higher order deepen this relationship.\n", "\n", "Note that you can do higher order embedding through graph factorization algorithms by augmenting the graph adjacency matrix. \n", "\n", "For instance, one of the methods tested below samples random walks on the graph and generates a co-occurence matrix from these samples to train GLoVe (a first order algorithm). GGVec and GraRep generate higher-order adjacency matrices by taking the dot product of the graph's random walk markov chain transition matrix with itself, then taking a first order embedding method on that.\n", "\n", "Methods based around random walks + word2vec (deepwalk, node2vec, etc.) are naturally higher order methods. The order can be constrained by reducing the word2vec window, or restricting walk length to be extremely short.\n", "\n", "### Findings\n", "\n", "We find that first-order methods generally perform better on clustering and label prediction than higher-order methods. Recommended first order methods are GGVec and ProNE.\n", "\n", "On the other hand, higher order methods perform better on the link prediction task. Interestingly, the gap in link prediction performance is inexistant for artificially created graphs. This suggests higher order methods do learn some of the structure intrinsic to real world graphs.\n", "\n", "These results put in context that it's important to have a diversity of downstream tasks when evaluating embedding models.\n", "\n", "Moreover, we find that neural-net based methods (deepwalk, node2vec and descendants) are extremely inefficient with respect to output dimensions. 
"\n", "Methods based around random walks + word2vec (deepwalk, node2vec, etc.) are naturally higher order methods. The order can be constrained by shrinking the word2vec window, or by restricting walks to be extremely short.\n", "\n", "### Findings\n", "\n", "We find that first-order methods generally perform better on clustering and label prediction than higher-order methods. Recommended first order methods are GGVec and ProNE.\n", "\n", "On the other hand, higher order methods perform better on the link prediction task. Interestingly, the gap in link prediction performance is nonexistent on artificially created graphs. This suggests higher order methods do learn some of the structure intrinsic to real-world graphs.\n", "\n", "These results underline how important it is to evaluate embedding models on a diverse set of downstream tasks.\n", "\n", "Moreover, we find that neural-net based methods (deepwalk, node2vec and descendants) are extremely inefficient with respect to output dimensions: they tend to perform much worse when the number of output dimensions is small.\n", "\n", "### GGVec\n", "\n", "We develop GGVec, a first (and higher) order embedding algorithm.\n", "\n", "This method is fast and scales to large graphs because it directly minimizes the distance between nodes that share an edge (like GloVe). It is naturally a first-order method, but can be made higher order through the trick mentioned above (dot product of the graph transition matrix with itself); the higher-order version scales worse, however.\n", "\n", "Moreover, this is the first algorithm that could embed directly from an edgelist file, since the minimization loop is per-edge (see the sketch before the GGVec benchmark below). This remains to be implemented." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import gc\n", "import networkx as nx\n", "import numpy as np\n", "import os\n", "import pandas as pd\n", "import time\n", "import scipy\n", "import sklearn\n", "from sklearn import cluster, linear_model\n", "from sklearn.decomposition import TruncatedSVD\n", "from sklearn.preprocessing import MultiLabelBinarizer\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.multiclass import OneVsRestClassifier\n", "import sys\n", "import warnings  # Silence perf warning\n", "\n", "sys.path.append(os.path.realpath('..'))\n", "\n", "import nodevectors\n", "import csrgraph as cg\n", "from csrgraph import methods\n", "from nodevectors.evaluation import link_pred\n", "from nodevectors.evaluation import graph_eval\n", "\n", "# UMAP to test (on pip)\n", "import umap\n", "\n", "warnings.simplefilter(\"ignore\")\n", "\n", "def nx_node_weights(G, method, **kwargs):\n", "    \"\"\"Collect node weights (e.g. pagerank) from a networkX scoring method into an array indexed by node id\"\"\"\n", "    pr = np.zeros(len(G))\n", "    prdict = method(G, **kwargs)\n", "    for i in G.nodes:\n", "        pr[i] = prdict[i]\n", "    return pr" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "#### CONFIG\n", "N_COMPONENTS = 32  # resulting embedding dim\n", "SEED = 42  # RNG Seed\n", "TEST_SIZE = 0.2  # held-out fraction for link prediction and label tests\n", "\n", "# For resampling tests\n", "RESAMPLE_WALKS = 10\n", "RESAMPLE_LEN = 6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Availability\n", "\n", "Data for these notebooks can be found here: https://github.com/VHRanger/Graph-Data\n", "\n", "Just download it and point the graph generation methods below to it.\n", "\n", "The data is in a different repo to avoid polluting the pip package."
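] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `link_pred.LinkPrediction` helper used throughout comes from this repo's evaluation code and its internals are not shown here. The next cell is a self-contained toy sketch of the protocol we assume it follows (not the helper's actual code): featurize node pairs with the Hadamard product of their embeddings, sample as many non-edges as negatives, then train and score a classifier on held-out pairs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#### Toy sketch of the link prediction protocol (illustrative only)\n", "import networkx as nx\n", "import numpy as np\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import roc_auc_score\n", "from sklearn.model_selection import train_test_split\n", "\n", "Gd = nx.erdos_renyi_graph(200, 0.05, seed=42)\n", "rng = np.random.default_rng(42)\n", "wd = rng.normal(size=(len(Gd), 8))  # stand-in embedding; the real runs use trained ones\n", "\n", "pos = list(Gd.edges)\n", "neg = []\n", "while len(neg) < len(pos):  # sample as many non-edges as there are edges\n", "    u, v = rng.integers(len(Gd), size=2)\n", "    if u != v and not Gd.has_edge(u, v):\n", "        neg.append((u, v))\n", "\n", "X = np.array([wd[u] * wd[v] for u, v in pos + neg])  # Hadamard edge features\n", "y_edge = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])\n", "Xtr, Xte, ytr, yte = train_test_split(X, y_edge, test_size=0.2, random_state=42)\n", "clf = LogisticRegression().fit(Xtr, ytr)\n", "print(f\"toy AUC: {roc_auc_score(yte, clf.predict_proba(Xte)[:, 1]):.3f}\")"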
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "#### GRAPHS\n", "#### Uncomment one to choose which graph to run evaluation on\n", "\n", "#### Artificial random graphs\n", "# G = nx.binomial_graph(700, 0.6)\n", "# G, labels = graph_eval.make_cluster_graph(n_nodes=820, n_clusters=18, connections=1000, drop_pct=0.5)\n", "# G, labels = graph_eval.make_weighed_cluster_graph(n_nodes=500, n_clusters=6, connections=1500, drop_pct=0.2, max_edge_weight=15)\n", "#### Social graphs\n", "# G, labels = graph_eval.make_blogcatalog(dedupe=True)\n", "# G, mlabels = graph_eval.make_blogcatalog(dedupe=False)\n", "G, labels = graph_eval.make_email()\n", "# G, labels = graph_eval.get_karateclub(\"facebook\") # twitch, github, facebook, wikipedia\n", "# G = graph_eval.get_from_snap(url=\"http://snap.stanford.edu/data/facebook_combined.txt.gz\", sep=' ', header=None, comment='#')\n", "#### Biology Graphs\n", "# G, mlabels = graph_eval.get_n2v_ppi(\"../data/bioNEV/node2vec_PPI\")\n", "\n", "\n", "#### Needs OutOfBounds Nodes support from CSRGraphs to work\n", "# G = graph_eval.get_drugbank_ddi(\"../data/bioNEV/DrugBank_DDI\")\n", "# G, mlabels = graph_eval.get_mashup_ppi(\"../data/bioNEV/Mashup_PPI\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "clusters: 38\n", "Nodes: 10312\n", "Edges: 333983\n", "connected: True\n" ] } ], "source": [ "#### For Link Prediction: Split graph into train and test edge sets\n", "#### (All nodes are still present in both)\n", "G_train, testing_pos_edges = link_pred.split_train_test_graph(G, testing_ratio=TEST_SIZE)\n", "\n", "#### Lazy way to set up evaluation\n", "try:\n", " y = labels.label\n", " n_clusters = y.nunique()\n", " HAS_LABELS = True\n", " print(f\"clusters: {n_clusters}\")\n", "except:\n", " try: # Multilabels \n", " y = MultiLabelBinarizer().fit_transform(mlabels.mlabels)\n", " HAS_LABELS = True\n", " print(f\"multilabels: {y.shape[1]}\")\n", " except: # No Labels\n", " HAS_LABELS = False\n", " print(\"No Labels\")\n", "NNODES = len(G)\n", "print(f\"Nodes: {NNODES}\\nEdges: {len(G.edges)}\\nconnected: {nx.is_connected(G_train)}\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Loss: 0.0070\t: 0%| | 12/6000 [00:27<3:52:49, 2.33s/it]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Converged! Loss: 0.0070\n", "Time: 32.4033\n", "Link Prediction:\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\t(logit) AUC-ROC: 0.956, AUC-PR: 0.954, Acc: 0.891, F1: 0.890\n", "\t(lgbm) AUC-ROC: 0.961, AUC-PR: 0.959, Acc: 0.897, F1: 0.898\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Loss: 0.0070\t: 0%| | 12/6000 [00:27<3:49:13, 2.30s/it]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Converged! 
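{ "cell_type": "markdown", "metadata": {}, "source": [ "Before benchmarking GGVec: its loss is minimized one edge at a time, GloVe-style, which is what would let it embed straight from an edgelist file. The next cell is a toy sketch of such a per-edge update under our own assumptions; the actual `nodevectors.GGVec` implementation (negative sampling schedule, convergence tolerances, compiled loops) differs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#### Toy sketch of a GGVec-style per-edge SGD update (illustrative only)\n", "import numpy as np\n", "\n", "rng = np.random.default_rng(42)\n", "n_nodes, dim, lr = 5, 4, 0.05\n", "emb = rng.normal(scale=0.1, size=(n_nodes, dim))\n", "edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0), (3, 4, 3.0)]  # (src, dst, weight)\n", "\n", "for epoch in range(100):\n", "    for src, dst, weight in edges:\n", "        # Positive pass: pull connected nodes together, GloVe-style\n", "        err = emb[src] @ emb[dst] - np.log(weight)\n", "        g_src, g_dst = err * emb[dst], err * emb[src]\n", "        emb[src] -= lr * g_src\n", "        emb[dst] -= lr * g_dst\n", "        # Negative pass: push a random node pair apart (downweighted)\n", "        i, j = rng.integers(n_nodes, size=2)\n", "        err_neg = emb[i] @ emb[j]\n", "        g_i, g_j = err_neg * emb[j], err_neg * emb[i]\n", "        emb[i] -= lr * 0.15 * g_i\n", "        emb[j] -= lr * 0.15 * g_j" ] },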
Loss: 0.0070\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "MI: 0.13, RAND 0.10, FM: 0.10\n", "Label Prediction:\n", "\t(logit) Acc: 0.312, F1 micro: 0.312, F1 macro: 0.312\n", "\t(lgbm) Acc: 0.321, F1 micro: 0.321, F1 macro: 0.321\n" ] } ], "source": [ "ggvec_params = dict(\n", " n_components=N_COMPONENTS,\n", " order=2,\n", " tol=0.07,\n", " tol_samples=10,\n", " max_epoch=6_000,\n", " learning_rate=0.1,\n", " negative_ratio=0.15,\n", " exponent=0.33,\n", " verbose=True,\n", ")\n", "\n", "start_t = time.time()\n", "w_train = nodevectors.GGVec(**ggvec_params).fit_transform(G_train)\n", "\n", "print(f\"Time: {time.time() - start_t :.4f}\")\n", "result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)\n", "time.sleep(0.1)\n", "if HAS_LABELS:\n", " w = nodevectors.GGVec(**ggvec_params).fit_transform(G)\n", " graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Making walks... Done, T=3.73\n", "Mapping Walk Names... Done, T=13.18\n", "Training W2V... Done, T=41.50\n", "Time: 60.9425\n", "Link Prediction:\n", "\t(logit) AUC-ROC: 0.871, AUC-PR: 0.874, Acc: 0.794, F1: 0.792\n", "\t(lgbm) AUC-ROC: 0.955, AUC-PR: 0.951, Acc: 0.889, F1: 0.889\n", "Making walks... Done, T=2.12\n", "Mapping Walk Names... Done, T=13.19\n", "Training W2V... Done, T=43.05\n", "MI: 0.11, RAND 0.13, FM: 0.13\n", "Label Prediction:\n", "\t(logit) Acc: 0.334, F1 micro: 0.334, F1 macro: 0.334\n", "\t(lgbm) Acc: 0.313, F1 micro: 0.313, F1 macro: 0.313\n" ] } ], "source": [ "n2v_params = dict(\n", " n_components=N_COMPONENTS,\n", " epochs=20,\n", " walklen=60,\n", " return_weight=1.,\n", " neighbor_weight=1.,\n", " w2vparams={\n", " \"window\":3, \n", " \"negative\":5, \n", " \"iter\":2,\n", " \"batch_words\":128}\n", ")\n", "\n", "start_t = time.time()\n", "w_train = nodevectors.Node2Vec(**n2v_params).fit_transform(G_train)\n", "print(f\"Time: {time.time() - start_t :.4f}\")\n", "result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)\n", "if HAS_LABELS:\n", " w = nodevectors.Node2Vec(**n2v_params).fit_transform(G)\n", " graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time: 2.4970\n", "Link Prediction:\n", "\t(logit) AUC-ROC: 0.940, AUC-PR: 0.939, Acc: 0.867, F1: 0.866\n", "\t(lgbm) AUC-ROC: 0.957, AUC-PR: 0.954, Acc: 0.885, F1: 0.885\n", "MI: 0.13, RAND 0.11, FM: 0.11\n", "Label Prediction:\n", "\t(logit) Acc: 0.348, F1 micro: 0.348, F1 macro: 0.348\n", "\t(lgbm) Acc: 0.339, F1 micro: 0.339, F1 macro: 0.339\n" ] } ], "source": [ "pne_params = dict(\n", " n_components=N_COMPONENTS,\n", " step=5,\n", " mu=0.2,\n", " theta=0.5,\n", ")\n", "\n", "start_t = time.time()\n", "pne = nodevectors.ProNE(**pne_params)\n", "w_train = pne.fit_transform(G_train)\n", "print(f\"Time: {time.time() - start_t :.4f}\")\n", "result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)\n", "if HAS_LABELS:\n", " pne = nodevectors.ProNE(**pne_params)\n", " w = pne.fit_transform(G)\n", " graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 1/1 
[00:22<00:00, 22.44s/it]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Time: 24.7683\n", "Link Prediction:\n", "\t(logit) AUC-ROC: 0.922, AUC-PR: 0.909, Acc: 0.843, F1: 0.843\n", "\t(lgbm) AUC-ROC: 0.950, AUC-PR: 0.946, Acc: 0.874, F1: 0.877\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 1/1 [00:22<00:00, 22.16s/it]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "MI: 0.04, RAND 0.23, FM: 0.23\n", "Label Prediction:\n", "\t(logit) Acc: 0.145, F1 micro: 0.145, F1 macro: 0.145\n", "\t(lgbm) Acc: 0.310, F1 micro: 0.310, F1 macro: 0.310\n" ] } ], "source": [ "grarep_params = dict(\n", " n_components=N_COMPONENTS,\n", " order=1,\n", " embedder=TruncatedSVD(\n", " n_iter=10,\n", " random_state=42),\n", "# merger=(lambda x : np.sum(x, axis=0)),\n", " merger=lambda x : x[-1]\n", ")\n", "\n", "start_t = time.time()\n", "w_train = nodevectors.GraRep(**grarep_params).fit_transform(G_train)\n", "\n", "print(f\"Time: {time.time() - start_t :.4f}\")\n", "result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)\n", "time.sleep(0.1)\n", "if HAS_LABELS:\n", " w = nodevectors.GraRep(**grarep_params).fit_transform(G)\n", " graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time: 39.4685\n", "Link Prediction:\n", "\t(logit) AUC-ROC: 0.927, AUC-PR: 0.924, Acc: 0.859, F1: 0.860\n", "\t(lgbm) AUC-ROC: 0.937, AUC-PR: 0.936, Acc: 0.866, F1: 0.865\n", "MI: 0.04, RAND 0.07, FM: 0.07\n", "Label Prediction:\n", "\t(logit) Acc: 0.178, F1 micro: 0.178, F1 macro: 0.178\n", "\t(lgbm) Acc: 0.176, F1 micro: 0.176, F1 macro: 0.176\n" ] } ], "source": [ "ump_params = dict(\n", " embedder=umap.UMAP,\n", " n_neighbors=3,\n", " min_dist=0.,\n", " metric='cosine',\n", " normalize_graph=True,\n", " n_components=N_COMPONENTS,\n", ")\n", "\n", "start_t = time.time()\n", "w_train = nodevectors.SKLearnEmbedder(**ump_params).fit_transform(G_train)\n", "print(f\"Time: {time.time() - start_t :.4f}\")\n", "result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)\n", "if HAS_LABELS:\n", " w = nodevectors.SKLearnEmbedder(**ump_params).fit_transform(G)\n", " graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 5%|▌ | 303/6000 [00:21<06:43, 14.12it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Time: 25.4425\n", "Virtual edges: 506553\n", "Link Prediction:\n", "\t(logit) AUC-ROC: 0.671, AUC-PR: 0.709, Acc: 0.609, F1: 0.562\n", "\t(lgbm) AUC-ROC: 0.870, AUC-PR: 0.872, Acc: 0.787, F1: 0.791\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " 5%|▍ | 274/6000 [00:21<07:30, 12.72it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "MI: 0.00, RAND 0.04, FM: 0.04\n", "Label Prediction:\n", "\t(logit) Acc: 0.149, F1 micro: 0.149, F1 macro: 0.149\n", "\t(lgbm) Acc: 0.131, F1 micro: 0.131, F1 macro: 0.131\n" ] } ], "source": [ "### GLoVe with random walks ###\n", "glove_params = dict(\n", " n_components=N_COMPONENTS,\n", " tol=0.0005,\n", " max_epoch=6_000,\n", " learning_rate=0.02, \n", " max_loss=10.,\n", " max_count=50, \n", " exponent=0.5,\n", ")\n", "\n", "start_t = time.time()\n", "wg = cg.csrgraph(G_train).random_walk_resample(walklen=RESAMPLE_LEN, epochs=RESAMPLE_WALKS)\n", "w_train = 
nodevectors.Glove(**glove_params).fit_transform(wg)\n", "\n", "print(f\"Time: {time.time() - start_t :.4f}\")\n", "print(f\"Virtual edges: {wg.dst.size}\")\n", "result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)\n", "if HAS_LABELS:\n", "    wg = cg.csrgraph(G).random_walk_resample(walklen=RESAMPLE_LEN, epochs=RESAMPLE_WALKS)\n", "    w = nodevectors.Glove(**glove_params).fit_transform(wg)\n", "    graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# From the related karateclub lib (on pip)\n", "# https://github.com/benedekrozemberczki/KarateClub\n", "# from karateclub.node_embedding.neighbourhood import NodeSketch, Walklets\n", "\n", "###### Slooooowwwwwww ########\n", "# walklets_params = dict(\n", "#     walk_number=10,\n", "#     walk_length=30,\n", "#     dimensions=N_COMPONENTS,\n", "#     window_size=4,\n", "#     epochs=1,\n", "#     learning_rate=0.05\n", "# )\n", "\n", "# try:  # Karateclub models don't handle certain graphs\n", "#     start_t = time.time()\n", "#     model = Walklets(**walklets_params)\n", "#     model.fit(G_train)\n", "#     print(f\"Time: {time.time() - start_t :.3f}\")\n", "#     w_train = model.get_embedding()\n", "#     result = link_pred.LinkPrediction(w_train, G, G_train, testing_pos_edges)\n", "#     if HAS_LABELS:\n", "#         model = Walklets(**walklets_params)\n", "#         model.fit(G)\n", "#         w = model.get_embedding()\n", "#         graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)\n", "# except: pass" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Link Prediction:\n", "\t(logit) AUC-ROC: 0.572, AUC-PR: 0.569, Acc: 0.551, F1: 0.551\n", "\t(lgbm) AUC-ROC: 0.750, AUC-PR: 0.775, Acc: 0.684, F1: 0.664\n", "MI: -0.00, RAND 0.04, FM: 0.04\n", "Label Prediction:\n", "\t(logit) Acc: 0.143, F1 micro: 0.143, F1 macro: 0.143\n", "\t(lgbm) Acc: 0.115, F1 micro: 0.115, F1 macro: 0.115\n" ] } ], "source": [ "### Completely random baseline ###\n", "\n", "w = np.random.randn(len(G), N_COMPONENTS)\n", "\n", "result = link_pred.LinkPrediction(w, G, G_train, testing_pos_edges)\n", "try:\n", "    graph_eval.print_labeled_tests(w, y, test_size=TEST_SIZE, seed=SEED)\n", "except NameError:  # graph has no labels, so `y` is undefined\n", "    pass" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }