{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Node Classification on Citation Network\n", "\n", "To start, we present an end-to-end example that demonstrates how GraphScope handles a node classification task on a citation network by combining analytical, interactive, and graph neural network computation.\n", "\n", "In this example, we use the [ogbn-mag](https://ogb.stanford.edu/docs/nodeprop/#ogbn-mag) dataset. ogbn-mag is a heterogeneous network composed of a subset of the Microsoft Academic Graph. It contains four types of entities (i.e., papers, authors, institutions, and fields of study), as well as four types of directed relations, each connecting two types of entities.\n", "\n", "Given the heterogeneous ogbn-mag data, the task is to predict the class of each paper. We use both attribute and structural information to classify papers. In the graph, each paper node contains a 128-dimensional word2vec vector representing its content, which is obtained by averaging the embeddings of the words in its title and abstract. The embeddings of individual words are pre-trained. The structural information is computed on the fly.\n", "\n", "This tutorial has the following steps:\n", "- Querying graph data with Gremlin;\n", "- Running graph analytical algorithms;\n", "- Running graph-based machine learning tasks." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Install graphscope package if you are NOT in the Playground\n", "\n", "!pip3 install graphscope\n", "!pip3 uninstall -y importlib_metadata # Address a module conflict issue on colab.google. Remove this line if you are not on colab." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import the graphscope module\n", "\n", "import graphscope\n", "\n", "graphscope.set_option(show_log=False) # disable logging; set show_log=True to enable it" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the ogbn-mag dataset as a graph\n", "\n", "from graphscope.dataset import load_ogbn_mag\n", "\n", "graph = load_ogbn_mag()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Interactive query with Gremlin\n", "\n", "In this example, we launch an interactive query and use graph traversal to count the number of papers that two given authors have co-authored. To simplify the query, we assume the authors can be uniquely identified by the IDs `2` and `4307`, respectively." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the entrypoint for submitting Gremlin queries on graph g.\n", "interactive = graphscope.gremlin(graph)\n", "\n", "# Count the number of papers two authors (with id 2 and 4307) have co-authored.\n", "papers = interactive.execute(\n", "    \"g.V().has('author', 'id', 2).out('writes').where(__.in('writes').has('id', 4307)).count()\"\n", ").one()\n", "print(\"result\", papers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Graph analytics with the analytical engine\n", "\n", "Continuing our example, we run graph algorithms on the graph to generate structural features. Below, we first derive a subgraph by extracting the publications within a specific time range from the entire graph (using Gremlin!), and then run k-core decomposition and triangle counting to generate the structural features of each paper node." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract a subgraph of publications within a time range.\n", "sub_graph = interactive.subgraph(\"g.V().has('year', inside(2014, 2020)).outE('cites')\")\n", "\n", "# Project the subgraph to a simple graph by selecting papers and their citations.\n", "simple_g = sub_graph.project(vertices={\"paper\": []}, edges={\"cites\": []})\n", "# Compute the k-core and the triangle counts.\n", "kc_result = graphscope.k_core(simple_g, k=5)\n", "tc_result = graphscope.triangles(simple_g)\n", "\n", "# Add the results as new columns to the citation graph.\n", "sub_graph = sub_graph.add_column(kc_result, {\"kcore\": \"r\"})\n", "sub_graph = sub_graph.add_column(tc_result, {\"tc\": \"r\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Graph neural networks (GNNs)\n", "\n", "Next, we use the generated structural features together with the original features to train a model with the learning engine.\n", "\n", "In our example, we train a GCN model to classify the nodes (papers) into 349 categories,\n", "each of which represents a venue (e.g., pre-print or conference)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define the features for learning:\n", "# the original 128-dimensional feature vector plus the k-core and triangle-count results.\n", "paper_features = []\n", "for i in range(128):\n", "    paper_features.append(\"feat_\" + str(i))\n", "paper_features.append(\"kcore\")\n", "paper_features.append(\"tc\")\n", "\n", "# Launch a learning engine. 
Here we split the dataset: 75% for training, 10% for validation, and 15% for testing.\n", "lg = graphscope.graphlearn(\n", "    sub_graph,\n", "    nodes=[(\"paper\", paper_features)],\n", "    edges=[(\"paper\", \"cites\", \"paper\")],\n", "    gen_labels=[\n", "        (\"train\", \"paper\", 100, (0, 75)),\n", "        (\"val\", \"paper\", 100, (75, 85)),\n", "        (\"test\", \"paper\", 100, (85, 100)),\n", "    ],\n", ")\n", "\n", "# Then we define the training process, using the built-in GCN model.\n", "from graphscope.learning.examples import GCN\n", "from graphscope.learning.graphlearn.python.model.tf.optimizer import get_tf_optimizer\n", "from graphscope.learning.graphlearn.python.model.tf.trainer import LocalTFTrainer\n", "\n", "\n", "def train(config, graph):\n", "    def model_fn():\n", "        return GCN(\n", "            graph,\n", "            config[\"class_num\"],\n", "            config[\"features_num\"],\n", "            config[\"batch_size\"],\n", "            val_batch_size=config[\"val_batch_size\"],\n", "            test_batch_size=config[\"test_batch_size\"],\n", "            categorical_attrs_desc=config[\"categorical_attrs_desc\"],\n", "            hidden_dim=config[\"hidden_dim\"],\n", "            in_drop_rate=config[\"in_drop_rate\"],\n", "            neighs_num=config[\"neighs_num\"],\n", "            hops_num=config[\"hops_num\"],\n", "            node_type=config[\"node_type\"],\n", "            edge_type=config[\"edge_type\"],\n", "            full_graph_mode=config[\"full_graph_mode\"],\n", "        )\n", "\n", "    trainer = LocalTFTrainer(\n", "        model_fn,\n", "        epoch=config[\"epoch\"],\n", "        optimizer=get_tf_optimizer(\n", "            config[\"learning_algo\"], config[\"learning_rate\"], config[\"weight_decay\"]\n", "        ),\n", "    )\n", "    trainer.train_and_evaluate()\n", "\n", "\n", "# Hyperparameter configuration.\n", "config = {\n", "    \"class_num\": 349, # output dimension\n", "    \"features_num\": 130, # 128 dimension + kcore + triangle count\n", "    \"batch_size\": 500,\n", "    \"val_batch_size\": 100,\n", "    \"test_batch_size\": 100,\n", "    \"categorical_attrs_desc\": \"\",\n", "    \"hidden_dim\": 256,\n", "    \"in_drop_rate\": 0.5,\n", "    \"hops_num\": 2,\n", " 
\"neighs_num\": [5, 10],\n", "    \"full_graph_mode\": False,\n", "    \"agg_type\": \"gcn\", # mean, sum\n", "    \"learning_algo\": \"adam\",\n", "    \"learning_rate\": 0.01,\n", "    \"weight_decay\": 0.0005,\n", "    \"epoch\": 5,\n", "    \"node_type\": \"paper\",\n", "    \"edge_type\": \"cites\",\n", "}\n", "\n", "# Start training and evaluating\n", "train(config, lg)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }