{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## 论文引用网络中的节点分类任务\n", "\n", "在这一教程中,我们将展示 GraphScope 如何结合图分析、图查询和图学习的能力,处理论文引用网络中的节点分类任务。\n", "\n", "\n", "在这个例子中我们使用的是 [ogbn-mag](https://ogb.stanford.edu/docs/nodeprop/#ogbn-mag) 数据集。\"ogbn\" 是由微软学术关系图(Microsoft Academic Graph)的子集组成的异构图网络。该图中包含4种类型的实体(即论文、作者、机构和研究领域),以及连接两个实体的四种类型的有向关系边。\n", "\n", "我们需要处理的任务是,给出异构的 ogbn-mag 数据,在该图上预测每篇论文的类别。这是一个节点分类任务,该任务可以归类在各个领域、各个方向或研究小组的论文,通过对论文属性和引用图上的结构信息对论文进行分类。在该数据中,每个论文节点包含了一个从论文标题、摘要抽取的 128 维 word2vec 向量作为表征,该表征是经过预训练提前获取的;而结构信息是在计算过程中即时计算的。\n", "\n", "这一教程将会分为以下几个步骤:\n", "\n", "- 通过gremlin交互式查询图;\n", "- 执行图算法做图分析;\n", "- 执行基于图数据的机器学习任务;" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Install graphscope package if you are NOT in the Playground\n", "\n", "!pip3 install graphscope\n", "!pip3 uninstall -y importlib_metadata # Address an module conflict issue on colab.google. Remove this line if you are not on colab." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import the graphscope module\n", "\n", "import graphscope\n", "\n", "graphscope.set_option(show_log=False) # enable logging" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the obgn_mag dataset as a graph\n", "\n", "from graphscope.dataset import load_ogbn_mag\n", "\n", "graph = load_ogbn_mag()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Interactive query with gremlin\n", "\n", "在此示例中,我们启动了一个交互查询引擎,然后使用图遍历来查看两位给定作者共同撰写的论文数量。为了简化查询,我们假设作者可以分别由ID 2 和 4307 唯一标识。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the entrypoint for submitting Gremlin queries on graph g.\n", "interactive = graphscope.gremlin(graph)\n", "\n", "# Count the number of papers two authors (with id 2 and 4307) have co-authored.\n", "papers = interactive.execute(\n", " \"g.V().has('author', 'id', 2).out('writes').where(__.in('writes').has('id', 4307)).count()\"\n", ").one()\n", "print(\"result\", papers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Graph analytics with analytical engine\n", "\n", "继续我们的示例,下面我们在图数据中进行图分析来生成节点结构特征。我们首先通过在特定周期内从全图中提取论文(使用Gremlin!)来导出一个子图,然后运行 k-core 分解和三角形计数以生成每个论文节点的结构特征。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Exact a subgraph of publication within a time range.\n", "sub_graph = interactive.subgraph(\"g.V().has('year', inside(2014, 2020)).outE('cites')\")\n", "\n", "# Project the subgraph to simple graph by selecting papers and their citations.\n", "simple_g = sub_graph.project(vertices={\"paper\": []}, edges={\"cites\": []})\n", "# compute the kcore and triangle-counting.\n", "kc_result = graphscope.k_core(simple_g, k=5)\n", "tc_result = graphscope.triangles(simple_g)\n", "\n", "# Add the results as new columns to the citation graph.\n", "sub_graph = sub_graph.add_column(kc_result, {\"kcore\": \"r\"})\n", "sub_graph = sub_graph.add_column(tc_result, {\"tc\": \"r\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Graph neural networks (GNNs)\n", "\n", "接着我们利用生成的结构特征和原有特征通过 GraphScope 的学习引擎来训练一个学习模型。\n", "\n", "在本例中,我们训练了 GraphSAGE 模型,将节点(论文)分类为349个类别,每个类别代表一个出处(例如预印本和会议)。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define the features for learning,\n", "# we chose original 128-dimension feature and k-core, triangle count result as new features.\n", "paper_features = []\n", "for i in range(128):\n", " paper_features.append(\"feat_\" + str(i))\n", "paper_features.append(\"kcore\")\n", "paper_features.append(\"tc\")\n", "\n", "# Launch a learning engine. here we split the dataset, 75% as train, 10% as validation and 15% as test.\n", "lg = graphscope.graphlearn(\n", " sub_graph,\n", " nodes=[(\"paper\", paper_features)],\n", " edges=[(\"paper\", \"cites\", \"paper\")],\n", " gen_labels=[\n", " (\"train\", \"paper\", 100, (0, 75)),\n", " (\"val\", \"paper\", 100, (75, 85)),\n", " (\"test\", \"paper\", 100, (85, 100)),\n", " ],\n", ")\n", "\n", "# Then we define the training process using the example GraphSAGE model with tensorflow.\n", "try:\n", " # https://www.tensorflow.org/guide/migrate\n", " import tensorflow.compat.v1 as tf\n", " tf.disable_v2_behavior()\n", "except ImportError:\n", " import tensorflow as tf\n", "\n", "import graphscope.learning\n", "from graphscope.learning.examples import EgoGraphSAGE\n", "from graphscope.learning.examples import EgoSAGESupervisedDataLoader\n", "from graphscope.learning.examples.tf.trainer import LocalTrainer\n", "\n", "# supervised GraphSAGE.\n", "def train_sage(graph, node_type, edge_type, class_num, features_num,\n", " hops_num=2, nbrs_num=[25, 10], epochs=2,\n", " hidden_dim=256, in_drop_rate=0.5, learning_rate=0.01,\n", "):\n", " graphscope.learning.reset_default_tf_graph()\n", "\n", " dimensions = [features_num] + [hidden_dim] * (hops_num - 1) + [class_num]\n", " model = EgoGraphSAGE(dimensions, act_func=tf.nn.relu, dropout=in_drop_rate)\n", "\n", " # prepare train dataset\n", " train_data = EgoSAGESupervisedDataLoader(\n", " graph, graphscope.learning.Mask.TRAIN,\n", " node_type=node_type, edge_type=edge_type, nbrs_num=nbrs_num, hops_num=hops_num,\n", " )\n", " train_embedding = model.forward(train_data.src_ego)\n", " train_labels = train_data.src_ego.src.labels\n", " loss = tf.reduce_mean(\n", " tf.nn.sparse_softmax_cross_entropy_with_logits(\n", " labels=train_labels, logits=train_embedding,\n", " )\n", " )\n", " optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)\n", "\n", " # prepare test dataset\n", " test_data = EgoSAGESupervisedDataLoader(\n", " graph, graphscope.learning.Mask.TEST,\n", " node_type=node_type, edge_type=edge_type, nbrs_num=nbrs_num, hops_num=hops_num,\n", " )\n", " test_embedding = model.forward(test_data.src_ego)\n", " test_labels = test_data.src_ego.src.labels\n", " test_indices = tf.math.argmax(test_embedding, 1, output_type=tf.int32)\n", " test_acc = tf.div(\n", " tf.reduce_sum(tf.cast(tf.math.equal(test_indices, test_labels), tf.float32)),\n", " tf.cast(tf.shape(test_labels)[0], tf.float32),\n", " )\n", "\n", " # train and test\n", " trainer = LocalTrainer()\n", " trainer.train(train_data.iterator, loss, optimizer, epochs=epochs)\n", " trainer.test(test_data.iterator, test_acc)\n", "\n", "train_sage(lg, node_type=\"paper\", edge_type=\"cites\",\n", " class_num=349, # output dimension\n", " features_num=130, # input dimension, 128 + kcore + triangle count\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }