{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Loading Graphs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to the NetworkX-compatible APIs, GraphScope provides a set of Python APIs \n", "for loading, analyzing, and querying very large graphs.\n", "\n", "GraphScope models graph data as [property graphs](https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model), in which vertices and edges are labeled and can have many properties. This tutorial shows how GraphScope loads graphs, including:\n", "\n", "- How to load a built-in dataset quickly;\n", "- How to define the schema of a property graph;\n", "- How to load a graph from various locations;\n", "- How to serialize/deserialize a graph to/from disk.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisite\n", "\n", "First, we install GraphScope if necessary and import the package." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Install graphscope package if you are NOT in the Playground\n", "!pip3 install graphscope" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import graphscope\n", "\n", "graphscope.set_option(show_log=False) # disable logging; set show_log=True to enable it" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Built-in Datasets\n", "\n", "GraphScope comes with a set of popular datasets and utility functions to load them into memory,\n", "which makes it easy for users to get started.\n", "Here's an example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from graphscope.dataset import load_ldbc\n", "\n", "graph = load_ldbc()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In standalone mode, the data is downloaded automatically to `${HOME}/.graphscope/datasets` and is kept there for future use." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading Your Own Datasets \n", "\n", "In practice, users often need to load their own data for analysis.\n", "\n", "To build a property graph in GraphScope, we first create an empty graph using ``g()``." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import graphscope\n", "from graphscope.framework.loader import Loader\n", "\n", "graph = graphscope.g()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The class ``Graph`` has several methods:\n", "\n", "```python\n", "def add_vertices(self, vertices, label=\"_\", properties=None, vid_field=0):\n", "    pass\n", "\n", "def add_edges(self, edges, label=\"_e\", properties=None, src_label=None, dst_label=None, src_field=0, dst_field=1):\n", "    pass\n", "```\n", "These methods help users construct the schema of the property graph iteratively.\n", "\n", "We will use the files in `ldbc_sample` throughout this tutorial. You can get the files [here](https://github.com/GraphScope/gstest/tree/master/ldbc_sample). In this tutorial, they have already been downloaded to the local machine in the previous step.\n", "\n", "You can inspect the graph schema at any time with ``print(graph.schema)``.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding Vertices\n", "\n", "``add_vertices`` adds one kind of vertices to the graph. The method has the following parameters:\n", "\n", "``vertices``: The vertex data source, which can be a file location, a numpy ndarray, a pandas DataFrame, etc.\n", "\n", "A simple example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "graph = graphscope.g()\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This reads data from the location `${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv`. 
Since no additional arguments are given, the vertices are labeled ``_`` by default, the first column of the file is used as their ID, and the other columns become their properties. Both the names and the data types of the properties are deduced.\n", "\n", "Another commonly used parameter is ``label``:\n", "\n", "``label``: The label name of the vertex, defaults to ``_``.\n", "\n", "Since a property graph allows many kinds of vertices, it is recommended to give each kind of vertices a meaningful label name. For example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "graph = graphscope.g()\n", "\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n", " label=\"person\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have a graph with one kind of vertices, labeled ``person``.\n", "\n", "In addition, each kind of labeled vertices has its own properties, which are controlled by the third parameter:\n", "\n", "``properties``: A list of properties, optional, defaults to ``None``.\n", "\n", "This parameter selects the corresponding columns from the source data file or pandas DataFrame as properties. Please note that \n", "the names given in this parameter must exist in the file/DataFrame. By default (value ``None``), all columns except the ``vid_field`` column \n", "are added as properties. If it is an empty list ``[]``, no properties are added. 
\n", "\n", "For example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# All columns (firstName,lastName,gender,birthday,creationDate,locationIP,browserUsed) will be added as properties.\n", "graph = graphscope.g()\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n", " label=\"person\",\n", " properties=None,\n", ")\n", "\n", "# Only columns firstName, lastName will be added as properties.\n", "graph = graphscope.g()\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n", " label=\"person\",\n", " properties=[\"firstName\", \"lastName\"],\n", ")\n", "\n", "# No properties will be added.\n", "graph = graphscope.g()\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n", " label=\"person\",\n", " properties=[],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "``vid_field`` determines which column is used as the vertex ID (as well as the source ID or destination ID when loading edges). \n", "\n", "It can be a ``str``, the name of the column, or an ``int``, the index of the column.\n", "\n", "By default, the value is 0, so the first column is used as the vertex ID." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Select the vertex ID column by name.\n", "graph = graphscope.g()\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n", " vid_field=\"id\",\n", ")\n", "\n", "# Select the vertex ID column by index.\n", "graph = graphscope.g()\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n", " vid_field=0,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding Edges\n", "\n", "Next, let's take a look at the parameters for loading edges.\n", "\n", "``edges``: The location from which to read the data, e.g., " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "graph = graphscope.g()\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n", " label=\"person\",\n", ")\n", "\n", "# Note we already added a vertex label named 'person'.\n", "graph = graph.add_edges(\n", " Loader(\n", " \"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv\",\n", " delimiter=\"|\",\n", " ),\n", " src_label=\"person\",\n", " dst_label=\"person\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This loads edges labeled ``_e`` (the default value) whose source and destination vertices are both labeled ``person``, using the **first column** as the source vertex ID, the **second column** as the destination vertex ID, and the remaining columns as properties.\n", "\n", "As with vertices, we can use the `label` parameter to assign a label name and `properties` to select properties.\n", "\n", "``label``: The label name of the edges, defaults to ``_e``. (It's recommended to use a meaningful label name.)\n", "\n", "``properties``: A list of properties, defaults to ``None`` (add all columns as properties). 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "graph = graphscope.g()\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n", " label=\"person\",\n", ")\n", "graph = graph.add_edges(\n", " Loader(\n", " \"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv\",\n", " delimiter=\"|\",\n", " ),\n", " label=\"knows\",\n", " src_label=\"person\",\n", " dst_label=\"person\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Differ to vertices, edges have some additional parameters.\n", "\n", "```src_label```: The label name of the source vertex. \n", "```dst_label```: The label name of the destination vertex, it can be different to the ``src_label``, \n", "```src_field``` and ```dst_field```: The columns used for source(destination) vertex id. Default to 0 and 1, respectively.\n", "\n", "e.g., " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "graph = graphscope.g()\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n", " label=\"person\",\n", ")\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv\", delimiter=\"|\"),\n", " label=\"comment\",\n", ")\n", "\n", "# Please note we already added a vertex label named 'person'.\n", "graph = graph.add_edges(\n", " Loader(\n", " \"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv\",\n", " delimiter=\"|\",\n", " ),\n", " label=\"likes\",\n", " src_label=\"person\",\n", " dst_label=\"comment\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# examples for ``src_field`` and ``dst_field``\n", "\n", "graph = graphscope.g()\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", 
delimiter=\"|\"),\n", " label=\"person\",\n", ")\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv\", delimiter=\"|\"),\n", " label=\"comment\",\n", ")\n", "\n", "graph = graph.add_edges(\n", " Loader(\n", " \"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv\",\n", " delimiter=\"|\",\n", " ),\n", " label=\"likes\",\n", " src_label=\"person\",\n", " dst_label=\"comment\",\n", " src_field=\"Person.id\",\n", " dst_field=\"Comment.id\",\n", ")\n", "# Or use the index.\n", "# graph = graph.add_edges(Loader('${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv', delimiter='|'), label='likes', src_label='person', dst_label='comment', src_field=0, dst_field=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced Usage\n", "\n", "This section covers some advanced usage for homogeneous graphs and very complex graphs.\n", "\n", "### Deduce vertex labels when not ambiguous\n", "\n", "If a graph contains only one kind of vertices, the vertex labels can be omitted when adding edges.\n", "GraphScope will use that label as both the source and the destination vertex label." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "graph = graphscope.g()\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n", " label=\"person\",\n", ")\n", "# GraphScope will set ``src_label`` and ``dst_label`` to ``person`` automatically.\n", "graph = graph.add_edges(\n", " Loader(\n", " \"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv\",\n", " delimiter=\"|\",\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inducing vertices from edges\n", "\n", "If a user adds edges with an unseen ``src_label`` or ``dst_label``, GraphScope will extract a vertex table for the given labels from the edge data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "graph = graphscope.g()\n", "# Deduce vertex label `person` from the source and destination endpoints of edges.\n", "graph = graph.add_edges(\n", " Loader(\n", " \"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv\",\n", " delimiter=\"|\",\n", " ),\n", " src_label=\"person\",\n", " dst_label=\"person\",\n", ")\n", "\n", "graph = graphscope.g()\n", "# Deduce the vertex label `person` from the source endpoint,\n", "# and vertex label `comment` from the destination endpoint of edges.\n", "graph = graph.add_edges(\n", " Loader(\n", " \"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv\",\n", " delimiter=\"|\",\n", " ),\n", " label=\"likes\",\n", " src_label=\"person\",\n", " dst_label=\"comment\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple relations\n", "\n", "In some cases, one edge label may connect more than one pair of vertex labels. 
For example, in a\n", "graph, two kinds of edges are labeled with ``likes`` but represent two relations,\n", "i.e., ``person`` -> ``likes`` -> ``comment`` and ``person`` -> ``likes`` -> ``post``.\n", "\n", "In this case, we can simply add the relation again with the same edge label,\n", "but with different source and destination labels.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sess = graphscope.session(cluster_type=\"hosts\", num_workers=1, mode=\"lazy\")\n", "graph = sess.g()\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n", " label=\"person\",\n", ")\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv\", delimiter=\"|\"),\n", " label=\"comment\",\n", ")\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/post_0_0.csv\", delimiter=\"|\"),\n", " label=\"post\",\n", ")\n", "\n", "graph = graph.add_edges(\n", " Loader(\n", " \"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv\",\n", " delimiter=\"|\",\n", " ),\n", " label=\"likes\",\n", " src_label=\"person\",\n", " dst_label=\"comment\",\n", ")\n", "\n", "graph = graph.add_edges(\n", " Loader(\n", " \"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_post_0_0.csv\",\n", " delimiter=\"|\",\n", " ),\n", " label=\"likes\",\n", " src_label=\"person\",\n", " dst_label=\"post\",\n", ")\n", "graph = sess.run(graph)\n", "print(graph.schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Please note:\n", "\n", " 1. This feature (multiple relations using the same edge label) is currently only available in `lazy` mode.\n", " 2. 
When several configurations share the same `Label`, \n", " their attributes should be the same in number and type, and preferably \n", " have the same names, because the data of the same `Label` is put into one table, \n", " and the attribute names are taken from the first configuration.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specifying data types of properties manually\n", "\n", "GraphScope will deduce data types from input files, and it works as expected in most cases.\n", "However, sometimes users may want to specify the data types themselves, e.g." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "graph = graphscope.g()\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/post_0_0.csv\", delimiter=\"|\"),\n", " label=\"post\",\n", " properties=[\"content\", (\"length\", \"int\")],\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "This forces the property to be cast to and loaded as the specified data type. Each such entry is a tuple of the property name and the type.\n", "In this case, the property ``length`` will have type ``int`` rather than the default ``int64_t``. The supported types are ``int``, ``int64``, ``float``, ``double``, and ``str``.\n", "\n", "\n", "### Other parameters of graph\n", "\n", "The class ``Graph`` has four meta options, which are:\n", "\n", "- ``oid_type``, can be ``int32_t``, ``int64_t`` or ``string``. Defaults to ``int64_t`` for efficiency.\n", " If the ID column can't be represented by ``int64_t``, use ``string`` instead.\n", " When the IDs are known to be within the range of ``int32``, using ``int32_t`` can help\n", " reduce memory usage.\n", "- ``directed``, bool, defaults to ``True``. 
Controls whether a directed or an undirected graph is loaded.\n", "- ``generate_eid``, bool, defaults to ``True``; whether to generate a unique ID for every edge automatically.\n", "- ``retain_oid``, bool, defaults to ``True``; whether to keep the original ID in the vertex table.\n", "\n", "\n", "### Putting them Together\n", "\n", "Let's make this example complete. \n", "A more complex example that loads the LDBC SNB graph can be found [here](https://github.com/alibaba/GraphScope/blob/main/python/graphscope/dataset/ldbc.py)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "graph = graphscope.g(oid_type=\"int64_t\", directed=True, generate_eid=True, retain_oid=True)\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n", " label=\"person\",\n", ")\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv\", delimiter=\"|\"),\n", " label=\"comment\",\n", ")\n", "graph = graph.add_vertices(\n", " Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/post_0_0.csv\", delimiter=\"|\"),\n", " label=\"post\",\n", ")\n", "\n", "graph = graph.add_edges(\n", " Loader(\n", " \"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv\",\n", " delimiter=\"|\",\n", " ),\n", " label=\"knows\",\n", " src_label=\"person\",\n", " dst_label=\"person\",\n", ")\n", "graph = graph.add_edges(\n", " Loader(\n", " \"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv\",\n", " delimiter=\"|\",\n", " ),\n", " label=\"likes\",\n", " src_label=\"person\",\n", " dst_label=\"comment\",\n", ")\n", "\n", "print(graph.schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading From Pandas or Numpy\n", "\n", "The data sources used above are `Loader` objects. A loader wraps either a location or the data itself. 
\n", "GraphScope supports loading a graph from pandas DataFrames or numpy ndarrays, which makes it easy to construct a graph right in the Python console.\n", "\n", "Apart from the loader, the other fields such as properties, label, etc. are the same as in the examples above.\n", "\n", "### From Pandas" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "leader_id = np.array([0, 0, 0, 1, 1, 3, 3, 6, 6, 6, 7, 7, 8])\n", "member_id = np.array([2, 3, 4, 5, 6, 6, 8, 0, 2, 8, 8, 9, 9])\n", "group_size = np.array([4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2])\n", "e_data = np.transpose(np.vstack([leader_id, member_id, group_size]))\n", "df_group = pd.DataFrame(e_data, columns=[\"leader_id\", \"member_id\", \"group_size\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "student_id = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n", "avg_score = np.array(\n", " [490.33, 164.5, 190.25, 762.0, 434.2, 513.0, 569.0, 25.0, 308.0, 87.0]\n", ")\n", "v_data = np.transpose(np.vstack([student_id, avg_score]))\n", "df_student = pd.DataFrame(v_data, columns=[\"student_id\", \"avg_score\"]).astype(\n", " {\"student_id\": np.int64}\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use a dataframe as the data source; properties are omitted, so col_0/col_1 are used as src/dst by default.\n", "# (For vertices, col_0 is used as the vertex ID by default.)\n", "graph = graphscope.g().add_vertices(df_student).add_edges(df_group)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### From Numpy\n", "\n", "Note that each array is a column; we pass the columns to the loader in a COO-matrix-like format.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "array_group = [df_group[col].values for col in [\"leader_id\", \"member_id\", \"group_size\"]]\n", "array_student = 
[df_student[col].values for col in [\"student_id\", \"avg_score\"]]\n", "\n", "graph = graphscope.g().add_vertices(array_student).add_edges(array_group)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loader Variants\n", "\n", "\n", "When a loader wraps a location, it may contain only a ``str``.\n", "The string follows the URI standard. When receiving a request to load a graph from a location, \n", "``graphscope`` parses the URI and invokes the corresponding loader according to the scheme.\n", "\n", "Currently, ``graphscope`` supports loaders for ``local``, ``s3``, ``oss``, and ``hdfs``.\n", "Under the hood, data is loaded in a distributed fashion by [v6d](https://github.com/v6d-io/v6d), which takes advantage\n", "of [fsspec](https://github.com/intake/filesystem_spec) to resolve specific schemes and formats.\n", "Any additional configuration can be passed as kwargs of ``Loader``, which are parsed \n", "directly by the specific class, e.g., ``host`` and ``port`` for ``hdfs``, or the access ID and secret access key (``key``, ``secret``) for ``oss`` or ``s3``.\n", "\n", "```python\n", "\n", "from graphscope.framework.loader import Loader\n", "\n", "ds1 = Loader(\"file:///var/datafiles/group.e\")\n", "ds2 = Loader(\"oss://graphscope_bucket/datafiles/group.e\", key='access-id', secret='secret-access-key', endpoint='oss-cn-hangzhou.aliyuncs.com')\n", "ds3 = Loader(\"hdfs:///datafiles/group.e\", host='localhost', port='9000', extra_conf={'conf1': 'value1'})\n", "ds4 = Loader(\"s3://datafiles/group.e\", key='access-id', secret='secret-access-key', client_kwargs={'region_name': 'us-east-1'})\n", "```\n", "\n", "Users can implement customized loaders to support additional data sources. 
Taking [ossfs](https://github.com/v6d-io/v6d/blob/main/modules/io/adaptors/ossfs.py) as an example, a user needs to subclass ``AbstractFileSystem``, which\n", "is used to resolve the specific protocol scheme, and ``AbstractBufferedFile`` to read and write.\n", "The only methods the user needs to override are ``_upload_chunk``,\n", "``_initiate_upload`` and ``_fetch_range``. Finally, the user needs to call ``fsspec.register_implementation('protocol_name', 'protocol_file_system')`` to register the corresponding resolver." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Serialization and Deserialization (Only available in k8s mode)\n", "When the graph is huge, loading it can take a large amount of time (e.g., hours).\n", "GraphScope provides serialization and deserialization for graph data, \n", "which dump and load the constructed graphs in the form of binary data to (from) disk. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Serialization\n", "\n", "`graph.save_to` takes a `path` argument, indicating the location to store the binary data.\n", "\n", "`graph.save_to('/tmp/serial')`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deserialization\n", "\n", "`Graph.load_from` is a `classmethod`; its `path` argument must be exactly the same as the `path` passed to `graph.save_to`. Please note that during serialization, each worker dumps its own data to files with its index as a suffix. 
Thus the number of workers for deserialization must be **exactly the same** as that for serialization.\n", "\n", "In addition, `Graph.load_from` takes an extra `sess` parameter that specifies the session in which the graph is deserialized.\n", "\n", "```python\n", "import graphscope\n", "from graphscope import Graph\n", "\n", "sess = graphscope.session()\n", "# The path must match the one passed to graph.save_to.\n", "deserialized_graph = Graph.load_from('/tmp/serial', sess)\n", "print(deserialized_graph.schema)\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }