{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Loading data into StellarGraph from Pandas\n", "\n", "> This demo explains how to load data into a form that can be used by the StellarGraph library. [See all other demos](../README.md).\n" ] }, { "cell_type": "markdown", "metadata": { "nbsphinx": "hidden", "tags": [ "CloudRunner" ] }, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[The StellarGraph library](https://github.com/stellargraph/stellargraph) supports loading graph information from Pandas. [Pandas](https://pandas.pydata.org) is a library for working with data frames.\n", "\n", "Loading through Pandas offers a good balance between performance and convenience.\n", "\n", "The StellarGraph library supports many deep machine learning (ML) algorithms on [graphs](https://en.wikipedia.org/wiki/Graph_%28discrete_mathematics%29). A graph consists of a set of *nodes* connected by *edges*, potentially with information associated with each node and edge. Any task using the StellarGraph library needs data to be loaded into an instance of the `StellarGraph` class. This class stores the graph structure (the nodes and the edges between them), as well as information about them:\n", "\n", "- *node types* and *edge types*: a class or category to which the nodes and edges belong, dictating what features are available on a node, and potentially signifying some sort of semantic meaning (this is different from the machine learning label for a node)\n", "- *node features* and *edge features*: vectors of numbers associated with each node or edge\n", "- *edge weights*: a number associated with each edge\n", "\n", "All of these are optional, because they have sensible defaults if they're not relevant to the task at hand.\n", "\n", "This notebook walks through loading several kinds of graphs using Pandas.
Pandas is a reasonably efficient way to load data and is convenient for preprocessing. The kinds of graphs covered are:\n", "\n", "- homogeneous graph without features (a homogeneous graph is one with only one type of node and one type of edge)\n", "- homogeneous graph with node/edge features\n", "- homogeneous graph with edge weights\n", "- directed graphs (a graph is directed if edges have \"start\" and \"end\" nodes, instead of just connecting two nodes)\n", "- heterogeneous graphs (more than one node type and/or more than one edge type) with and without node/edge features or edge weights; this includes knowledge graphs\n", "- real data: homogeneous graph from CSV files (an example of reading data from files and doing some preprocessing)\n", "\n", "> StellarGraph supports loading data from many sources with all sorts of data preprocessing, via [Pandas](https://pandas.pydata.org) DataFrames, [NumPy](https://www.numpy.org) arrays, [Neo4j](https://neo4j.com) and [NetworkX](https://networkx.github.io) graphs. See [all loading demos](README.md) for more details.\n", "\n", "The [documentation](https://stellargraph.readthedocs.io/en/stable/api.html#stellargraph.StellarGraph) for the `StellarGraph` class includes a compressed reminder of everything discussed in this file, as well as explanations of all of the parameters.\n", "\n", "The `StellarGraph` class is available at the top level of the `stellargraph` library:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "nbsphinx": "hidden", "tags": [ "CloudRunner" ] }, "outputs": [], "source": [ "# install StellarGraph if running on Google Colab\n", "import sys\n", "if 'google.colab' in sys.modules:\n", "    %pip install -q stellargraph[demos]==1.2.1" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "nbsphinx": "hidden", "tags": [ "VersionCheck" ] }, "outputs": [], "source": [ "# verify that we're using the correct version of StellarGraph for this notebook\n", "import stellargraph as sg\n", "\n", "try:\n", "    sg.utils.validate_notebook_version(\"1.2.1\")\n", "except AttributeError:\n", "    raise ValueError(\n", "        f\"This notebook requires StellarGraph version 1.2.1, but a different version {sg.__version__} is installed. Please see .\"\n", "    ) from None" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from stellargraph import StellarGraph" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading from anything, via Pandas\n", "\n", "Pandas DataFrames are tables of data that can be created from [many input sources](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), such as [CSV files](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) and [SQL databases](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html). StellarGraph builds on this power by allowing construction from these DataFrames." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas is widely supported by other libraries and products, like [scikit-learn](http://scikit-learn.github.io/stable), so StellarGraph users can easily benefit from these too." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Homogeneous graph without features\n", "\n", "We'll start with a homogeneous graph without any node features. This means the graph consists of only nodes and edges without any information other than a unique identifier.\n", "\n", "The basic form of constructing a `StellarGraph` is passing in an edge `DataFrame` with two columns (`source` and `target`), where each row represents a pair of nodes that are connected.
Let's construct a `StellarGraph` representing a square with a diagonal:\n", "\n", "```\n", "a -- b\n", "| \\  |\n", "|  \\ |\n", "d -- c\n", "```\n", "\n", "We'll start with a synthetic DataFrame defined in code here (there are some examples later of reading DataFrames from files).\n", "\n", "Each row represents a connection: for instance, the first one is the edge from `a` to `b`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>source</th><th>target</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>a</td><td>b</td></tr>\n", "    <tr><th>1</th><td>b</td><td>c</td></tr>\n", "    <tr><th>2</th><td>c</td><td>d</td></tr>\n", "    <tr><th>3</th><td>d</td><td>a</td></tr>\n", "    <tr><th>4</th><td>a</td><td>c</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " source target\n", "0 a b\n", "1 b c\n", "2 c d\n", "3 d a\n", "4 a c" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "square_edges = pd.DataFrame(\n", " {\"source\": [\"a\", \"b\", \"c\", \"d\", \"a\"], \"target\": [\"b\", \"c\", \"d\", \"a\", \"c\"]}\n", ")\n", "square_edges" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given our edges, we can create a `StellarGraph` directly:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "square = StellarGraph(edges=square_edges)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `info` method ([docs](https://stellargraph.readthedocs.io/en/stable/api.html#stellargraph.StellarGraph.info)) gives a high-level summary of a `StellarGraph`:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 4, Edges: 5\n", "\n", " Node types:\n", " default: [4]\n", " Features: none\n", " Edge types: default-default->default\n", "\n", " Edge types:\n", " default-default->default: [5]\n", " Weights: all 1 (default)\n", " Features: none\n" ] } ], "source": [ "print(square.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On this square, it tells us that there are 4 nodes of type `default` (a homogeneous graph still has node and edge types, but they default to `default`), with no features, and one type of edge that touches them. It also tells us that there are 5 edges of type `default` that go between nodes of type `default`.
This matches what we expect: it's a graph with 4 nodes and 5 edges and one type of each.\n", "\n", "The default node type and edge types can be set using the `node_type_default` and `edge_type_default` parameters to `StellarGraph(...)`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 4, Edges: 5\n", "\n", " Node types:\n", " corner: [4]\n", " Features: none\n", " Edge types: corner-line->corner\n", "\n", " Edge types:\n", " corner-line->corner: [5]\n", " Weights: all 1 (default)\n", " Features: none\n" ] } ], "source": [ "square_named = StellarGraph(\n", " edges=square_edges, node_type_default=\"corner\", edge_type_default=\"line\"\n", ")\n", "print(square_named.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The names of the columns used for the edges can be controlled with the `source_column` and `target_column` parameters to `StellarGraph(...)`. For instance, maybe our graph comes from a file with `first` and `second` columns:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>first</th><th>second</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>a</td><td>b</td></tr>\n", "    <tr><th>1</th><td>b</td><td>c</td></tr>\n", "    <tr><th>2</th><td>c</td><td>d</td></tr>\n", "    <tr><th>3</th><td>d</td><td>a</td></tr>\n", "    <tr><th>4</th><td>a</td><td>c</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " first second\n", "0 a b\n", "1 b c\n", "2 c d\n", "3 d a\n", "4 a c" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "square_edges_first_second = square_edges.rename(\n", " columns={\"source\": \"first\", \"target\": \"second\"}\n", ")\n", "square_edges_first_second" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 4, Edges: 5\n", "\n", " Node types:\n", " default: [4]\n", " Features: none\n", " Edge types: default-default->default\n", "\n", " Edge types:\n", " default-default->default: [5]\n", " Weights: all 1 (default)\n", " Features: none\n" ] } ], "source": [ "square_first_second = StellarGraph(\n", " edges=square_edges_first_second, source_column=\"first\", target_column=\"second\"\n", ")\n", "print(square_first_second.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Homogeneous graph with features\n", "\n", "For many real-world problems, we have more than just graph structure: we have information about the nodes and edges. For instance, we might have a graph of academic papers (nodes) and how they cite each other (edges): we might have information about the nodes such as the authors and the publication year, and even the abstract or full paper contents. If we're doing a machine learning task, it can be useful to feed this information into models. The `StellarGraph` class supports this using a Pandas DataFrame: each row corresponds to a feature vector for a node or edge.\n", "\n", "### Node features\n", "\n", "Let's imagine the nodes have two features, which might be their coordinates, or maybe some other piece of information. We'll continue using synthetic DataFrames, but these could easily be read from a file. 
(There's an example in the \"Real data: Homogeneous graph from CSV files\" section at the end of this notebook.)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>x</th><th>y</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>a</th><td>1</td><td>-0.2</td></tr>\n", "    <tr><th>b</th><td>2</td><td>0.3</td></tr>\n", "    <tr><th>c</th><td>3</td><td>0.0</td></tr>\n", "    <tr><th>d</th><td>4</td><td>-0.5</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " x y\n", "a 1 -0.2\n", "b 2 0.3\n", "c 3 0.0\n", "d 4 -0.5" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "square_node_data = pd.DataFrame(\n", " {\"x\": [1, 2, 3, 4], \"y\": [-0.2, 0.3, 0.0, -0.5]}, index=[\"a\", \"b\", \"c\", \"d\"]\n", ")\n", "square_node_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`StellarGraph` uses the index of the DataFrame as the connection between a node and a row of the DataFrame. Notice that the `square_node_data` DataFrame has `a`, ..., `d` as its index, matching the identifiers used in the edges.\n", "\n", "We've now got all the right node data, in addition to the edges from before, so now we can create a `StellarGraph`." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 4, Edges: 5\n", "\n", " Node types:\n", " default: [4]\n", " Features: float32 vector, length 2\n", " Edge types: default-default->default\n", "\n", " Edge types:\n", " default-default->default: [5]\n", " Weights: all 1 (default)\n", " Features: none\n" ] } ], "source": [ "square_node_features = StellarGraph(square_node_data, square_edges)\n", "print(square_node_features.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the output of `info` now says that the nodes of the `default` type have 2 features.\n", "\n", "We can also give the node and edge types helpful names, using either the `node_type_default`/`edge_type_default` parameters we saw before, or by passing the DataFrames in with a dictionary, where the key is the name of the type."
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 4, Edges: 5\n", "\n", " Node types:\n", " corner: [4]\n", " Features: float32 vector, length 2\n", " Edge types: corner-line->corner\n", "\n", " Edge types:\n", " corner-line->corner: [5]\n", " Weights: all 1 (default)\n", " Features: none\n" ] } ], "source": [ "square_named_node_features = StellarGraph(\n", " {\"corner\": square_node_data}, {\"line\": square_edges}\n", ")\n", "print(square_named_node_features.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Edge features\n", "\n", "Edges can have features in the same way as nodes. Any columns that don't have a special meaning are taken as feature vector elements. This means that the source and target columns are not included in the feature vectors (nor are the weight or edge type columns, which are discussed later).\n", "\n", "Let's imagine the edges have 3 features each." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>source</th><th>target</th><th>A</th><th>B</th><th>C</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>a</td><td>b</td><td>-1</td><td>0.4</td><td>12</td></tr>\n", "    <tr><th>1</th><td>b</td><td>c</td><td>2</td><td>0.1</td><td>34</td></tr>\n", "    <tr><th>2</th><td>c</td><td>d</td><td>-3</td><td>0.9</td><td>56</td></tr>\n", "    <tr><th>3</th><td>d</td><td>a</td><td>4</td><td>0.0</td><td>78</td></tr>\n", "    <tr><th>4</th><td>a</td><td>c</td><td>-5</td><td>0.9</td><td>90</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " source target A B C\n", "0 a b -1 0.4 12\n", "1 b c 2 0.1 34\n", "2 c d -3 0.9 56\n", "3 d a 4 0.0 78\n", "4 a c -5 0.9 90" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "square_edge_data = pd.DataFrame(\n", " {\n", " \"source\": [\"a\", \"b\", \"c\", \"d\", \"a\"],\n", " \"target\": [\"b\", \"c\", \"d\", \"a\", \"c\"],\n", " \"A\": [-1, 2, -3, 4, -5],\n", " \"B\": [0.4, 0.1, 0.9, 0, 0.9],\n", " \"C\": [12, 34, 56, 78, 90],\n", " }\n", ")\n", "square_edge_data" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 4, Edges: 5\n", "\n", " Node types:\n", " corner: [4]\n", " Features: float32 vector, length 2\n", " Edge types: corner-line->corner\n", "\n", " Edge types:\n", " corner-line->corner: [5]\n", " Weights: all 1 (default)\n", " Features: float32 vector, length 3\n" ] } ], "source": [ "square_named_features = StellarGraph(\n", " {\"corner\": square_node_data}, {\"line\": square_edge_data}\n", ")\n", "print(square_named_features.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the output of `info` now says that the edges of the `line` type have 3 features, in addition to the 2 features for each node of type `corner`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Homogeneous graph with edge weights\n", "\n", "Some algorithms can understand edge weights, which can be used as a measure of the strength of the connection, or a measure of distance between nodes. A `StellarGraph` instance can have weighted edges, by including a `weight` column in the DataFrame of edges.\n", "\n", "We'll continue with the synthetic square example, by adding that extra `weight` column into the DataFrame. This column might be part of the data naturally, or it might need to be computed. 
Either of these is fine with Pandas: in the first case, it can be loaded at the same time as loading the source and target information, and in the second, the full power of Pandas is available to compute it (such as manipulating other information associated with the edge DataFrame, or even by comparing the nodes at each end)." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>source</th><th>target</th><th>weight</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>a</td><td>b</td><td>1.00</td></tr>\n", "    <tr><th>1</th><td>b</td><td>c</td><td>0.20</td></tr>\n", "    <tr><th>2</th><td>c</td><td>d</td><td>3.40</td></tr>\n", "    <tr><th>3</th><td>d</td><td>a</td><td>5.67</td></tr>\n", "    <tr><th>4</th><td>a</td><td>c</td><td>1.00</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " source target weight\n", "0 a b 1.00\n", "1 b c 0.20\n", "2 c d 3.40\n", "3 d a 5.67\n", "4 a c 1.00" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "square_weighted_edges = pd.DataFrame(\n", " {\n", " \"source\": [\"a\", \"b\", \"c\", \"d\", \"a\"],\n", " \"target\": [\"b\", \"c\", \"d\", \"a\", \"c\"],\n", " \"weight\": [1.0, 0.2, 3.4, 5.67, 1.0],\n", " }\n", ")\n", "square_weighted_edges" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 4, Edges: 5\n", "\n", " Node types:\n", " default: [4]\n", " Features: none\n", " Edge types: default-default->default\n", "\n", " Edge types:\n", " default-default->default: [5]\n", " Weights: range=[0.2, 5.67], mean=2.254, std=2.25534\n", " Features: none\n" ] } ], "source": [ "square_weighted = StellarGraph(edges=square_weighted_edges)\n", "print(square_weighted.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the output of `info` now shows additional information about edge weights.\n", "\n", "Edge weights can be used with node and edge features; for instance, we can create a graph similar to the last graph in the \"Homogeneous graph with features\" section, but with our edge weights:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>source</th><th>target</th><th>weight</th><th>A</th><th>B</th><th>C</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>a</td><td>b</td><td>1.00</td><td>-1</td><td>0.4</td><td>12</td></tr>\n", "    <tr><th>1</th><td>b</td><td>c</td><td>0.20</td><td>2</td><td>0.1</td><td>34</td></tr>\n", "    <tr><th>2</th><td>c</td><td>d</td><td>3.40</td><td>-3</td><td>0.9</td><td>56</td></tr>\n", "    <tr><th>3</th><td>d</td><td>a</td><td>5.67</td><td>4</td><td>0.0</td><td>78</td></tr>\n", "    <tr><th>4</th><td>a</td><td>c</td><td>1.00</td><td>-5</td><td>0.9</td><td>90</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " source target weight A B C\n", "0 a b 1.00 -1 0.4 12\n", "1 b c 0.20 2 0.1 34\n", "2 c d 3.40 -3 0.9 56\n", "3 d a 5.67 4 0.0 78\n", "4 a c 1.00 -5 0.9 90" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "square_weighted_edge_data = pd.DataFrame(\n", " {\n", " \"source\": [\"a\", \"b\", \"c\", \"d\", \"a\"],\n", " \"target\": [\"b\", \"c\", \"d\", \"a\", \"c\"],\n", " \"weight\": [1.0, 0.2, 3.4, 5.67, 1.0],\n", " \"A\": [-1, 2, -3, 4, -5],\n", " \"B\": [0.4, 0.1, 0.9, 0, 0.9],\n", " \"C\": [12, 34, 56, 78, 90],\n", " }\n", ")\n", "square_weighted_edge_data" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 4, Edges: 5\n", "\n", " Node types:\n", " corner: [4]\n", " Features: float32 vector, length 2\n", " Edge types: corner-line->corner\n", "\n", " Edge types:\n", " corner-line->corner: [5]\n", " Weights: range=[0.2, 5.67], mean=2.254, std=2.25534\n", " Features: float32 vector, length 3\n" ] } ], "source": [ "square_features_weighted = StellarGraph(\n", " {\"corner\": square_node_data}, {\"line\": square_weighted_edge_data}\n", ")\n", "print(square_features_weighted.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Directed graphs\n", "\n", "Some graphs have edge directions, where going from source to target has a different meaning to going from target to source.\n", "\n", "A directed graph can be created by using the `StellarDiGraph` class instead of the `StellarGraph` one. The construction is almost identical, and we can reuse any of the DataFrames that we created in the sections above. For instance, continuing from the previous cell, we can have a directed homogeneous graph with node features and edge weights." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarDiGraph: Directed multigraph\n", " Nodes: 4, Edges: 5\n", "\n", " Node types:\n", " corner: [4]\n", " Features: float32 vector, length 2\n", " Edge types: corner-line->corner\n", "\n", " Edge types:\n", " corner-line->corner: [5]\n", " Weights: range=[0.2, 5.67], mean=2.254, std=2.25534\n", " Features: float32 vector, length 3\n" ] } ], "source": [ "from stellargraph import StellarDiGraph\n", "\n", "square_features_weighted_directed = StellarDiGraph(\n", " {\"corner\": square_node_data}, {\"line\": square_weighted_edge_data}\n", ")\n", "print(square_features_weighted_directed.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Everything discussed about `StellarGraph` in this file also works with `StellarDiGraph`, including parameters like `node_type_default` and `source_column`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Heterogeneous graphs\n", "\n", "Some graphs have multiple types of nodes and multiple types of edges.\n", "\n", "For example, an academic citation network that includes authors might have `wrote` edges connecting `author` nodes to `paper` nodes, in addition to the `cites` edges between `paper` nodes. There could be `supervised` edges between `author`s ([example](https://academictree.org)) too, or any number of additional node and edge types. A knowledge graph (also known as RDF, a triple store or a knowledge base) is an extreme form of a heterogeneous graph, with dozens, hundreds or even thousands of edge (or relation) types.
Typically in a knowledge graph, edges and their types represent the information associated with a node, rather than node features.\n", "\n", "`StellarGraph` supports all forms of heterogeneous graphs.\n", "\n", "A heterogeneous `StellarGraph` can be constructed in a similar way to a homogeneous graph, except we pass a dictionary with multiple elements instead of a single element like we did for the square examples in the \"Homogeneous graph with features\" section and others above. For a heterogeneous graph, a dictionary has to be passed; passing a single DataFrame does not work.\n", "\n", "Let's return to the square graph from earlier:\n", "\n", "```\n", "a -- b\n", "| \\  |\n", "|  \\ |\n", "d -- c\n", "```\n", "\n", "### Multiple node types\n", "\n", "Suppose `a` is of type `foo` and has no features, but `b`, `c` and `d` are of type `bar` and have two features each, e.g. for `b`, `y = 0.4, z = 100`. Since the features are different shapes (`a` has zero), they need to be modeled as different types, with separate `DataFrame`s." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>a</th></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ "Empty DataFrame\n", "Columns: []\n", "Index: [a]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "square_foo = pd.DataFrame(index=[\"a\"])\n", "square_foo" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>y</th><th>z</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>b</th><td>0.4</td><td>100</td></tr>\n", "    <tr><th>c</th><td>0.1</td><td>200</td></tr>\n", "    <tr><th>d</th><td>0.9</td><td>300</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " y z\n", "b 0.4 100\n", "c 0.1 200\n", "d 0.9 300" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "square_bar = pd.DataFrame(\n", " {\"y\": [0.4, 0.1, 0.9], \"z\": [100, 200, 300]}, index=[\"b\", \"c\", \"d\"]\n", ")\n", "square_bar" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have the information for the two node types `foo` and `bar` in separate DataFrames, so we can now put them in a dictionary to create a `StellarGraph`. Notice that `info()` is now reporting multiple node types, as well as information specific to each." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 4, Edges: 5\n", "\n", " Node types:\n", " bar: [3]\n", " Features: float32 vector, length 2\n", " Edge types: bar-default->bar, bar-default->foo\n", " foo: [1]\n", " Features: none\n", " Edge types: foo-default->bar\n", "\n", " Edge types:\n", " foo-default->bar: [2]\n", " Weights: all 1 (default)\n", " Features: none\n", " bar-default->bar: [2]\n", " Weights: all 1 (default)\n", " Features: none\n", " bar-default->foo: [1]\n", " Weights: all 1 (default)\n", " Features: none\n" ] } ], "source": [ "square_foo_and_bar = StellarGraph({\"foo\": square_foo, \"bar\": square_bar}, square_edges)\n", "print(square_foo_and_bar.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Node IDs (the DataFrame index) need to be unique across all types. For example, renaming the `a` corner to `b`, as `square_foo_overlap` does in the next cell, is not accepted: a `StellarGraph(...)` call will throw an error." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>x</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>b</th><td>-1</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " x\n", "b -1" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "square_foo_overlap = pd.DataFrame({\"x\": [-1]}, index=[\"b\"])\n", "square_foo_overlap" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# Uncomment to see the error\n", "# StellarGraph({\"foo\": square_foo_overlap, \"bar\": square_bar}, square_edges)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the node IDs aren't unique across types, one way to make them unique is to add a string prefix. You'll need to add the same prefix to the node IDs used in the edges too. Adding a prefix can be done by replacing the index:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>x</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>foo-b</th><td>-1</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " x\n", "foo-b -1" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "square_foo_overlap_prefix = square_foo_overlap.set_index(\n", " \"foo-\" + square_foo_overlap.index.astype(str)\n", ")\n", "square_foo_overlap_prefix" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>y</th><th>z</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>bar-b</th><td>0.4</td><td>100</td></tr>\n", "    <tr><th>bar-c</th><td>0.1</td><td>200</td></tr>\n", "    <tr><th>bar-d</th><td>0.9</td><td>300</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " y z\n", "bar-b 0.4 100\n", "bar-c 0.1 200\n", "bar-d 0.9 300" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "square_bar_prefix = square_bar.set_index(\"bar-\" + square_bar.index.astype(str))\n", "square_bar_prefix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple edge types: type column\n", "\n", "Graphs with multiple edge types can be simpler. Since there are often no features on the edges, we can pass a DataFrame with an additional column for the type, specifying it via the `edge_type_column` parameter. If there are features on the edges, multiple edge types can also be created in the same way as multiple node types, by passing a dictionary of DataFrames.\n", "\n", "For example, suppose the edges in our square graph have types based on their orientation." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>source</th><th>target</th><th>orientation</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>a</td><td>b</td><td>horizontal</td></tr>\n", "    <tr><th>1</th><td>b</td><td>c</td><td>vertical</td></tr>\n", "    <tr><th>2</th><td>c</td><td>d</td><td>horizontal</td></tr>\n", "    <tr><th>3</th><td>d</td><td>a</td><td>vertical</td></tr>\n", "    <tr><th>4</th><td>a</td><td>c</td><td>diagonal</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " source target orientation\n", "0 a b horizontal\n", "1 b c vertical\n", "2 c d horizontal\n", "3 d a vertical\n", "4 a c diagonal" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "square_edges_types = square_edges.assign(\n", " orientation=[\"horizontal\", \"vertical\", \"horizontal\", \"vertical\", \"diagonal\"]\n", ")\n", "square_edges_types" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 4, Edges: 5\n", "\n", " Node types:\n", " default: [4]\n", " Features: none\n", " Edge types: default-diagonal->default, default-horizontal->default, default-vertical->default\n", "\n", " Edge types:\n", " default-vertical->default: [2]\n", " Weights: all 1 (default)\n", " Features: none\n", " default-horizontal->default: [2]\n", " Weights: all 1 (default)\n", " Features: none\n", " default-diagonal->default: [1]\n", " Weights: all 1 (default)\n", " Features: none\n" ] } ], "source": [ "square_orientation = StellarGraph(\n", " edges=square_edges_types, edge_type_column=\"orientation\"\n", ")\n", "print(square_orientation.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Edge weights are supported, in the same way as a homogeneous graph above, with a `weight` column:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>source</th><th>target</th><th>orientation</th><th>weight</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>a</td><td>b</td><td>horizontal</td><td>1.00</td></tr>\n", "    <tr><th>1</th><td>b</td><td>c</td><td>vertical</td><td>0.20</td></tr>\n", "    <tr><th>2</th><td>c</td><td>d</td><td>horizontal</td><td>3.40</td></tr>\n", "    <tr><th>3</th><td>d</td><td>a</td><td>vertical</td><td>5.67</td></tr>\n", "    <tr><th>4</th><td>a</td><td>c</td><td>diagonal</td><td>1.00</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " source target orientation weight\n", "0 a b horizontal 1.00\n", "1 b c vertical 0.20\n", "2 c d horizontal 3.40\n", "3 d a vertical 5.67\n", "4 a c diagonal 1.00" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "square_edges_types_weighted = square_edges_types.assign(weight=[1.0, 0.2, 3.4, 5.67, 1.0])\n", "square_edges_types_weighted" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 4, Edges: 5\n", "\n", " Node types:\n", " default: [4]\n", " Features: none\n", " Edge types: default-diagonal->default, default-horizontal->default, default-vertical->default\n", "\n", " Edge types:\n", " default-vertical->default: [2]\n", " Weights: range=[0.2, 5.67], mean=2.935, std=3.86787\n", " Features: none\n", " default-horizontal->default: [2]\n", " Weights: range=[1, 3.4], mean=2.2, std=1.69706\n", " Features: none\n", " default-diagonal->default: [1]\n", " Weights: all 1 (default)\n", " Features: none\n" ] } ], "source": [ "square_orientation_weighted = StellarGraph(\n", " edges=square_edges_types_weighted, edge_type_column=\"orientation\"\n", ")\n", "print(square_orientation_weighted.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple edge types: edge features\n", "\n", "As mentioned above, if there are multiple edge types and the edges have edge features, one will typically need to pass a dictionary of DataFrames similar to multiple node types. The features of each type can be different.\n", "\n", "Note: Edges also have IDs (the DataFrame index, like nodes), and they need to be unique across all edge types." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " source target A\n", "0 a b -1\n", "2 c d -3" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "square_edges_horizontal = pd.DataFrame(\n", " {\"source\": [\"a\", \"c\"], \"target\": [\"b\", \"d\"], \"A\": [-1, -3]}, index=[0, 2]\n", ")\n", "square_edges_vertical = pd.DataFrame(\n", " {\"source\": [\"b\", \"d\"], \"target\": [\"c\", \"a\"], \"B\": [0.1, 0], \"C\": [34, 78]},\n", " index=[1, 3],\n", ")\n", "square_edges_diagonal = pd.DataFrame({\"source\": [\"a\"], \"target\": [\"c\"]}, index=[4])\n", "\n", "# example:\n", "square_edges_horizontal" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 4, Edges: 5\n", "\n", " Node types:\n", " default: [4]\n", " Features: none\n", " Edge types: default-diagonal->default, default-horizontal->default, default-vertical->default\n", "\n", " Edge types:\n", " default-vertical->default: [2]\n", " Weights: all 1 (default)\n", " Features: float32 vector, length 2\n", " default-horizontal->default: [2]\n", " Weights: all 1 (default)\n", " Features: float32 vector, length 1\n", " default-diagonal->default: [1]\n", " Weights: all 1 (default)\n", " Features: none\n" ] } ], "source": [ "square_orientation_separate = StellarGraph(\n", " edges={\n", " \"horizontal\": square_edges_horizontal,\n", " \"vertical\": square_edges_vertical,\n", " \"diagonal\": square_edges_diagonal,\n", " },\n", ")\n", "print(square_orientation_separate.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that `vertical` edges have 2 features, `horizontal` have 1, and `diagonal` have 0.\n", "\n", "Edge weights can be specified with this multiple-DataFrames form too. Any or all of the DataFrames for an edge type can contain a `weight` column." 
] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " source target A weight\n", "0 a b -1 12.3\n", "2 c d -3 45.6" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "square_edges_horizontal_weighted = square_edges_horizontal.assign(weight=[12.3, 45.6])\n", "square_edges_horizontal_weighted" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 4, Edges: 5\n", "\n", " Node types:\n", " default: [4]\n", " Features: none\n", " Edge types: default-diagonal->default, default-horizontal->default, default-vertical->default\n", "\n", " Edge types:\n", " default-vertical->default: [2]\n", " Weights: all 1 (default)\n", " Features: float32 vector, length 2\n", " default-horizontal->default: [2]\n", " Weights: range=[12.3, 45.6], mean=28.95, std=23.5467\n", " Features: float32 vector, length 1\n", " default-diagonal->default: [1]\n", " Weights: all 1 (default)\n", " Features: none\n" ] } ], "source": [ "square_orientation_separate_weighted = StellarGraph(\n", " edges={\n", " \"horizontal\": square_edges_horizontal_weighted,\n", " \"vertical\": square_edges_vertical,\n", " \"diagonal\": square_edges_diagonal,\n", " },\n", ")\n", "print(square_orientation_separate_weighted.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple everything\n", "\n", "A graph can have multiple node types and multiple edge types, with features or without, with edge weights or without and with `edge_type_column=...` (shown here) or with multiple DataFrames for edge types. We can put everything together from the previous sections to make a single complicated `StellarGraph`." 
] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 4, Edges: 5\n", "\n", " Node types:\n", " bar: [3]\n", " Features: float32 vector, length 2\n", " Edge types: bar-diagonal->foo, bar-horizontal->bar, bar-horizontal->foo, bar-vertical->bar, bar-vertical->foo\n", " foo: [1]\n", " Features: none\n", " Edge types: foo-diagonal->bar, foo-horizontal->bar, foo-vertical->bar\n", "\n", " Edge types:\n", " foo-horizontal->bar: [1]\n", " Weights: all 1 (default)\n", " Features: none\n", " foo-diagonal->bar: [1]\n", " Weights: all 1 (default)\n", " Features: none\n", " bar-vertical->foo: [1]\n", " Weights: all 5.67\n", " Features: none\n", " bar-vertical->bar: [1]\n", " Weights: all 0.2\n", " Features: none\n", " bar-horizontal->bar: [1]\n", " Weights: all 3.4\n", " Features: none\n" ] } ], "source": [ "square_everything = StellarGraph(\n", " {\"foo\": square_foo, \"bar\": square_bar},\n", " square_edges_types_weighted,\n", " edge_type_column=\"orientation\",\n", ")\n", "print(square_everything.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Directed heterogeneous graphs\n", "\n", "A heterogeneous graph can be directed by using `StellarDiGraph` to construct it, similar to a homogeneous graph." 
] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarDiGraph: Directed multigraph\n", " Nodes: 4, Edges: 5\n", "\n", " Node types:\n", " bar: [3]\n", " Features: float32 vector, length 2\n", " Edge types: bar-horizontal->bar, bar-vertical->bar, bar-vertical->foo\n", " foo: [1]\n", " Features: none\n", " Edge types: foo-diagonal->bar, foo-horizontal->bar\n", "\n", " Edge types:\n", " foo-horizontal->bar: [1]\n", " Weights: all 1 (default)\n", " Features: none\n", " foo-diagonal->bar: [1]\n", " Weights: all 1 (default)\n", " Features: none\n", " bar-vertical->foo: [1]\n", " Weights: all 5.67\n", " Features: none\n", " bar-vertical->bar: [1]\n", " Weights: all 0.2\n", " Features: none\n", " bar-horizontal->bar: [1]\n", " Weights: all 3.4\n", " Features: none\n" ] } ], "source": [ "from stellargraph import StellarDiGraph\n", "\n", "square_everything_directed = StellarDiGraph(\n", " {\"foo\": square_foo, \"bar\": square_bar},\n", " square_edges_types_weighted,\n", " edge_type_column=\"orientation\",\n", ")\n", "print(square_everything_directed.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Real data: Homogeneous graph from CSV files\n", "\n", "We've been using a synthetic square graph with perfectly formatted data as an example for this whole notebook, because it helps us focus on just the core `StellarGraph` functionality. Real life isn't so simple; there are usually files to wrangle and formats to convert, so we'll finish this demo by covering some example steps to go from data in files to a `StellarGraph`.\n", "\n", "We'll work with the Cora dataset from :\n", "\n", "> The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. 
The dictionary consists of 1433 unique words. The README file in the dataset provides more details. \n", "\n", "The dataset contains two files: `cora.cites` and `cora.content`.\n", "\n", "`cora.cites` is a tab-separated values (TSV) file of the graph edges. The first column identifies the cited paper, and the second column identifies the paper that cites it. The first three lines of the file look like:\n", "\n", "```\n", "35\t1033\n", "35\t103482\n", "35\t103515\n", "...\n", "```\n", "\n", "`cora.content` is also a TSV file of information about each node (paper), with 1435 columns: the first column is the node ID (matching the IDs used in `cora.cites`), the next 1433 are the 0/1-values of word vectors, and the last is the subject area class of the paper. The first three lines of the file look like this (with 1423 of the 0/1 columns truncated):\n", "\n", "```\n", "31336\t0\t0\t...\t0\t1\t0\t0\t0\t0\t0\t0\tNeural_Networks\n", "1061127\t0\t0\t...\t1\t0\t0\t0\t0\t0\t0\t0\tRule_Learning\n", "1106406\t0\t0\t...\t0\t0\t0\t0\t0\t0\t0\t0\tReinforcement_Learning\n", "...\n", "```\n", "\n", "This graph is homogeneous (all nodes are papers, and all edges are citations), with node features (the 0/1-values) but no edge weights.\n", "\n", "The StellarGraph library provides the `datasets` module ([docs](https://stellargraph.readthedocs.io/en/stable/api.html#module-stellargraph.datasets)) for working with some common datasets via classes like `Cora` ([docs](https://stellargraph.readthedocs.io/en/stable/api.html#stellargraph.datasets.Cora)). It can download the necessary files via the `download` method. 
(The `load` method also converts it into a `StellarGraph`, but that's too helpful for this tutorial: we're learning how to do that ourselves.)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "from stellargraph.datasets import Cora\n", "import os\n", "\n", "cora = Cora()\n", "cora.download()\n", "\n", "# the base_directory property tells us where it was downloaded to:\n", "cora_cites_file = os.path.join(cora.base_directory, \"cora.cites\")\n", "cora_content_file = os.path.join(cora.base_directory, \"cora.content\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've now got the files on disk, so we can read them using the `pd.read_csv` function. Despite the \"CSV\" in the name, this function can be used to read TSV files too. The files don't have a row of column headings, so we'll want to set our own.\n", "\n", "First, the edges. We can use `source` and `target` as the column headings, to match `StellarGraph`'s defaults. However, the natural phrasing is \"paper X cites paper Y\", not \"paper Y is cited by paper X\", so we use the columns in reverse order to match." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " target source\n", "0 35 1033\n", "1 35 103482\n", "2 35 103515\n", "3 35 1050679\n", "4 35 1103960\n", "... ... ...\n", "5424 853116 19621\n", "5425 853116 853155\n", "5426 853118 1140289\n", "5427 853155 853118\n", "5428 954315 1155073\n", "\n", "[5429 rows x 2 columns]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cora_cites = pd.read_csv(\n", " cora_cites_file,\n", " sep=\"\\t\", # tab-separated\n", " header=None, # no heading row\n", " names=[\"target\", \"source\"], # set our own names for the columns\n", ")\n", "cora_cites" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, the nodes. Again, we have to choose the columns' names. The names of the 0/1-columns don't matter so much, but we can give the first column (of IDs) and the last one (of subjects) useful names." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id w0 w1 w2 w3 w4 w5 w6 w7 w8 ... w1424 w1425 w1426 \\\n", "0 31336 0 0 0 0 0 0 0 0 0 ... 0 0 1 \n", "1 1061127 0 0 0 0 0 0 0 0 0 ... 0 1 0 \n", "2 1106406 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "3 13195 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "4 37879 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "... ... .. .. .. .. .. .. .. .. .. ... ... ... ... \n", "2703 1128975 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "2704 1128977 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "2705 1128978 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "2706 117328 0 0 0 0 1 0 0 0 0 ... 0 0 0 \n", "2707 24043 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "\n", " w1427 w1428 w1429 w1430 w1431 w1432 subject \n", "0 0 0 0 0 0 0 Neural_Networks \n", "1 0 0 0 0 0 0 Rule_Learning \n", "2 0 0 0 0 0 0 Reinforcement_Learning \n", "3 0 0 0 0 0 0 Reinforcement_Learning \n", "4 0 0 0 0 0 0 Probabilistic_Methods \n", "... ... ... ... ... ... ... ... \n", "2703 0 0 0 0 0 0 Genetic_Algorithms \n", "2704 0 0 0 0 0 0 Genetic_Algorithms \n", "2705 0 0 0 0 0 0 Genetic_Algorithms \n", "2706 0 0 0 0 0 0 Case_Based \n", "2707 0 0 0 0 0 0 Neural_Networks \n", "\n", "[2708 rows x 1435 columns]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cora_feature_names = [f\"w{i}\" for i in range(1433)]\n", "\n", "cora_raw_content = pd.read_csv(\n", " cora_content_file,\n", " sep=\"\\t\", # tab-separated\n", " header=None, # no heading row\n", " names=[\"id\", *cora_feature_names, \"subject\"], # set our own names for the columns\n", ")\n", "cora_raw_content" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we saw above when adding node features, `StellarGraph` uses the index of the DataFrame as the connection between a node and a row of the DataFrame. Currently our dataframe just has a simple numeric range as the index, but it needs to be using the `id` column. 
Pandas offers [a few ways to control the indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#set-reset-index); in this case, we want to replace the current index by moving the `id` column to it, which is done most easily with `set_index`:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 ... w1424 w1425 w1426 \\\n", "id ... \n", "31336 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 \n", "1061127 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 \n", "1106406 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "13195 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "37879 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "... .. .. .. .. .. .. .. .. .. .. ... ... ... ... \n", "1128975 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "1128977 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "1128978 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "117328 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 \n", "24043 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "\n", " w1427 w1428 w1429 w1430 w1431 w1432 subject \n", "id \n", "31336 0 0 0 0 0 0 Neural_Networks \n", "1061127 0 0 0 0 0 0 Rule_Learning \n", "1106406 0 0 0 0 0 0 Reinforcement_Learning \n", "13195 0 0 0 0 0 0 Reinforcement_Learning \n", "37879 0 0 0 0 0 0 Probabilistic_Methods \n", "... ... ... ... ... ... ... ... \n", "1128975 0 0 0 0 0 0 Genetic_Algorithms \n", "1128977 0 0 0 0 0 0 Genetic_Algorithms \n", "1128978 0 0 0 0 0 0 Genetic_Algorithms \n", "117328 0 0 0 0 0 0 Case_Based \n", "24043 0 0 0 0 0 0 Neural_Networks \n", "\n", "[2708 rows x 1434 columns]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cora_content_str_subject = cora_raw_content.set_index(\"id\")\n", "cora_content_str_subject" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're almost ready to create the `StellarGraph`; we just have to do something about the non-numeric `subject` column. Many machine learning models only work on numeric features, requiring text and other data to be converted before the models can be applied; the models in StellarGraph are no different.\n", "\n", "There are two options, depending on the task:\n", "\n", "1. 
remove the `subject` column entirely: many uses of Cora are predicting the `subject` of a node, given all of the graph structure and other information, so including it as information in the graph is giving the answer directly\n", "2. 
convert it to numeric via [one-hot](https://en.wikipedia.org/wiki/One-hot) encoding, where we have 7 columns of 0 and 1, one for each subject value (similar to the 1433 other `w...` features)\n", "\n", "We'll look at both (feel free to skip ahead to 2).\n", "\n", "### 1. Removing columns\n", "\n", "Let's start with the first, removing the columns. The `drop` method ([docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)) lets us remove one or more columns." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 ... w1423 w1424 w1425 \\\n", "id ... \n", "31336 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "1061127 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 \n", "1106406 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "13195 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "37879 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "... .. .. .. .. .. .. .. .. .. .. ... ... ... ... \n", "1128975 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "1128977 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "1128978 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "117328 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 \n", "24043 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "\n", " w1426 w1427 w1428 w1429 w1430 w1431 w1432 \n", "id \n", "31336 1 0 0 0 0 0 0 \n", "1061127 0 0 0 0 0 0 0 \n", "1106406 0 0 0 0 0 0 0 \n", "13195 0 0 0 0 0 0 0 \n", "37879 0 0 0 0 0 0 0 \n", "... ... ... ... ... ... ... ... \n", "1128975 0 0 0 0 0 0 0 \n", "1128977 0 0 0 0 0 0 0 \n", "1128978 0 0 0 0 0 0 0 \n", "117328 0 0 0 0 0 0 0 \n", "24043 0 0 0 0 0 0 0 \n", "\n", "[2708 rows x 1433 columns]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cora_content_no_subject = cora_content_str_subject.drop(columns=\"subject\")\n", "cora_content_no_subject" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've got all the right node data, and the right edges, so now we can create a `StellarGraph` using the techniques we saw in the \"homogeneous graph with features\" section above." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 2708, Edges: 5429\n", "\n", " Node types:\n", " paper: [2708]\n", " Features: float32 vector, length 1433\n", " Edge types: paper-cites->paper\n", "\n", " Edge types:\n", " paper-cites->paper: [5429]\n", " Weights: all 1 (default)\n", " Features: none\n" ] } ], "source": [ "cora_no_subject = StellarGraph({\"paper\": cora_content_no_subject}, {\"cites\": cora_cites})\n", "print(cora_no_subject.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we're trying to predict the subject, we'll probably need to use the `subject` labels as ground-truth labels in a supervised or semi-supervised machine learning task. These labels can be extracted from the DataFrame and held separately, to be passed in as training, validation or test examples." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id\n", "31336 Neural_Networks\n", "1061127 Rule_Learning\n", "1106406 Reinforcement_Learning\n", "13195 Reinforcement_Learning\n", "37879 Probabilistic_Methods\n", " ... \n", "1128975 Genetic_Algorithms\n", "1128977 Genetic_Algorithms\n", "1128978 Genetic_Algorithms\n", "117328 Case_Based\n", "24043 Neural_Networks\n", "Name: subject, Length: 2708, dtype: object" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cora_subject = cora_content_str_subject[\"subject\"]\n", "cora_subject" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a normal Pandas Series, and so can be manipulated with any of the functions that support it. 
For example, if we wanted to train a machine learning algorithm using 25% of the nodes, we could use the `train_test_split` function ([docs](http://www.scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)) from [the scikit-learn library](https://scikit-learn.org/)." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id\n", "191222 Neural_Networks\n", "1109208 Genetic_Algorithms\n", "308003 Rule_Learning\n", "13205 Reinforcement_Learning\n", "3217 Theory\n", " ... \n", "642827 Probabilistic_Methods\n", "1126315 Neural_Networks\n", "1105718 Neural_Networks\n", "3084 Case_Based\n", "80491 Neural_Networks\n", "Name: subject, Length: 677, dtype: object" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import model_selection\n", "\n", "cora_train, cora_test = model_selection.train_test_split(\n", " cora_subject, train_size=0.25, random_state=123\n", ")\n", "cora_train" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id\n", "1103969 Probabilistic_Methods\n", "1119295 Rule_Learning\n", "1130567 Reinforcement_Learning\n", "59045 Theory\n", "1129494 Neural_Networks\n", " ... \n", "126867 Case_Based\n", "1105764 Reinforcement_Learning\n", "782486 Neural_Networks\n", "74821 Probabilistic_Methods\n", "41732 Reinforcement_Learning\n", "Name: subject, Length: 2031, dtype: object" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cora_test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset, with this preparation, is used in [a demo of the GCN algorithm for node classification](../node-classification/gcn-node-classification.ipynb). The task is to predict the subject of each node." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. 
One-hot encoding\n", "\n", "Now, let's look at the other approach: converting the subjects to numeric features. The `pd.get_dummies` function ([docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)) can do this for us, by adding extra columns (7, in this case), based on the unique values." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr><th></th><th>w0</th><th>w1</th><th>w2</th><th>w3</th><th>w4</th><th>w5</th><th>w6</th><th>w7</th><th>w8</th><th>w9</th><th>...</th><th>w1430</th><th>w1431</th><th>w1432</th><th>subject_Case_Based</th><th>subject_Genetic_Algorithms</th><th>subject_Neural_Networks</th><th>subject_Probabilistic_Methods</th><th>subject_Reinforcement_Learning</th><th>subject_Rule_Learning</th><th>subject_Theory</th></tr>\n", "<tr><th>id</th><th colspan=\"21\"></th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>31336</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "<tr><th>1061127</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n", "<tr><th>1106406</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n", "<tr><th>13195</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n", "<tr><th>37879</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>\n", 
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n", "<tr><th>1128975</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "<tr><th>1128977</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "<tr><th>1128978</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "<tr><th>117328</th><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "<tr><th>24043</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "</tbody>\n", "</table>\n", "<p>2708 rows × 1440 columns</p>\n", "</div>
" ], "text/plain": [ " w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 ... w1430 w1431 w1432 \\\n", "id ... \n", "31336 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "1061127 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "1106406 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "13195 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "37879 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "... .. .. .. .. .. .. .. .. .. .. ... ... ... ... \n", "1128975 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "1128977 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "1128978 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "117328 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 \n", "24043 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "\n", " subject_Case_Based subject_Genetic_Algorithms \\\n", "id \n", "31336 0 0 \n", "1061127 0 0 \n", "1106406 0 0 \n", "13195 0 0 \n", "37879 0 0 \n", "... ... ... \n", "1128975 0 1 \n", "1128977 0 1 \n", "1128978 0 1 \n", "117328 1 0 \n", "24043 0 0 \n", "\n", " subject_Neural_Networks subject_Probabilistic_Methods \\\n", "id \n", "31336 1 0 \n", "1061127 0 0 \n", "1106406 0 0 \n", "13195 0 0 \n", "37879 0 1 \n", "... ... ... \n", "1128975 0 0 \n", "1128977 0 0 \n", "1128978 0 0 \n", "117328 0 0 \n", "24043 1 0 \n", "\n", " subject_Reinforcement_Learning subject_Rule_Learning subject_Theory \n", "id \n", "31336 0 0 0 \n", "1061127 0 1 0 \n", "1106406 1 0 0 \n", "13195 1 0 0 \n", "37879 0 0 0 \n", "... ... ... ... \n", "1128975 0 0 0 \n", "1128977 0 0 0 \n", "1128978 0 0 0 \n", "117328 0 0 0 \n", "24043 0 0 0 \n", "\n", "[2708 rows x 1440 columns]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cora_content_one_hot_subject = pd.get_dummies(\n", " cora_content_str_subject, columns=[\"subject\"]\n", ")\n", "cora_content_one_hot_subject" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using this DataFrame, we can create a `StellarGraph` with 1440 features per node instead of 1433 like the previous section." 
] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StellarGraph: Undirected multigraph\n", " Nodes: 2708, Edges: 5429\n", "\n", " Node types:\n", " paper: [2708]\n", " Features: float32 vector, length 1440\n", " Edge types: paper-cites->paper\n", "\n", " Edge types:\n", " paper-cites->paper: [5429]\n", " Weights: all 1 (default)\n", " Features: none\n" ] } ], "source": [ "cora_one_hot_subject = StellarGraph(\n", " {\"paper\": cora_content_one_hot_subject}, {\"cites\": cora_cites}\n", ")\n", "print(cora_one_hot_subject.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "You hopefully now know more about building a `StellarGraph` in various configurations via Pandas DataFrames, including some feature preprocessing in the \"Real data: Homogeneous graph from CSV files\" section.\n", "\n", "Revisit this document as a reminder, or consult [the documentation](https://stellargraph.readthedocs.io/en/stable/api.html#stellargraph.StellarGraph) for the `StellarGraph` class.\n", "\n", "Once you've loaded your data, you can start doing machine learning: a good place to start is the [demo of the GCN algorithm on the Cora dataset for node classification](../node-classification/gcn-node-classification.ipynb). StellarGraph also includes [many demos of other algorithms, solving other tasks](../README.md)." ] }, { "cell_type": "markdown", "metadata": { "nbsphinx": "hidden", "tags": [ "CloudRunner" ] }, "source": [ "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }