{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Loading data into StellarGraph from Pandas\n",
    "\n",
    "> This demo explains how to load data into a form that can be used by the StellarGraph library. [See all other demos](../README.md).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbsphinx": "hidden",
    "tags": [
     "CloudRunner"
    ]
   },
   "source": [
    "<table><tr><td>Run the latest release of this notebook:</td><td><a href=\"https://mybinder.org/v2/gh/stellargraph/stellargraph/master?urlpath=lab/tree/demos/basics/loading-pandas.ipynb\" alt=\"Open In Binder\" target=\"_parent\"><img src=\"https://mybinder.org/badge_logo.svg\"/></a></td><td><a href=\"https://colab.research.google.com/github/stellargraph/stellargraph/blob/master/demos/basics/loading-pandas.ipynb\" alt=\"Open In Colab\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\"/></a></td></tr></table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[The StellarGraph library](https://github.com/stellargraph/stellargraph) supports loading graph information from Pandas. [Pandas](https://pandas.pydata.org) is a library for working with data frames.\n",
    "\n",
    "This is a great way to load data that offers a good balance between performance and convenience.\n",
    "\n",
    "The StellarGraph library supports many deep machine learning (ML) algorithms on [graphs](https://en.wikipedia.org/wiki/Graph_%28discrete_mathematics%29). A graph consists of a set of *nodes* connected by *edges*, potentially with information associated with each node and edge. Any task using the StellarGraph library needs data to be loaded into an instance of the `StellarGraph` class. This class stores the graph structure (the nodes and the edges between them), as well as information about them:\n",
    "\n",
    "- *node types* and *edge types*: a class or category to which the nodes and edges belong, dictating what features are available on a node, and potentially signifying some sort of semantic meaning (this is different to machine learning label for a node)\n",
    "- *node features* and *edge features*: vectors of numbers associated with each node or edge\n",
    "- *edge weights*: a number associated with each edge\n",
    "\n",
    "All of these are optional, because they have sensible defaults if they're not relevant to the task at hand.\n",
    "\n",
    "This notebook walks through loading several kinds of graphs using Pandas. Pandas is a reasonably efficient form of loading, that is convenient for preprocessing.\n",
    "\n",
    "- homogeneous graph without features (a homogeneous graph is one with only one type of node and one type of edge)\n",
    "- homogeneous graph with node/edge features\n",
    "- homogeneous graph with edge weights\n",
    "- directed graphs (a graph is directed if edges have a \"start\" and \"end\" nodes, instead of just connecting two nodes)\n",
    "- heterogeneous graphs (more than one node type and/or more than one edge type) with and without node/edge features or edge weights, this includes knowledge graphs\n",
    "- real data: homogeneous graph from CSV files (an example of reading data from files and doing some preprocessing)\n",
    "\n",
    "> StellarGraph supports loading data from many sources with all sorts of data preprocessing, via [Pandas](https://pandas.pydata.org) DataFrames, [NumPy](https://www.numpy.org) arrays, [Neo4j](https://neo4j.com) and [NetworkX](https://networkx.github.io) graphs. See [all loading demos](README.md) for more details.\n",
    "\n",
    "The [documentation](https://stellargraph.readthedocs.io/en/stable/api.html#stellargraph.StellarGraph) for the `StellarGraph` class includes a compressed reminder of everything discussed in this file, as well as explanations of all of the parameters.\n",
    "\n",
    "The `StellarGraph` class is available at the top level of the `stellargraph` library:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "nbsphinx": "hidden",
    "tags": [
     "CloudRunner"
    ]
   },
   "outputs": [],
   "source": [
    "# install StellarGraph if running on Google Colab\n",
    "import sys\n",
    "if 'google.colab' in sys.modules:\n",
    "  %pip install -q stellargraph[demos]==1.2.1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "nbsphinx": "hidden",
    "tags": [
     "VersionCheck"
    ]
   },
   "outputs": [],
   "source": [
    "# verify that we're using the correct version of StellarGraph for this notebook\n",
    "import stellargraph as sg\n",
    "\n",
    "try:\n",
    "    sg.utils.validate_notebook_version(\"1.2.1\")\n",
    "except AttributeError:\n",
    "    raise ValueError(\n",
    "        f\"This notebook requires StellarGraph version 1.2.1, but a different version {sg.__version__} is installed.  Please see <https://github.com/stellargraph/stellargraph/issues/1172>.\"\n",
    "    ) from None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from stellargraph import StellarGraph"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading from anything, via Pandas\n",
    "\n",
    "Pandas DataFrames are tables of data that can be created from [many input sources](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), such as [CSV files](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) and [SQL databases](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html). StellarGraph builds on this power by allowing construction from these DataFrames."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pandas is widely supported by other libraries and products, like [scikit-learn](http://scikit-learn.github.io/stable), and thus a user of StellarGraph gets to benefit from these easily too."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Homogeneous graph without features\n",
    "\n",
    "We'll start with a homogeneous graph without any node features. This means the graph consists of only nodes and edges without any information other than a unique identifier.\n",
    "\n",
    "The basic form of constructing a `StellarGraph` is passing in an edge `DataFrame` with two columns (`source` and `target`), where each row represents a pair of nodes that are connected. Let's construct a `StellarGraph` representing a square with a diagonal:\n",
    "\n",
    "```\n",
    "a -- b\n",
    "| \\  |\n",
    "|  \\ |\n",
    "d -- c\n",
    "```\n",
    "\n",
    "We'll start with a synthetic DataFrame defined in code here (there's some examples later of reading DataFrames from files).\n",
    "\n",
    "Each row represents a connection: for instance, the first one is the edge from `a` to `b`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source</th>\n",
       "      <th>target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b</td>\n",
       "      <td>c</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>c</td>\n",
       "      <td>d</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>d</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a</td>\n",
       "      <td>c</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  source target\n",
       "0      a      b\n",
       "1      b      c\n",
       "2      c      d\n",
       "3      d      a\n",
       "4      a      c"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "square_edges = pd.DataFrame(\n",
    "    {\"source\": [\"a\", \"b\", \"c\", \"d\", \"a\"], \"target\": [\"b\", \"c\", \"d\", \"a\", \"c\"]}\n",
    ")\n",
    "square_edges"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given our edges, we can create a `StellarGraph` directly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "square = StellarGraph(edges=square_edges)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `info` method ([docs](https://stellargraph.readthedocs.io/en/stable/api.html#stellargraph.StellarGraph.info)) gives a high-level summary of a `StellarGraph`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "StellarGraph: Undirected multigraph\n",
      " Nodes: 4, Edges: 5\n",
      "\n",
      " Node types:\n",
      "  default: [4]\n",
      "    Features: none\n",
      "    Edge types: default-default->default\n",
      "\n",
      " Edge types:\n",
      "    default-default->default: [5]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n"
     ]
    }
   ],
   "source": [
    "print(square.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On this square, it tells us that there's 4 nodes of type `default` (a homogeneous graph still has node and edge types, but they default to `default`), with no features, and one type of edge that touches it. It also tells us that there's 5 edges of type `default` that go between nodes of type `default`. This matches what we expect: it's a graph with 4 nodes and 5 edges and one type of each.\n",
    "\n",
    "The default node type and edge types can be set using the `node_type_default` and `edge_type_default` parameters to `StellarGraph(...)`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "StellarGraph: Undirected multigraph\n",
      " Nodes: 4, Edges: 5\n",
      "\n",
      " Node types:\n",
      "  corner: [4]\n",
      "    Features: none\n",
      "    Edge types: corner-line->corner\n",
      "\n",
      " Edge types:\n",
      "    corner-line->corner: [5]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n"
     ]
    }
   ],
   "source": [
    "square_named = StellarGraph(\n",
    "    edges=square_edges, node_type_default=\"corner\", edge_type_default=\"line\"\n",
    ")\n",
    "print(square_named.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The names of the columns used for the edges can be controlled with the `source_column` and `target_column` parameters to `StellarGraph(...)`. For instance, maybe our graph comes from a file with `first` and `second` columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first</th>\n",
       "      <th>second</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b</td>\n",
       "      <td>c</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>c</td>\n",
       "      <td>d</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>d</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a</td>\n",
       "      <td>c</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  first second\n",
       "0     a      b\n",
       "1     b      c\n",
       "2     c      d\n",
       "3     d      a\n",
       "4     a      c"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "square_edges_first_second = square_edges.rename(\n",
    "    columns={\"source\": \"first\", \"target\": \"second\"}\n",
    ")\n",
    "square_edges_first_second"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "StellarGraph: Undirected multigraph\n",
      " Nodes: 4, Edges: 5\n",
      "\n",
      " Node types:\n",
      "  default: [4]\n",
      "    Features: none\n",
      "    Edge types: default-default->default\n",
      "\n",
      " Edge types:\n",
      "    default-default->default: [5]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n"
     ]
    }
   ],
   "source": [
    "square_first_second = StellarGraph(\n",
    "    edges=square_edges_first_second, source_column=\"first\", target_column=\"second\"\n",
    ")\n",
    "print(square_first_second.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Homogeneous graph with features\n",
    "\n",
    "For many real-world problems, we have more than just graph structure: we have information about the nodes and edges. For instance, we might have a graph of academic papers (nodes) and how they cite each other (edges): we might have information about the nodes such as the authors and the publication year, and even the abstract or full paper contents. If we're doing a machine learning task, it can be useful to feed this information into models. The `StellarGraph` class supports this using a Pandas DataFrame: each row corresponds to a feature vector for a node or edge.\n",
    "\n",
    "### Node features\n",
    "\n",
    "Let's imagine the nodes have two features, which might be their coordinates, or maybe some other piece of information. We'll continue using synthetic DataFrames, but these could easily be read from a file. (There's an example in the \"Real data: Homogeneous graph from CSV files\" section at the end of this notebook.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>x</th>\n",
       "      <th>y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>1</td>\n",
       "      <td>-0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>2</td>\n",
       "      <td>0.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c</th>\n",
       "      <td>3</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>d</th>\n",
       "      <td>4</td>\n",
       "      <td>-0.5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   x    y\n",
       "a  1 -0.2\n",
       "b  2  0.3\n",
       "c  3  0.0\n",
       "d  4 -0.5"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "square_node_data = pd.DataFrame(\n",
    "    {\"x\": [1, 2, 3, 4], \"y\": [-0.2, 0.3, 0.0, -0.5]}, index=[\"a\", \"b\", \"c\", \"d\"]\n",
    ")\n",
    "square_node_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`StellarGraph` uses the index of the DataFrame as the connection between a node and a row of the DataFrame. Notice that the `square_features` DataFrame has `a`, ..., `d` as its index, matching the identifiers used in the edges.\n",
    "\n",
    "We've now got all the right node data, in addition to the edges from before, so now we can create a `StellarGraph`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "StellarGraph: Undirected multigraph\n",
      " Nodes: 4, Edges: 5\n",
      "\n",
      " Node types:\n",
      "  default: [4]\n",
      "    Features: float32 vector, length 2\n",
      "    Edge types: default-default->default\n",
      "\n",
      " Edge types:\n",
      "    default-default->default: [5]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n"
     ]
    }
   ],
   "source": [
    "square_node_features = StellarGraph(square_node_data, square_edges)\n",
    "print(square_node_features.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice the output of `info` now says that the nodes of the `default` type have 2 features.\n",
    "\n",
    "We can also give the node and edge types helpful names, using either the `node_type_default`/`edge_type_default` parameters we saw before, or by passing the DataFrames in with a dictionary, where the key is the name of the type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "StellarGraph: Undirected multigraph\n",
      " Nodes: 4, Edges: 5\n",
      "\n",
      " Node types:\n",
      "  corner: [4]\n",
      "    Features: float32 vector, length 2\n",
      "    Edge types: corner-line->corner\n",
      "\n",
      " Edge types:\n",
      "    corner-line->corner: [5]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n"
     ]
    }
   ],
   "source": [
    "square_named_node_features = StellarGraph(\n",
    "    {\"corner\": square_node_data}, {\"line\": square_edges}\n",
    ")\n",
    "print(square_named_node_features.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Edge features\n",
    "\n",
    "Edges can have features in the same way as nodes. Any columns that don't have a special meaning are taken as feature vector elements. This means that the source and target columns are not included in the feature vectors (nor are the weight or edge type columns, that are discussed later).\n",
    "\n",
    "Let's imagine the edges have 3 features each."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source</th>\n",
       "      <th>target</th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>b</td>\n",
       "      <td>-1</td>\n",
       "      <td>0.4</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b</td>\n",
       "      <td>c</td>\n",
       "      <td>2</td>\n",
       "      <td>0.1</td>\n",
       "      <td>34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>c</td>\n",
       "      <td>d</td>\n",
       "      <td>-3</td>\n",
       "      <td>0.9</td>\n",
       "      <td>56</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>d</td>\n",
       "      <td>a</td>\n",
       "      <td>4</td>\n",
       "      <td>0.0</td>\n",
       "      <td>78</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a</td>\n",
       "      <td>c</td>\n",
       "      <td>-5</td>\n",
       "      <td>0.9</td>\n",
       "      <td>90</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  source target  A    B   C\n",
       "0      a      b -1  0.4  12\n",
       "1      b      c  2  0.1  34\n",
       "2      c      d -3  0.9  56\n",
       "3      d      a  4  0.0  78\n",
       "4      a      c -5  0.9  90"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "square_edge_data = pd.DataFrame(\n",
    "    {\n",
    "        \"source\": [\"a\", \"b\", \"c\", \"d\", \"a\"],\n",
    "        \"target\": [\"b\", \"c\", \"d\", \"a\", \"c\"],\n",
    "        \"A\": [-1, 2, -3, 4, -5],\n",
    "        \"B\": [0.4, 0.1, 0.9, 0, 0.9],\n",
    "        \"C\": [12, 34, 56, 78, 90],\n",
    "    }\n",
    ")\n",
    "square_edge_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "StellarGraph: Undirected multigraph\n",
      " Nodes: 4, Edges: 5\n",
      "\n",
      " Node types:\n",
      "  corner: [4]\n",
      "    Features: float32 vector, length 2\n",
      "    Edge types: corner-line->corner\n",
      "\n",
      " Edge types:\n",
      "    corner-line->corner: [5]\n",
      "        Weights: all 1 (default)\n",
      "        Features: float32 vector, length 3\n"
     ]
    }
   ],
   "source": [
    "square_named_features = StellarGraph(\n",
    "    {\"corner\": square_node_data}, {\"line\": square_edge_data}\n",
    ")\n",
    "print(square_named_features.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice the output of `info` now says that the edges of the `line` type have 3 features, in addition to the 2 features for each node of type `corner`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Homogeneous graph with edge weights\n",
    "\n",
    "Some algorithms can understand edge weights, which can be used as a measure of the strength of the connection, or a measure of distance between nodes. A `StellarGraph` instance can have weighted edges, by including a `weight` column in the DataFrame of edges.\n",
    "\n",
    "We'll continue with the synthetic square example, by adding that extra `weight` column into the DataFrame. This column might be part of the data naturally, or it might need to be computed. Either of these is fine with Pandas: in the first case, it can be loaded at the same time as loading the source and target information, and in the second, the full power of Pandas is available to compute it (such as manipulating other information associated with the edge DataFrame, or even by comparing the nodes at each end)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source</th>\n",
       "      <th>target</th>\n",
       "      <th>weight</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>b</td>\n",
       "      <td>1.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b</td>\n",
       "      <td>c</td>\n",
       "      <td>0.20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>c</td>\n",
       "      <td>d</td>\n",
       "      <td>3.40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>d</td>\n",
       "      <td>a</td>\n",
       "      <td>5.67</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a</td>\n",
       "      <td>c</td>\n",
       "      <td>1.00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  source target  weight\n",
       "0      a      b    1.00\n",
       "1      b      c    0.20\n",
       "2      c      d    3.40\n",
       "3      d      a    5.67\n",
       "4      a      c    1.00"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "square_weighted_edges = pd.DataFrame(\n",
    "    {\n",
    "        \"source\": [\"a\", \"b\", \"c\", \"d\", \"a\"],\n",
    "        \"target\": [\"b\", \"c\", \"d\", \"a\", \"c\"],\n",
    "        \"weight\": [1.0, 0.2, 3.4, 5.67, 1.0],\n",
    "    }\n",
    ")\n",
    "square_weighted_edges"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "StellarGraph: Undirected multigraph\n",
      " Nodes: 4, Edges: 5\n",
      "\n",
      " Node types:\n",
      "  default: [4]\n",
      "    Features: none\n",
      "    Edge types: default-default->default\n",
      "\n",
      " Edge types:\n",
      "    default-default->default: [5]\n",
      "        Weights: range=[0.2, 5.67], mean=2.254, std=2.25534\n",
      "        Features: none\n"
     ]
    }
   ],
   "source": [
    "square_weighted = StellarGraph(edges=square_weighted_edges)\n",
    "print(square_weighted.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice the output of `info` now shows additional information about edge weights.\n",
    "\n",
    "Edges weights can be used with node and edge features; for instance, we create a similar graph to the last graph in the \"Homogeneous graph with features\" section that has our edge weights:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source</th>\n",
       "      <th>target</th>\n",
       "      <th>weight</th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>b</td>\n",
       "      <td>1.00</td>\n",
       "      <td>-1</td>\n",
       "      <td>0.4</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b</td>\n",
       "      <td>c</td>\n",
       "      <td>0.20</td>\n",
       "      <td>2</td>\n",
       "      <td>0.1</td>\n",
       "      <td>34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>c</td>\n",
       "      <td>d</td>\n",
       "      <td>3.40</td>\n",
       "      <td>-3</td>\n",
       "      <td>0.9</td>\n",
       "      <td>56</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>d</td>\n",
       "      <td>a</td>\n",
       "      <td>5.67</td>\n",
       "      <td>4</td>\n",
       "      <td>0.0</td>\n",
       "      <td>78</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a</td>\n",
       "      <td>c</td>\n",
       "      <td>1.00</td>\n",
       "      <td>-5</td>\n",
       "      <td>0.9</td>\n",
       "      <td>90</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  source target  weight  A    B   C\n",
       "0      a      b    1.00 -1  0.4  12\n",
       "1      b      c    0.20  2  0.1  34\n",
       "2      c      d    3.40 -3  0.9  56\n",
       "3      d      a    5.67  4  0.0  78\n",
       "4      a      c    1.00 -5  0.9  90"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "square_weighted_edge_data = pd.DataFrame(\n",
    "    {\n",
    "        \"source\": [\"a\", \"b\", \"c\", \"d\", \"a\"],\n",
    "        \"target\": [\"b\", \"c\", \"d\", \"a\", \"c\"],\n",
    "        \"weight\": [1.0, 0.2, 3.4, 5.67, 1.0],\n",
    "        \"A\": [-1, 2, -3, 4, -5],\n",
    "        \"B\": [0.4, 0.1, 0.9, 0, 0.9],\n",
    "        \"C\": [12, 34, 56, 78, 90],\n",
    "    }\n",
    ")\n",
    "square_weighted_edge_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "StellarGraph: Undirected multigraph\n",
      " Nodes: 4, Edges: 5\n",
      "\n",
      " Node types:\n",
      "  corner: [4]\n",
      "    Features: float32 vector, length 2\n",
      "    Edge types: corner-line->corner\n",
      "\n",
      " Edge types:\n",
      "    corner-line->corner: [5]\n",
      "        Weights: range=[0.2, 5.67], mean=2.254, std=2.25534\n",
      "        Features: float32 vector, length 3\n"
     ]
    }
   ],
   "source": [
    "square_features_weighted = StellarGraph(\n",
    "    {\"corner\": square_node_data}, {\"line\": square_weighted_edge_data}\n",
    ")\n",
    "print(square_features_weighted.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Directed graphs\n",
    "\n",
    "Some graphs have edge directions, where going from source to target has a different meaning to going from target to source.\n",
    "\n",
    "A directed graph can be created by using the `StellarDiGraph` class instead of the `StellarGraph` one. The construction is almost identical, and we can reuse any of the DataFrames that we created in the sections above. For instance, continuing from the previous cell, we can have a directed homogeneous graph with node features and edge weights."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "StellarDiGraph: Directed multigraph\n",
      " Nodes: 4, Edges: 5\n",
      "\n",
      " Node types:\n",
      "  corner: [4]\n",
      "    Features: float32 vector, length 2\n",
      "    Edge types: corner-line->corner\n",
      "\n",
      " Edge types:\n",
      "    corner-line->corner: [5]\n",
      "        Weights: range=[0.2, 5.67], mean=2.254, std=2.25534\n",
      "        Features: float32 vector, length 3\n"
     ]
    }
   ],
   "source": [
    "from stellargraph import StellarDiGraph\n",
    "\n",
    "square_features_weighted_directed = StellarDiGraph(\n",
    "    {\"corner\": square_node_data}, {\"line\": square_weighted_edge_data}\n",
    ")\n",
    "print(square_features_weighted_directed.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Everything discussed about `StellarGraph` in this file also works with `StellarDiGraph`, including parameters like `node_type_default` and `source_column`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Heterogeneous graphs\n",
    "\n",
    "Some graphs have multiple types of nodes and multiple types of edges.\n",
    "\n",
    "For example, an academic citation network that includes authors might have `wrote` edges connecting `author` nodes to `paper` nodes, in addition to the `cites` edges between `paper` nodes. There could be `supervised` edges between `author`s ([example](https://academictree.org)) too, or any number of additional node and edge types. A knowledge graph (aka RDF, triple stores or knowledge base) is an extreme form of an heterogeneous graph, with dozens, hundreds or even thousands of edge (or relation) types. Typically in a knowledge graph, edges and their types represent the information associated with a node, rather than node features.\n",
    "\n",
    "`StellarGraph` supports all forms of heterogeneous graphs.\n",
    "\n",
    "A heterogeneous `StellarGraph` can be constructed in a similar way to a homogeneous graph, except we pass a dictionary with multiple elements instead of a single element like we did for the Cora examples in the \"homogeneous graph with features\" section and others above. For a heterogeneous graph, a dictionary has to be passed; passing a single DataFrame does not work.\n",
    "\n",
    "Let's return to the square graph from earlier:\n",
    "\n",
    "```\n",
    "a -- b\n",
    "| \\  |\n",
    "|  \\ |\n",
    "d -- c\n",
    "```\n",
    "\n",
    "### Multiple node types\n",
    "\n",
    "Suppose `a` is of type `foo`, and no features, but `b`, `c` and `d` are of type `bar` and have two features each, e.g. for `b`, `y = 0.4, z = 100`. Since the features are different shapes (`a` has zero), they need to be modeled as different types, with separate `DataFrame`s."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: [a]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "square_foo = pd.DataFrame(index=[\"a\"])\n",
    "square_foo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>y</th>\n",
       "      <th>z</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>0.4</td>\n",
       "      <td>100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c</th>\n",
       "      <td>0.1</td>\n",
       "      <td>200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>d</th>\n",
       "      <td>0.9</td>\n",
       "      <td>300</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     y    z\n",
       "b  0.4  100\n",
       "c  0.1  200\n",
       "d  0.9  300"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "square_bar = pd.DataFrame(\n",
    "    {\"y\": [0.4, 0.1, 0.9], \"z\": [100, 200, 300]}, index=[\"b\", \"c\", \"d\"]\n",
    ")\n",
    "square_bar"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have the information for the two node types `foo` and `bar` in separate DataFrames, so we can now put them in a dictionary to create a `StellarGraph`. Notice that `info()` is now reporting multiple node types, as well as information specific to each."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "StellarGraph: Undirected multigraph\n",
      " Nodes: 4, Edges: 5\n",
      "\n",
      " Node types:\n",
      "  bar: [3]\n",
      "    Features: float32 vector, length 2\n",
      "    Edge types: bar-default->bar, bar-default->foo\n",
      "  foo: [1]\n",
      "    Features: none\n",
      "    Edge types: foo-default->bar\n",
      "\n",
      " Edge types:\n",
      "    foo-default->bar: [2]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n",
      "    bar-default->bar: [2]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n",
      "    bar-default->foo: [1]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n"
     ]
    }
   ],
   "source": [
    "square_foo_and_bar = StellarGraph({\"foo\": square_foo, \"bar\": square_bar}, square_edges)\n",
    "print(square_foo_and_bar.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Node IDs (the DataFrame index) needs to be unique across all types. For example, renaming the `a` corner to `b` like `square_foo_overlap` in the next cell, is not accepted and a `StellarGraph(...)` call will throw an error"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>x</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   x\n",
       "b -1"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "square_foo_overlap = pd.DataFrame({\"x\": [-1]}, index=[\"b\"])\n",
    "square_foo_overlap"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Uncomment to see the error\n",
    "# StellarGraph({\"foo\": square_foo_overlap, \"bar\": square_bar}, square_edges)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the node IDs aren't unique across types, one way to make them unique is to add a string prefix. You'll need to add the same prefix to the node IDs used in the edges too. Adding a prefix can be done by replacing the index:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>x</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>foo-b</th>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       x\n",
       "foo-b -1"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "square_foo_overlap_prefix = square_foo_overlap.set_index(\n",
    "    \"foo-\" + square_foo_overlap.index.astype(str)\n",
    ")\n",
    "square_foo_overlap_prefix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>y</th>\n",
       "      <th>z</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>bar-b</th>\n",
       "      <td>0.4</td>\n",
       "      <td>100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>bar-c</th>\n",
       "      <td>0.1</td>\n",
       "      <td>200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>bar-d</th>\n",
       "      <td>0.9</td>\n",
       "      <td>300</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         y    z\n",
       "bar-b  0.4  100\n",
       "bar-c  0.1  200\n",
       "bar-d  0.9  300"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "square_bar_prefix = square_bar.set_index(\"bar-\" + square_bar.index.astype(str))\n",
    "square_bar_prefix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multiple edge types: type column\n",
    "\n",
    "Graphs with multiple edge types can be simpler. Since there are often no features on the edges, we can pass a DataFrame with an additional column for the type, specifying it via the `edge_type_column` parameter. If there are features on the edges, multiple edge types can also be created in the same way as multiple node types, by passing with a dictionary of DataFrames.\n",
    "\n",
    "For example, suppose the edges in our square graph have types based on their orientation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source</th>\n",
       "      <th>target</th>\n",
       "      <th>orientation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>b</td>\n",
       "      <td>horizontal</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b</td>\n",
       "      <td>c</td>\n",
       "      <td>vertical</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>c</td>\n",
       "      <td>d</td>\n",
       "      <td>horizontal</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>d</td>\n",
       "      <td>a</td>\n",
       "      <td>vertical</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a</td>\n",
       "      <td>c</td>\n",
       "      <td>diagonal</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  source target orientation\n",
       "0      a      b  horizontal\n",
       "1      b      c    vertical\n",
       "2      c      d  horizontal\n",
       "3      d      a    vertical\n",
       "4      a      c    diagonal"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "square_edges_types = square_edges.assign(\n",
    "    orientation=[\"horizontal\", \"vertical\", \"horizontal\", \"vertical\", \"diagonal\"]\n",
    ")\n",
    "square_edges_types"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "StellarGraph: Undirected multigraph\n",
      " Nodes: 4, Edges: 5\n",
      "\n",
      " Node types:\n",
      "  default: [4]\n",
      "    Features: none\n",
      "    Edge types: default-diagonal->default, default-horizontal->default, default-vertical->default\n",
      "\n",
      " Edge types:\n",
      "    default-vertical->default: [2]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n",
      "    default-horizontal->default: [2]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n",
      "    default-diagonal->default: [1]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n"
     ]
    }
   ],
   "source": [
    "square_orientation = StellarGraph(\n",
    "    edges=square_edges_types, edge_type_column=\"orientation\"\n",
    ")\n",
    "print(square_orientation.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Edge weights are supported, in the same way as a homogeneous graph above, with a `weight` column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source</th>\n",
       "      <th>target</th>\n",
       "      <th>orientation</th>\n",
       "      <th>weight</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>b</td>\n",
       "      <td>horizontal</td>\n",
       "      <td>1.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b</td>\n",
       "      <td>c</td>\n",
       "      <td>vertical</td>\n",
       "      <td>0.20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>c</td>\n",
       "      <td>d</td>\n",
       "      <td>horizontal</td>\n",
       "      <td>3.40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>d</td>\n",
       "      <td>a</td>\n",
       "      <td>vertical</td>\n",
       "      <td>5.67</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a</td>\n",
       "      <td>c</td>\n",
       "      <td>diagonal</td>\n",
       "      <td>1.00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  source target orientation  weight\n",
       "0      a      b  horizontal    1.00\n",
       "1      b      c    vertical    0.20\n",
       "2      c      d  horizontal    3.40\n",
       "3      d      a    vertical    5.67\n",
       "4      a      c    diagonal    1.00"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "square_edges_types_weighted = square_edges_types.assign(weight=[1.0, 0.2, 3.4, 5.67, 1.0])\n",
    "square_edges_types_weighted"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "StellarGraph: Undirected multigraph\n",
      " Nodes: 4, Edges: 5\n",
      "\n",
      " Node types:\n",
      "  default: [4]\n",
      "    Features: none\n",
      "    Edge types: default-diagonal->default, default-horizontal->default, default-vertical->default\n",
      "\n",
      " Edge types:\n",
      "    default-vertical->default: [2]\n",
      "        Weights: range=[0.2, 5.67], mean=2.935, std=3.86787\n",
      "        Features: none\n",
      "    default-horizontal->default: [2]\n",
      "        Weights: range=[1, 3.4], mean=2.2, std=1.69706\n",
      "        Features: none\n",
      "    default-diagonal->default: [1]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n"
     ]
    }
   ],
   "source": [
    "square_orientation_weighted = StellarGraph(\n",
    "    edges=square_edges_types_weighted, edge_type_column=\"orientation\"\n",
    ")\n",
    "print(square_orientation_weighted.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multiple edge types: edge features\n",
    "\n",
    "As mentioned above, if there are multiple edge types and the edges have edge features, one will typically need to pass a dictionary of DataFrames similar to multiple node types. The features of each type can be different.\n",
    "\n",
    "Note: Edges also have IDs (the DataFrame index, like nodes), and they need to be unique across all edge types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source</th>\n",
       "      <th>target</th>\n",
       "      <th>A</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>b</td>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>c</td>\n",
       "      <td>d</td>\n",
       "      <td>-3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  source target  A\n",
       "0      a      b -1\n",
       "2      c      d -3"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "square_edges_horizontal = pd.DataFrame(\n",
    "    {\"source\": [\"a\", \"c\"], \"target\": [\"b\", \"d\"], \"A\": [-1, -3]}, index=[0, 2]\n",
    ")\n",
    "square_edges_vertical = pd.DataFrame(\n",
    "    {\"source\": [\"b\", \"d\"], \"target\": [\"c\", \"a\"], \"B\": [0.1, 0], \"C\": [34, 78]},\n",
    "    index=[1, 3],\n",
    ")\n",
    "square_edges_diagonal = pd.DataFrame({\"source\": [\"a\"], \"target\": [\"c\"]}, index=[4])\n",
    "\n",
    "# example:\n",
    "square_edges_horizontal"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "StellarGraph: Undirected multigraph\n",
      " Nodes: 4, Edges: 5\n",
      "\n",
      " Node types:\n",
      "  default: [4]\n",
      "    Features: none\n",
      "    Edge types: default-diagonal->default, default-horizontal->default, default-vertical->default\n",
      "\n",
      " Edge types:\n",
      "    default-vertical->default: [2]\n",
      "        Weights: all 1 (default)\n",
      "        Features: float32 vector, length 2\n",
      "    default-horizontal->default: [2]\n",
      "        Weights: all 1 (default)\n",
      "        Features: float32 vector, length 1\n",
      "    default-diagonal->default: [1]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n"
     ]
    }
   ],
   "source": [
    "square_orientation_separate = StellarGraph(\n",
    "    edges={\n",
    "        \"horizontal\": square_edges_horizontal,\n",
    "        \"vertical\": square_edges_vertical,\n",
    "        \"diagonal\": square_edges_diagonal,\n",
    "    },\n",
    ")\n",
    "print(square_orientation_separate.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that `vertical` edges have 2 features, `horizontal` have 1, and `diagonal` have 0.\n",
    "\n",
    "Edge weights can be specified with this multiple-DataFrames form too. Any or all of the DataFrames for an edge type can contain a `weight` column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source</th>\n",
       "      <th>target</th>\n",
       "      <th>A</th>\n",
       "      <th>weight</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>b</td>\n",
       "      <td>-1</td>\n",
       "      <td>12.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>c</td>\n",
       "      <td>d</td>\n",
       "      <td>-3</td>\n",
       "      <td>45.6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  source target  A  weight\n",
       "0      a      b -1    12.3\n",
       "2      c      d -3    45.6"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "square_edges_horizontal_weighted = square_edges_horizontal.assign(weight=[12.3, 45.6])\n",
    "square_edges_horizontal_weighted"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "StellarGraph: Undirected multigraph\n",
      " Nodes: 4, Edges: 5\n",
      "\n",
      " Node types:\n",
      "  default: [4]\n",
      "    Features: none\n",
      "    Edge types: default-diagonal->default, default-horizontal->default, default-vertical->default\n",
      "\n",
      " Edge types:\n",
      "    default-vertical->default: [2]\n",
      "        Weights: all 1 (default)\n",
      "        Features: float32 vector, length 2\n",
      "    default-horizontal->default: [2]\n",
      "        Weights: range=[12.3, 45.6], mean=28.95, std=23.5467\n",
      "        Features: float32 vector, length 1\n",
      "    default-diagonal->default: [1]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n"
     ]
    }
   ],
   "source": [
    "square_orientation_separate_weighted = StellarGraph(\n",
    "    edges={\n",
    "        \"horizontal\": square_edges_horizontal_weighted,\n",
    "        \"vertical\": square_edges_vertical,\n",
    "        \"diagonal\": square_edges_diagonal,\n",
    "    },\n",
    ")\n",
    "print(square_orientation_separate_weighted.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multiple everything\n",
    "\n",
    "A graph can have multiple node types and multiple edge types, with features or without, with edge weights or without and with `edge_type_column=...` (shown here) or with multiple DataFrames for edge types. We can put everything together from the previous sections to make a single complicated `StellarGraph`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "StellarGraph: Undirected multigraph\n",
      " Nodes: 4, Edges: 5\n",
      "\n",
      " Node types:\n",
      "  bar: [3]\n",
      "    Features: float32 vector, length 2\n",
      "    Edge types: bar-diagonal->foo, bar-horizontal->bar, bar-horizontal->foo, bar-vertical->bar, bar-vertical->foo\n",
      "  foo: [1]\n",
      "    Features: none\n",
      "    Edge types: foo-diagonal->bar, foo-horizontal->bar, foo-vertical->bar\n",
      "\n",
      " Edge types:\n",
      "    foo-horizontal->bar: [1]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n",
      "    foo-diagonal->bar: [1]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n",
      "    bar-vertical->foo: [1]\n",
      "        Weights: all 5.67\n",
      "        Features: none\n",
      "    bar-vertical->bar: [1]\n",
      "        Weights: all 0.2\n",
      "        Features: none\n",
      "    bar-horizontal->bar: [1]\n",
      "        Weights: all 3.4\n",
      "        Features: none\n"
     ]
    }
   ],
   "source": [
    "square_everything = StellarGraph(\n",
    "    {\"foo\": square_foo, \"bar\": square_bar},\n",
    "    square_edges_types_weighted,\n",
    "    edge_type_column=\"orientation\",\n",
    ")\n",
    "print(square_everything.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Directed heterogeneous graphs\n",
    "\n",
    "A heterogeneous graph can be directed by using `StellarDiGraph` to construct it, similar to a homogeneous graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "StellarDiGraph: Directed multigraph\n",
      " Nodes: 4, Edges: 5\n",
      "\n",
      " Node types:\n",
      "  bar: [3]\n",
      "    Features: float32 vector, length 2\n",
      "    Edge types: bar-horizontal->bar, bar-vertical->bar, bar-vertical->foo\n",
      "  foo: [1]\n",
      "    Features: none\n",
      "    Edge types: foo-diagonal->bar, foo-horizontal->bar\n",
      "\n",
      " Edge types:\n",
      "    foo-horizontal->bar: [1]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n",
      "    foo-diagonal->bar: [1]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n",
      "    bar-vertical->foo: [1]\n",
      "        Weights: all 5.67\n",
      "        Features: none\n",
      "    bar-vertical->bar: [1]\n",
      "        Weights: all 0.2\n",
      "        Features: none\n",
      "    bar-horizontal->bar: [1]\n",
      "        Weights: all 3.4\n",
      "        Features: none\n"
     ]
    }
   ],
   "source": [
    "from stellargraph import StellarDiGraph\n",
    "\n",
    "square_everything_directed = StellarDiGraph(\n",
    "    {\"foo\": square_foo, \"bar\": square_bar},\n",
    "    square_edges_types_weighted,\n",
    "    edge_type_column=\"orientation\",\n",
    ")\n",
    "print(square_everything_directed.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Real data: Homogeneous graph from CSV files\n",
    "\n",
    "We've been using a synthetic square graph with perfectly formatted data as an example for this whole notebook, because it helps us focus on just the core `StellarGraph` functionality. Real life isn't so simple; there's usually files to wrangle and formats to convert, so we'll finish this demo covering some example steps to go from data in files to a `StellarGraph`.\n",
    "\n",
    "We'll work with the Cora dataset from <https://linqs.soe.ucsc.edu/data>:\n",
    "\n",
    "> The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words. The README file in the dataset provides more details. \n",
    "\n",
    "The dataset contains two files: `cora.cites` and `cora.content`.\n",
    "\n",
    "`cora.cites` is a tab-separated values (TSV) file of the graph edges. The first column identifies the cited paper, and the second column identifies the paper that cites it. The first three lines of the file look like:\n",
    "\n",
    "```\n",
    "35\t1033\n",
    "35\t103482\n",
    "35\t103515\n",
    "...\n",
    "```\n",
    "\n",
    "`cora.content` is also a TSV file of information about each node (paper), with 1435 columns: the first column is the node ID (matching the IDs used in `cora.cites`), the next 1433 are the 0/1-values of word vectors, and the last is the subject area class of the paper. The first three lines of the file look like (with the 1423 of the 0/1 columns truncated)\n",
    "\n",
    "```\n",
    "31336\t0\t0\t...\t0\t1\t0\t0\t0\t0\t0\t0\tNeural_Networks\n",
    "1061127\t0\t0\t...\t1\t0\t0\t0\t0\t0\t0\t0\tRule_Learning\n",
    "1106406\t0\t0\t...\t0\t0\t0\t0\t0\t0\t0\t0\tReinforcement_Learning\n",
    "...\n",
    "```\n",
    "\n",
    "This graph is homogeneous (all nodes are papers, and all edges are citations), with node features (the 0/1-values) but no edge weights.\n",
    "\n",
    "The StellarGraph library provides the `datasets` module ([docs](https://stellargraph.readthedocs.io/en/stable/api.html#module-stellargraph.datasets)) for working with some common datasets via classes like `Cora` ([docs](https://stellargraph.readthedocs.io/en/stable/api.html#stellargraph.datasets.Cora)). It can download the necessary files via the `download` method. (The `load` method also converts it into a `StellarGraph`, but that's too helpful for this tutorial: we're learning how to do that ourselves.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "from stellargraph.datasets import Cora\n",
    "import os\n",
    "\n",
    "cora = Cora()\n",
    "cora.download()\n",
    "\n",
    "# the base_directory property tells us where it was downloaded to:\n",
    "cora_cites_file = os.path.join(cora.base_directory, \"cora.cites\")\n",
    "cora_content_file = os.path.join(cora.base_directory, \"cora.content\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We've now got the files on disk, so we can read them using the `pd.read_csv` function. Despite the \"CSV\" in the name, this function can be used to read TSV files too. The files don't have a row of column headings, so we'll want to set our own.\n",
    "\n",
    "First, the edges. We can use `source` and `target` as the column headings, to match `StellarGraph`'s defaults. However, the natural phrasing is \"paper X cites paper Y\", not \"paper Y is cited by paper X\", so we use the columns in reverse order to match."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>target</th>\n",
       "      <th>source</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>35</td>\n",
       "      <td>1033</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>35</td>\n",
       "      <td>103482</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>35</td>\n",
       "      <td>103515</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>35</td>\n",
       "      <td>1050679</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>35</td>\n",
       "      <td>1103960</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5424</th>\n",
       "      <td>853116</td>\n",
       "      <td>19621</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5425</th>\n",
       "      <td>853116</td>\n",
       "      <td>853155</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5426</th>\n",
       "      <td>853118</td>\n",
       "      <td>1140289</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5427</th>\n",
       "      <td>853155</td>\n",
       "      <td>853118</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5428</th>\n",
       "      <td>954315</td>\n",
       "      <td>1155073</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5429 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      target   source\n",
       "0         35     1033\n",
       "1         35   103482\n",
       "2         35   103515\n",
       "3         35  1050679\n",
       "4         35  1103960\n",
       "...      ...      ...\n",
       "5424  853116    19621\n",
       "5425  853116   853155\n",
       "5426  853118  1140289\n",
       "5427  853155   853118\n",
       "5428  954315  1155073\n",
       "\n",
       "[5429 rows x 2 columns]"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cora_cites = pd.read_csv(\n",
    "    cora_cites_file,\n",
    "    sep=\"\\t\",  # tab-separated\n",
    "    header=None,  # no heading row\n",
    "    names=[\"target\", \"source\"],  # set our own names for the columns\n",
    ")\n",
    "cora_cites"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, the nodes. Again, we have to choose the columns' names. The names of the 0/1-columns don't matter so much, but we can give the first column (of IDs) and the last one (of subjects) useful names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>w0</th>\n",
       "      <th>w1</th>\n",
       "      <th>w2</th>\n",
       "      <th>w3</th>\n",
       "      <th>w4</th>\n",
       "      <th>w5</th>\n",
       "      <th>w6</th>\n",
       "      <th>w7</th>\n",
       "      <th>w8</th>\n",
       "      <th>...</th>\n",
       "      <th>w1424</th>\n",
       "      <th>w1425</th>\n",
       "      <th>w1426</th>\n",
       "      <th>w1427</th>\n",
       "      <th>w1428</th>\n",
       "      <th>w1429</th>\n",
       "      <th>w1430</th>\n",
       "      <th>w1431</th>\n",
       "      <th>w1432</th>\n",
       "      <th>subject</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>31336</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Neural_Networks</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1061127</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Rule_Learning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1106406</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Reinforcement_Learning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>13195</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Reinforcement_Learning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>37879</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Probabilistic_Methods</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2703</th>\n",
       "      <td>1128975</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Genetic_Algorithms</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2704</th>\n",
       "      <td>1128977</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Genetic_Algorithms</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2705</th>\n",
       "      <td>1128978</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Genetic_Algorithms</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2706</th>\n",
       "      <td>117328</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Case_Based</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2707</th>\n",
       "      <td>24043</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Neural_Networks</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2708 rows × 1435 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "           id  w0  w1  w2  w3  w4  w5  w6  w7  w8  ...  w1424  w1425  w1426  \\\n",
       "0       31336   0   0   0   0   0   0   0   0   0  ...      0      0      1   \n",
       "1     1061127   0   0   0   0   0   0   0   0   0  ...      0      1      0   \n",
       "2     1106406   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "3       13195   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "4       37879   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "...       ...  ..  ..  ..  ..  ..  ..  ..  ..  ..  ...    ...    ...    ...   \n",
       "2703  1128975   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "2704  1128977   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "2705  1128978   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "2706   117328   0   0   0   0   1   0   0   0   0  ...      0      0      0   \n",
       "2707    24043   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "\n",
       "      w1427  w1428  w1429  w1430  w1431  w1432                 subject  \n",
       "0         0      0      0      0      0      0         Neural_Networks  \n",
       "1         0      0      0      0      0      0           Rule_Learning  \n",
       "2         0      0      0      0      0      0  Reinforcement_Learning  \n",
       "3         0      0      0      0      0      0  Reinforcement_Learning  \n",
       "4         0      0      0      0      0      0   Probabilistic_Methods  \n",
       "...     ...    ...    ...    ...    ...    ...                     ...  \n",
       "2703      0      0      0      0      0      0      Genetic_Algorithms  \n",
       "2704      0      0      0      0      0      0      Genetic_Algorithms  \n",
       "2705      0      0      0      0      0      0      Genetic_Algorithms  \n",
       "2706      0      0      0      0      0      0              Case_Based  \n",
       "2707      0      0      0      0      0      0         Neural_Networks  \n",
       "\n",
       "[2708 rows x 1435 columns]"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cora_feature_names = [f\"w{i}\" for i in range(1433)]\n",
    "\n",
    "cora_raw_content = pd.read_csv(\n",
    "    cora_content_file,\n",
    "    sep=\"\\t\",  # tab-separated\n",
    "    header=None,  # no heading row\n",
    "    names=[\"id\", *cora_feature_names, \"subject\"],  # set our own names for the columns\n",
    ")\n",
    "cora_raw_content"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we saw above when adding node features, `StellarGraph` uses the index of the DataFrame as the connection between a node and a row of the DataFrame. Currently our dataframe just has a simple numeric range as the index, but it needs to be using the `id` column. Pandas offers [a few ways to control the indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#set-reset-index); in this case, we want to replace the current index by moving the `id` column to it, which is done most easily with `set_index`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>w0</th>\n",
       "      <th>w1</th>\n",
       "      <th>w2</th>\n",
       "      <th>w3</th>\n",
       "      <th>w4</th>\n",
       "      <th>w5</th>\n",
       "      <th>w6</th>\n",
       "      <th>w7</th>\n",
       "      <th>w8</th>\n",
       "      <th>w9</th>\n",
       "      <th>...</th>\n",
       "      <th>w1424</th>\n",
       "      <th>w1425</th>\n",
       "      <th>w1426</th>\n",
       "      <th>w1427</th>\n",
       "      <th>w1428</th>\n",
       "      <th>w1429</th>\n",
       "      <th>w1430</th>\n",
       "      <th>w1431</th>\n",
       "      <th>w1432</th>\n",
       "      <th>subject</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>31336</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Neural_Networks</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1061127</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Rule_Learning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1106406</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Reinforcement_Learning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13195</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Reinforcement_Learning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37879</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Probabilistic_Methods</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1128975</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Genetic_Algorithms</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1128977</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Genetic_Algorithms</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1128978</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Genetic_Algorithms</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>117328</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Case_Based</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24043</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Neural_Networks</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2708 rows × 1434 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         w0  w1  w2  w3  w4  w5  w6  w7  w8  w9  ...  w1424  w1425  w1426  \\\n",
       "id                                               ...                        \n",
       "31336     0   0   0   0   0   0   0   0   0   0  ...      0      0      1   \n",
       "1061127   0   0   0   0   0   0   0   0   0   0  ...      0      1      0   \n",
       "1106406   0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "13195     0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "37879     0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "...      ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ...    ...    ...    ...   \n",
       "1128975   0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "1128977   0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "1128978   0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "117328    0   0   0   0   1   0   0   0   0   0  ...      0      0      0   \n",
       "24043     0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "\n",
       "         w1427  w1428  w1429  w1430  w1431  w1432                 subject  \n",
       "id                                                                         \n",
       "31336        0      0      0      0      0      0         Neural_Networks  \n",
       "1061127      0      0      0      0      0      0           Rule_Learning  \n",
       "1106406      0      0      0      0      0      0  Reinforcement_Learning  \n",
       "13195        0      0      0      0      0      0  Reinforcement_Learning  \n",
       "37879        0      0      0      0      0      0   Probabilistic_Methods  \n",
       "...        ...    ...    ...    ...    ...    ...                     ...  \n",
       "1128975      0      0      0      0      0      0      Genetic_Algorithms  \n",
       "1128977      0      0      0      0      0      0      Genetic_Algorithms  \n",
       "1128978      0      0      0      0      0      0      Genetic_Algorithms  \n",
       "117328       0      0      0      0      0      0              Case_Based  \n",
       "24043        0      0      0      0      0      0         Neural_Networks  \n",
       "\n",
       "[2708 rows x 1434 columns]"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cora_content_str_subject = cora_raw_content.set_index(\"id\")\n",
    "cora_content_str_subject"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're almost ready to create the `StellarGraph`, we just have to do something about the non-numeric `subject` column. Many machine learning models only work on numeric features, requiring text and other data to be converted before apply; the models in StellarGraph are no different.\n",
    "\n",
    "There are two options, depending on the task:\n",
    "\n",
    "1. remove the `subject` column entirely: many uses of Cora are predicting the `subject` of a node, given all of the graph structure and other information, so including it as information in the graph is giving the answer directly\n",
    "2. convert it to numeric via [one-hot](https://en.wikipedia.org/wiki/One-hot) encoding, where we have 7 columns of 0 and 1, one for each subject value (similar to the 1433 other `w...` features)\n",
    "\n",
    "We'll look at both (feel free to skip ahead to 2).\n",
    "\n",
    "### 1. Removing columns\n",
    "\n",
    "Let's start with the first, removing the columns. The `drop` method ([docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)) lets us remove one or more columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>w0</th>\n",
       "      <th>w1</th>\n",
       "      <th>w2</th>\n",
       "      <th>w3</th>\n",
       "      <th>w4</th>\n",
       "      <th>w5</th>\n",
       "      <th>w6</th>\n",
       "      <th>w7</th>\n",
       "      <th>w8</th>\n",
       "      <th>w9</th>\n",
       "      <th>...</th>\n",
       "      <th>w1423</th>\n",
       "      <th>w1424</th>\n",
       "      <th>w1425</th>\n",
       "      <th>w1426</th>\n",
       "      <th>w1427</th>\n",
       "      <th>w1428</th>\n",
       "      <th>w1429</th>\n",
       "      <th>w1430</th>\n",
       "      <th>w1431</th>\n",
       "      <th>w1432</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>31336</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1061127</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1106406</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13195</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37879</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1128975</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1128977</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1128978</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>117328</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24043</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2708 rows × 1433 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         w0  w1  w2  w3  w4  w5  w6  w7  w8  w9  ...  w1423  w1424  w1425  \\\n",
       "id                                               ...                        \n",
       "31336     0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "1061127   0   0   0   0   0   0   0   0   0   0  ...      0      0      1   \n",
       "1106406   0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "13195     0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "37879     0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "...      ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ...    ...    ...    ...   \n",
       "1128975   0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "1128977   0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "1128978   0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "117328    0   0   0   0   1   0   0   0   0   0  ...      1      0      0   \n",
       "24043     0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "\n",
       "         w1426  w1427  w1428  w1429  w1430  w1431  w1432  \n",
       "id                                                        \n",
       "31336        1      0      0      0      0      0      0  \n",
       "1061127      0      0      0      0      0      0      0  \n",
       "1106406      0      0      0      0      0      0      0  \n",
       "13195        0      0      0      0      0      0      0  \n",
       "37879        0      0      0      0      0      0      0  \n",
       "...        ...    ...    ...    ...    ...    ...    ...  \n",
       "1128975      0      0      0      0      0      0      0  \n",
       "1128977      0      0      0      0      0      0      0  \n",
       "1128978      0      0      0      0      0      0      0  \n",
       "117328       0      0      0      0      0      0      0  \n",
       "24043        0      0      0      0      0      0      0  \n",
       "\n",
       "[2708 rows x 1433 columns]"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cora_content_no_subject = cora_content_str_subject.drop(columns=\"subject\")\n",
    "cora_content_no_subject"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We've got all the right node data, and the right edges, so now we can create a `StellarGraph` using the techniques we saw in the \"homogeneous graph with features\" section above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "StellarGraph: Undirected multigraph\n",
      " Nodes: 2708, Edges: 5429\n",
      "\n",
      " Node types:\n",
      "  paper: [2708]\n",
      "    Features: float32 vector, length 1433\n",
      "    Edge types: paper-cites->paper\n",
      "\n",
      " Edge types:\n",
      "    paper-cites->paper: [5429]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n"
     ]
    }
   ],
   "source": [
    "cora_no_subject = StellarGraph({\"paper\": cora_content_no_subject}, {\"cites\": cora_cites})\n",
    "print(cora_no_subject.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we're trying to predict the subject, we'll probably need to use the `subject` labels as ground-truth labels in a supervised or semi-supervised machine learning task. This can be extracted from the DataFrame and held separately, to be passed in as training, validation or test examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "id\n",
       "31336             Neural_Networks\n",
       "1061127             Rule_Learning\n",
       "1106406    Reinforcement_Learning\n",
       "13195      Reinforcement_Learning\n",
       "37879       Probabilistic_Methods\n",
       "                    ...          \n",
       "1128975        Genetic_Algorithms\n",
       "1128977        Genetic_Algorithms\n",
       "1128978        Genetic_Algorithms\n",
       "117328                 Case_Based\n",
       "24043             Neural_Networks\n",
       "Name: subject, Length: 2708, dtype: object"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cora_subject = cora_content_str_subject[\"subject\"]\n",
    "cora_subject"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a normal Pandas Series, and so can be manipulated with any of the functions that support it. For example, if we wanted to train a machine learning algorithm using 25% of the nodes, we could use the `train_test_split` function ([docs](http://www.scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)) from [the scikit-learn library](https://scikit-learn.org/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "id\n",
       "191222            Neural_Networks\n",
       "1109208        Genetic_Algorithms\n",
       "308003              Rule_Learning\n",
       "13205      Reinforcement_Learning\n",
       "3217                       Theory\n",
       "                    ...          \n",
       "642827      Probabilistic_Methods\n",
       "1126315           Neural_Networks\n",
       "1105718           Neural_Networks\n",
       "3084                   Case_Based\n",
       "80491             Neural_Networks\n",
       "Name: subject, Length: 677, dtype: object"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import model_selection\n",
    "\n",
    "cora_train, cora_test = model_selection.train_test_split(\n",
    "    cora_subject, train_size=0.25, random_state=123\n",
    ")\n",
    "cora_train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "id\n",
       "1103969     Probabilistic_Methods\n",
       "1119295             Rule_Learning\n",
       "1130567    Reinforcement_Learning\n",
       "59045                      Theory\n",
       "1129494           Neural_Networks\n",
       "                    ...          \n",
       "126867                 Case_Based\n",
       "1105764    Reinforcement_Learning\n",
       "782486            Neural_Networks\n",
       "74821       Probabilistic_Methods\n",
       "41732      Reinforcement_Learning\n",
       "Name: subject, Length: 2031, dtype: object"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cora_test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This dataset, with this preparation, is used in [a demo of the GCN algorithm for node classification](../node-classification/gcn-node-classification.ipynb). The task is to predict the subject of each node."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. One-hot encoding\n",
    "\n",
    "Now, let's look at the other approach: converting the subjects to numeric features. The `pd.get_dummies` function ([docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)) can do this for us, by adding extra columns (7, in this case), based on the unique values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>w0</th>\n",
       "      <th>w1</th>\n",
       "      <th>w2</th>\n",
       "      <th>w3</th>\n",
       "      <th>w4</th>\n",
       "      <th>w5</th>\n",
       "      <th>w6</th>\n",
       "      <th>w7</th>\n",
       "      <th>w8</th>\n",
       "      <th>w9</th>\n",
       "      <th>...</th>\n",
       "      <th>w1430</th>\n",
       "      <th>w1431</th>\n",
       "      <th>w1432</th>\n",
       "      <th>subject_Case_Based</th>\n",
       "      <th>subject_Genetic_Algorithms</th>\n",
       "      <th>subject_Neural_Networks</th>\n",
       "      <th>subject_Probabilistic_Methods</th>\n",
       "      <th>subject_Reinforcement_Learning</th>\n",
       "      <th>subject_Rule_Learning</th>\n",
       "      <th>subject_Theory</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>31336</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1061127</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1106406</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13195</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37879</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1128975</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1128977</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1128978</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>117328</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24043</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2708 rows × 1440 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         w0  w1  w2  w3  w4  w5  w6  w7  w8  w9  ...  w1430  w1431  w1432  \\\n",
       "id                                               ...                        \n",
       "31336     0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "1061127   0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "1106406   0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "13195     0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "37879     0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "...      ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ...    ...    ...    ...   \n",
       "1128975   0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "1128977   0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "1128978   0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "117328    0   0   0   0   1   0   0   0   0   0  ...      0      0      0   \n",
       "24043     0   0   0   0   0   0   0   0   0   0  ...      0      0      0   \n",
       "\n",
       "         subject_Case_Based  subject_Genetic_Algorithms  \\\n",
       "id                                                        \n",
       "31336                     0                           0   \n",
       "1061127                   0                           0   \n",
       "1106406                   0                           0   \n",
       "13195                     0                           0   \n",
       "37879                     0                           0   \n",
       "...                     ...                         ...   \n",
       "1128975                   0                           1   \n",
       "1128977                   0                           1   \n",
       "1128978                   0                           1   \n",
       "117328                    1                           0   \n",
       "24043                     0                           0   \n",
       "\n",
       "         subject_Neural_Networks  subject_Probabilistic_Methods  \\\n",
       "id                                                                \n",
       "31336                          1                              0   \n",
       "1061127                        0                              0   \n",
       "1106406                        0                              0   \n",
       "13195                          0                              0   \n",
       "37879                          0                              1   \n",
       "...                          ...                            ...   \n",
       "1128975                        0                              0   \n",
       "1128977                        0                              0   \n",
       "1128978                        0                              0   \n",
       "117328                         0                              0   \n",
       "24043                          1                              0   \n",
       "\n",
       "         subject_Reinforcement_Learning  subject_Rule_Learning  subject_Theory  \n",
       "id                                                                              \n",
       "31336                                 0                      0               0  \n",
       "1061127                               0                      1               0  \n",
       "1106406                               1                      0               0  \n",
       "13195                                 1                      0               0  \n",
       "37879                                 0                      0               0  \n",
       "...                                 ...                    ...             ...  \n",
       "1128975                               0                      0               0  \n",
       "1128977                               0                      0               0  \n",
       "1128978                               0                      0               0  \n",
       "117328                                0                      0               0  \n",
       "24043                                 0                      0               0  \n",
       "\n",
       "[2708 rows x 1440 columns]"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cora_content_one_hot_subject = pd.get_dummies(\n",
    "    cora_content_str_subject, columns=[\"subject\"]\n",
    ")\n",
    "cora_content_one_hot_subject"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using this DataFrame, we can create a `StellarGraph` with 1440 features per node instead of 1433 like the previous section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "StellarGraph: Undirected multigraph\n",
      " Nodes: 2708, Edges: 5429\n",
      "\n",
      " Node types:\n",
      "  paper: [2708]\n",
      "    Features: float32 vector, length 1440\n",
      "    Edge types: paper-cites->paper\n",
      "\n",
      " Edge types:\n",
      "    paper-cites->paper: [5429]\n",
      "        Weights: all 1 (default)\n",
      "        Features: none\n"
     ]
    }
   ],
   "source": [
    "cora_one_hot_subject = StellarGraph(\n",
    "    {\"paper\": cora_content_one_hot_subject}, {\"cites\": cora_cites}\n",
    ")\n",
    "print(cora_one_hot_subject.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "You hopefully now know more about building a `StellarGraph` in various configurations via Pandas DataFrames, including some feature preprocessing in the \"Real data: Homogeneous graph from CSV files\" section.\n",
    "\n",
    "Revisit this document to use as a reminder, or [the documentation](https://stellargraph.readthedocs.io/en/stable/api.html#stellargraph.StellarGraph) for the `StellarGraph` class.\n",
    "\n",
    "Once you've loaded your data, you can start doing machine learning: a good place to start is the [demo of the GCN algorithm on the Cora dataset for node classification](../node-classification/gcn-node-classification.ipynb). Additionally, StellarGraph includes [many other demos of other algorithms, solving other tasks](../README.md)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbsphinx": "hidden",
    "tags": [
     "CloudRunner"
    ]
   },
   "source": [
    "<table><tr><td>Run the latest release of this notebook:</td><td><a href=\"https://mybinder.org/v2/gh/stellargraph/stellargraph/master?urlpath=lab/tree/demos/basics/loading-pandas.ipynb\" alt=\"Open In Binder\" target=\"_parent\"><img src=\"https://mybinder.org/badge_logo.svg\"/></a></td><td><a href=\"https://colab.research.google.com/github/stellargraph/stellargraph/blob/master/demos/basics/loading-pandas.ipynb\" alt=\"Open In Colab\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\"/></a></td></tr></table>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}