# torch geometric + skorch @ CORA dataset

In [1]:
!date

Thu Jan 5 13:34:48 UTC 2023


This notebook shows how to use skorch with [torch geometric](https://pytorch-geometric.readthedocs.io/). The code is based on the [introduction example](https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html), modified to use a proper train/valid/test split. The dataset is small enough to be trained efficiently without batching. Batching with skorch and torch geometric is not covered here because it is non-trivial and quite dataset-specific. If you need it and get stuck, feel free to open [an issue](https://github.com/skorch-dev/skorch/issues) so that we can support you as best we can.

Besides the base skorch installation, this notebook depends on `torch_geometric`; the Colab installation cell below additionally installs `torch-sparse` and `torch-scatter`.

It is recommended to install the dependencies [as documented by pytorch geometric](https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html).

---

In [2]:
import subprocess

# Installation on Google Colab
try:
    import google.colab
    import torch
    subprocess.run(['python', '-m', 'pip', 'install', 'skorch', 'torch_geometric'])
    subprocess.run(['python', '-m', 'pip', 'install', 'torch-sparse', '-f', f'https://data.pyg.org/whl/torch-{torch.__version__}.html'])
    subprocess.run(['python', '-m', 'pip', 'install', 'torch-scatter', '-f', f'https://data.pyg.org/whl/torch-{torch.__version__}.html'])
except ImportError:
    pass

In [3]:
import skorch
import torch

### Data Loading

In [4]:
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='/tmp/Cora', name='Cora')

Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.x
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.tx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.allx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.y
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ty
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ally
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.graph
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.test.index
Processing...
Done!


In [5]:
dataset.data, dataset.num_classes

(Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708]),
 7)

To use pytorch geometric and the Cora dataset with skorch, we need to address three things:

1. Graph convolutions cannot handle missing nodes, so splitting the node attributes while keeping `edge_index` intact leads to errors.
2. The Cora dataset stores its splits as separate mask attributes (`train_mask`, `val_mask`, `test_mask`).
3. skorch expects `(X, y)` pairs for classification tasks.

To deal with (1), we split the data into three datasets, creating three sub-graphs in the process. Each sub-graph is complete in itself, so it can be convolved over without errors.
We use the masks mentioned in (2) to identify the nodes and edges of each sub-graph.

For (3), we define our own `XYDataset`, which has length 1 and returns the whole graph together with the respective y values. In effect, we simulate a `batch_size=1` scenario.

In [6]:
from torch_geometric.data import Data

# Simulate batch_size=1 by returning the whole dataset and the
# y-values. This way, the data loader can iterate over the 'batches'
# and produce X/y values for us.
class XYDataset(torch.utils.data.Dataset):
    def __init__(self, data: Data, y: torch.Tensor):
        self.data = data
        self.y = y

    def __len__(self):
        # the whole graph is a single sample
        return 1

    def __getitem__(self, i):
        return self.data, self.y

### Data Splitting

Split the graph into train, validation and test sub-graphs.
Because each split has its own sub-graph, no information leaks between the
splits when we apply graph convolution operators.

We use `relabel_nodes=True` to make the node indices in the edge tensor
zero-based for each sub-graph. If we did not do this, the node subsets
(which are zero-based after applying the mask) would not match the indices
in the edge tensor.
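To make the relabelling concrete, here is a minimal pure-Python sketch of the idea on a toy graph. The helper `induced_subgraph` and the toy data are hypothetical and only illustrate the behaviour described above; this is not how torch geometric implements `subgraph`.

```python
def induced_subgraph(mask, edges, relabel_nodes=True):
    """Keep only edges whose two endpoints are both selected by `mask`.

    Hypothetical pure-Python helper for illustration; it is not the
    torch geometric implementation.
    """
    kept_nodes = [i for i, keep in enumerate(mask) if keep]
    # Map each kept original index to its position after masking (0, 1, 2, ...).
    new_index = {old: new for new, old in enumerate(kept_nodes)}
    kept_edges = [(s, t) for s, t in edges if s in new_index and t in new_index]
    if relabel_nodes:
        kept_edges = [(new_index[s], new_index[t]) for s, t in kept_edges]
    return kept_edges


# Toy graph with 5 nodes; nodes 1, 3 and 4 belong to the split.
mask = [False, True, False, True, True]
edges = [(0, 1), (1, 3), (3, 1), (3, 4), (2, 4)]
num_kept = sum(mask)  # the masked feature matrix has 3 rows: indices 0..2

original_ids = induced_subgraph(mask, edges, relabel_nodes=False)
relabelled = induced_subgraph(mask, edges, relabel_nodes=True)

# Without relabelling, some edges point past the last feature row.
out_of_range = [e for e in original_ids if max(e) >= num_kept]

print(original_ids)  # [(1, 3), (3, 1), (3, 4)]
print(out_of_range)  # [(1, 3), (3, 1), (3, 4)]
print(relabelled)    # [(0, 1), (1, 0), (1, 2)]
```

In this toy case every surviving edge refers to at least one index that no longer exists in the masked feature matrix, which is why the relabelled variant is needed.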

In [7]:
from torch_geometric.utils import subgraph

data = dataset[0]

edge_index_train, _ = subgraph(
 subset=data.train_mask, 
 edge_index=data.edge_index, 
 relabel_nodes=True
)
ds_train = XYDataset(
 Data(x=data.x[data.train_mask], edge_index=edge_index_train),
 data.y[data.train_mask],
)

edge_index_valid, _ = subgraph(
 subset=data.val_mask, 
 edge_index=data.edge_index, 
 relabel_nodes=True
)
ds_valid = XYDataset(
 Data(x=data.x[data.val_mask], edge_index=edge_index_valid),
 data.y[data.val_mask],
)

edge_index_test, _ = subgraph(
 subset=data.test_mask, 
 edge_index=data.edge_index, 
 relabel_nodes=True
)
ds_test = XYDataset(
 Data(x=data.x[data.test_mask], edge_index=edge_index_test),
 data.y[data.test_mask],
)

### Data Feeding

Our "batch" consists of the whole dataset, so if we unpack the
batch into `(X, y)` we get `X = Data(...)` and `y = [y_true]`.
The `DataLoader` does not modify `X`, but `y` gets a new batch dimension.
This leads to a shape mismatch, because `y.shape` is then `(1, num_samples)`. Therefore, we need our own loader that strips the first dimension so that
the predicted `y` and the labelled `y` match in length.

Note: Alternatively, this dimension could be stripped by overriding `get_loss` in the `NeuralNet` class. For brevity, we won't do this in this example. The approach outlined below also works with [one of the many `DataLoader` classes](https://pytorch-geometric.readthedocs.io/en/latest/modules/loader.html) provided by torch geometric: just base `RawLoader` on one of the other classes. Chances are, though, that if you do this you need to deal with batching anyway, which is not covered here because it is non-trivial.
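The extra batch dimension can be reproduced in isolation. The following minimal sketch uses the plain `torch.utils.data.DataLoader` and a hypothetical tensor-only stand-in for `XYDataset` (`WholeSetDataset`, with made-up toy data). Because `X` is a plain tensor here rather than a `Data` object, it also gains a leading dimension; only the behaviour of `y` is the point.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class WholeSetDataset(Dataset):
    """Hypothetical stand-in for XYDataset that holds plain tensors."""

    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return 1  # the whole dataset is one sample

    def __getitem__(self, i):
        return self.X, self.y


X = torch.zeros(5, 3)              # toy data: 5 nodes, 3 features
y = torch.tensor([0, 1, 2, 1, 0])  # one label per node

loader = DataLoader(WholeSetDataset(X, y), batch_size=1)
X_batch, y_batch = next(iter(loader))

print(y_batch.shape)     # torch.Size([1, 5]): collation added a batch dimension
print(y_batch[0].shape)  # torch.Size([5]): indexing with [0] strips it again
```

Indexing with `[0]` is the same operation that the custom loader applies to `y`.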

In [8]:
from torch_geometric.loader import DataLoader

class RawLoader(DataLoader):
    def __iter__(self):
        it = super().__iter__()
        for X, y in it:
            # strip the batch dimension that collation added to y
            yield X, y[0]

### Modelling

This is the CORA example module as seen in the [torch geometric introduction](https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html).

In [9]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_node_features, 16)
        self.conv2 = GCNConv(16, dataset.num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)

        # NeuralNetClassifier expects class probabilities by default
        return F.softmax(x, dim=1)

### Fitting

In [10]:
from skorch.helper import predefined_split

torch.manual_seed(42)

net = skorch.NeuralNetClassifier(
 module=GCN,
 lr=0.1,
 optimizer__weight_decay=5e-4,
 max_epochs=200,
 train_split=predefined_split(ds_valid),
 batch_size=1,
 iterator_train=RawLoader,
 iterator_valid=RawLoader,
)

In [11]:
net.fit(ds_train, None)

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        1.9724       0.1680        1.9398  0.0963
      2        1.9625       0.1740        1.9376  0.0091
      3        1.9327       0.1720        1.9342  0.0115
      4        1.9321       0.1760        1.9324  0.0096
      5        1.9142       0.1800        1.9307  0.0069
      6        1.8923       0.1800        1.9290  0.0090
      7        1.8848       0.1880        1.9269  0.0092
      8        1.8936       0.1920        1.9247  0.0152
      9        1.8783       0.1960        1.9219  0.0082
     10        1.8737       0.2040        1.9192  0.0082
     11        1.8542       0.2060        1.9176  0.0070
     12        1.8489       0.2100        1.9156  0.0078
     13        1.8314       0.2120        1.9121  0.0061
     14        1.8334       0.2220        1.9100  0.0076
     15        1.8041       0.2240        1.9085  0.0080
     16        1.8089       0.2220        1.9065  0.0101
     17        1.8082       0.2200

<class 'skorch.classifier.NeuralNetClassifier'>[initialized](
  module_=GCN(
    (conv1): GCNConv(1433, 16)
    (conv2): GCNConv(16, 7)
  ),
)

### Evaluation

In [12]:
from sklearn.metrics import accuracy_score

In [13]:
accuracy_score(ds_test.y, net.predict(ds_test))

0.682

In conclusion, this example showed how to use a basic graph dataset from pytorch geometric in conjunction with skorch. The final test accuracy is lower than the ~80% accuracy in the [introduction example](https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html), which can be explained by the reduced leakage between the train and validation sets, since we split the data into sub-graphs beforehand.

The model is now part of the sklearn world: as you have already seen, you can simply use sklearn metrics to evaluate it. Tools like grid search and random search are therefore available to you, and it is easy to include a graph neural net as a feature transformer in your next ML pipeline!