
Getting a GNN (PyTorch Geometric) running on Google Colab

Posted at 2021-05-22 (last updated at 2021-05-22)

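#1. Introduction

Out of curiosity, I wanted to implement a GNN (Graph Neural Network) on Google Colab, but I struggled a bit with the installation, so I am writing it up here.
I also cover the steps after installation, through to implementing the GNN, so I hope it serves as a reference.
This article is organized as follows.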

2.1 A failed installation
2.2 Installing without errors
3. Implementing the GNN
4. Summary
References

Section 2 shares the errors I ran into while installing PyTorch Geometric, together with how I resolved them.

The reference code is also available below.
GitHub reference code

#2.1 A failed installation
The installation failed right at the start, which nearly discouraged me.
Here is the failing example first, so that you can avoid it.

import torch

print("PyTorch ==", torch.__version__)
# PyTorch == 1.8.1
print("CUDA available", torch.cuda.is_available())
# CUDA available True
print("CUDA ==", torch.version.cuda)
# CUDA == 10.1

!pip install torch-scatter==latest+cu101 torch-sparse==latest+cu101 -f https://s3.eu-central-1.amazonaws.com/pytorch-geometric.com/whl/torch-1.4.0.html
!pip install torch-geometric

With the code above, importing Data from torch_geometric raises the following error:

from torch_geometric.data import Data

# Output
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/torch_sparse/__init__.py in <module>()
     14         torch.ops.load_library(importlib.machinery.PathFinder().find_spec(
---> 15             library, [osp.dirname(__file__)]).origin)
     16 except OSError as e:

6 frames
OSError: /usr/local/lib/python3.7/dist-packages/torch_sparse/_version.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/torch_sparse/__init__.py in <module>()
     19     if major != t_major or (major == t_major and minor != t_minor):
     20         raise RuntimeError(
---> 21             f'Expected PyTorch version {t_major}.{t_minor} but found '
     22             f'version {major}.{minor}.')
     23     raise OSError(e)

RuntimeError: Expected PyTorch version 1.4 but found version 1.8.

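This is the kind of error you get, so be careful. The wheels were installed from the torch-1.4.0 index while the runtime has PyTorch 1.8.1, which is exactly what the RuntimeError reports. It may be that this only happens in my environment, though.
I resolved the error based on the references below.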

Error fix 1
Error fix 2
Other people were struggling with similar errors, so I hope this helps.
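The traceback shows that torch_sparse compares only the major and minor parts of the PyTorch version. As a minimal sketch of that comparison (the function names `major_minor` and `wheel_matches_torch` are hypothetical helpers for illustration, not part of torch_sparse), the check can be reproduced before installing anything:

```python
def major_minor(version: str) -> tuple:
    """Return (major, minor) from a version string such as '1.8.1+cu101'."""
    base = version.split("+")[0]        # drop the local build tag (e.g. +cu101)
    major, minor = base.split(".")[:2]  # ignore the patch number
    return int(major), int(minor)


def wheel_matches_torch(torch_version: str, wheel_built_for: str) -> bool:
    """True when the wheel's target PyTorch has the same major.minor version."""
    return major_minor(torch_version) == major_minor(wheel_built_for)


# The failing combination from this section: PyTorch 1.8.1 with torch-1.4.0 wheels.
print(wheel_matches_torch("1.8.1+cu101", "1.4.0"))  # False
# A wheel built for 1.8.0 passes the major.minor comparison.
print(wheel_matches_torch("1.8.1+cu101", "1.8.0"))  # True
```

Passing this comparison does not guarantee that the binary loads; it only rules out the mismatch reported in the RuntimeError above.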

What to do if you get a different error, such as the following:

import torch_sparse
# Detected that PyTorch and torch_scatter were compiled with different CUDA versions. PyTorch has CUDA version 10.1 and torch_scatter has CUDA version 11.0. Please reinstall the torch_scatter that matches your PyTorch install.

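This error says that PyTorch's CUDA version is 10.1 while torch_scatter's CUDA version is 11.0, so the two conflict.

In that case, I think it is best to remove the torch_scatter and torch_sparse packages once and then install them again.

To check the torch_scatter directory before deleting it: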

!pip show torch_scatter

#Name: torch-scatter
#Version: 2.0.6
#Summary: PyTorch Extension Library of Optimized Scatter Operations
#Home-page: https://github.com/rusty1s/pytorch_scatter
#Author: Matthias Fey
#Author-email: matthias.fey@tu-dortmund.de
#License: MIT
#Location: /usr/local/lib/python3.7/dist-packages
#Requires: 
#Required-by:

After confirming the Location like this, delete torch_scatter.

!rm -rf "/usr/local/lib/python3.7/dist-packages/torch_scatter"
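Deleting the package directory by hand leaves pip's metadata for the package behind, so pip may still consider it installed. As an alternative sketch (not the method used in this article), the removal can go through pip itself; `pip_uninstall_command` and `uninstall` are hypothetical helper names, and the command is only built here, not executed:

```python
import subprocess
import sys


def pip_uninstall_command(*packages):
    """Build the pip command that removes the given packages without prompting."""
    return [sys.executable, "-m", "pip", "uninstall", "-y", *packages]


def uninstall(*packages):
    """Run the uninstall command and return pip's exit code."""
    return subprocess.run(pip_uninstall_command(*packages), check=False).returncode


# Only show the command; call uninstall(...) to actually run it.
cmd = pip_uninstall_command("torch-scatter", "torch-sparse")
print(" ".join(cmd[1:]))  # -m pip uninstall -y torch-scatter torch-sparse
```

In a Colab cell, the equivalent one-liner would be `!pip uninstall -y torch-scatter torch-sparse`.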

The next section (2.2) describes how to install everything without errors so that the GNN can be implemented.

#2.2 Installing without errors

I changed the code as follows so that the error no longer occurs.
Embarrassingly, I worked around the problem by downgrading torch from 1.8.1 to 1.8.0.

!pip install torch==1.8.0

import torch

print("PyTorch ==", torch.__version__)
# PyTorch == 1.8.0
print("CUDA available", torch.cuda.is_available())
# CUDA available True
print("CUDA ==", torch.version.cuda)
# CUDA == 10.2

def format_pytorch_version(version):
  return version.split('+')[0]
  
TORCH_version = torch.__version__
TORCH = format_pytorch_version(TORCH_version)

def format_cuda_version(version):
  return 'cu' + version.replace('.', '')

CUDA_version = torch.version.cuda
CUDA = format_cuda_version(CUDA_version)

!pip install torch-scatter     -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
!pip install torch-sparse      -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
!pip install torch-geometric

That completes the installation!
The key point is to specify the PyTorch version and the CUDA version precisely.
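The two formatting helpers can be folded into one function that builds the wheel index URL directly. This is a sketch under assumptions: `wheel_index_url` is a hypothetical name, and the `+cpu` tag used when no CUDA build is present (where `torch.version.cuda` is `None`) is my assumption about how the index names its CPU pages.

```python
from typing import Optional


def wheel_index_url(torch_version: str, cuda_version: Optional[str]) -> str:
    """Build the -f URL for pip from torch.__version__ and torch.version.cuda."""
    base = torch_version.split("+")[0]  # '1.8.0+cu102' -> '1.8.0'
    # torch.version.cuda is None on CPU-only builds; '+cpu' is an assumed tag.
    tag = "cpu" if cuda_version is None else "cu" + cuda_version.replace(".", "")
    return f"https://pytorch-geometric.com/whl/torch-{base}+{tag}.html"


print(wheel_index_url("1.8.0+cu102", "10.2"))
# https://pytorch-geometric.com/whl/torch-1.8.0+cu102.html
```

Printing the URL before running pip makes it easy to see whether the PyTorch and CUDA versions in it match what the runtime reports.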

#3. Implementing the GNN

Section 2 covered the setup, up to installing the packages needed for a GNN, so from here I describe the GNN implementation itself.

Import

import networkx as nx
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline 
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('svg', 'pdf') # For export
from matplotlib.colors import to_rgb
plt.rcParams['lines.linewidth'] = 2.0
import seaborn as sns
sns.reset_orig()
sns.set()

## Progress bar
from tqdm.notebook import tqdm

## PyTorch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data
import torch.optim as optim
# # Torchvision
import torchvision
from torchvision.datasets import CIFAR10
from torchvision import transforms

# Ensure that all operations are deterministic on GPU (if used) for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
print(device)
# cuda:0

Load Dataset

This time I use the KarateClub dataset.

from torch_geometric.datasets import KarateClub

dataset = KarateClub()
data = dataset[0]  # Get the first graph object.

print(f'Dataset: {dataset}:')
print('======================')
print(data)
print('==============================================================')

# Gather some statistics about the graph.
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')
print(f'Number of edges: {data.num_edges}')
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')
print(f'Number of training nodes: {data.train_mask.sum()}')
print(f'Training node label rate: {int(data.train_mask.sum()) / data.num_nodes:.2f}')
print(f'Contains isolated nodes: {data.contains_isolated_nodes()}')
print(f'Contains self-loops: {data.contains_self_loops()}')
print(f'Is undirected: {data.is_undirected()}')


#Dataset: KarateClub():
#======================
#Data(edge_index=[2, 156], train_mask=[34], x=[34, 34], y=[34])
#==============================================================
#Number of nodes: 34
#Number of features: 34
#Number of classes: 4
#Number of edges: 156
#Average node degree: 4.59
#Number of training nodes: 4
#Training node label rate: 0.12
#Contains isolated nodes: False
#Contains self-loops: False
#Is undirected: True

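Data(edge_index=[2, 156], train_mask=[34], x=[34, 34], y=[34])
→ This Data object has four attributes.

(1) The edge_index property holds the graph connectivity, i.e. a tuple of source and destination node indices for each edge.
(2) The node features are called x (each of the 34 nodes is assigned a 34-dimensional feature vector).
(3) The node labels are called y (each node is assigned to exactly one class).
(4) The train_mask attribute indicates the nodes for which the community assignment is already known.

Next, let's look at the contents of edge_index in a little more detail.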

from IPython.display import Javascript  # Restrict height of output cell.
display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 300})'''))

edge_index = data.edge_index
print(edge_index.t())

# The output is a tensor of shape [156, 2]

Printing edge_index helps to understand how PyG represents graph connectivity internally.
For each edge, edge_index holds a tuple of two node indices: the first value is the index of the edge's source node, and the second is the index of its destination node.
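To make the layout concrete, here is a framework-free sketch with a toy graph (a hypothetical 3-node path graph, not the KarateClub data). Plain Python lists stand in for the tensor; transposing with `zip` plays the role of `edge_index.t()`.

```python
# Toy undirected path graph 0 - 1 - 2 (hypothetical example data).
edge_index = [
    [0, 1, 1, 2],  # row 0: source node of each edge
    [1, 0, 2, 1],  # row 1: destination node of each edge
]

# Transpose to get one (source, destination) pair per edge.
pairs = list(zip(*edge_index))
print(pairs)  # [(0, 1), (1, 0), (1, 2), (2, 1)]

# An undirected edge is stored once per direction, so it appears twice.
undirected_edges = {tuple(sorted(p)) for p in pairs}
print(len(undirected_edges))  # 2

# Same statistic as "Average node degree" in the dataset summary.
num_nodes = 3
print(f"{len(pairs) / num_nodes:.2f}")  # 1.33
```

Read the same way, the 156 columns of the KarateClub edge_index correspond to 78 undirected edges, and 156 / 34 gives the average node degree of 4.59 printed earlier.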

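Next, let's use the networkx library, which provides graph manipulation functions, to visualize the graph.
PyTorch Geometric has no function for visualizing the graph structure, so networkx is used for the visualization.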

from torch_geometric.utils import to_networkx

G = to_networkx(data, to_undirected=True)
 
# calculate pagerank for visualize
pr = nx.pagerank(G)
pr_max = np.array(list(pr.values())).max()
 
# set node layout
draw_pos = nx.spring_layout(G, seed=0) 
 
# set color of node
cmap = plt.get_cmap('tab10')
labels = data.y.numpy()
colors = [cmap(l) for l in labels]

# visualize
plt.figure(figsize=(10, 10))
nx.draw_networkx_nodes(G, 
                       draw_pos,
                       node_size=[v / pr_max * 1000 for v in pr.values()],
                       node_color=colors, alpha=0.5)
nx.draw_networkx_edges(G, draw_pos, arrowstyle='-', alpha=0.2)
nx.draw_networkx_labels(G, draw_pos, font_size=10)
 
plt.title('KarateClub')
plt.show()

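The visualization result is shown in the figure below.

You can confirm that there are 34 nodes in total and 4 classes.
The edges also show which source node is connected to which destination node.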

Implementing Graph Neural Networks

This is where the GNN implementation begins.
Defining the network architecture as a torch.nn.Module class prepares us to create our first GNN.

from torch.nn import Linear
from torch_geometric.nn import GCNConv


class GCN(torch.nn.Module):
    def __init__(self):
        super(GCN, self).__init__()
        torch.manual_seed(12345)
        self.conv1 = GCNConv(dataset.num_features, 4)
        self.conv2 = GCNConv(4, 4)
        self.conv3 = GCNConv(4, 2)
        self.classifier = Linear(2, dataset.num_classes)

    def forward(self, x, edge_index):
        h = self.conv1(x, edge_index)
        h = h.tanh()
        h = self.conv2(h, edge_index)
        h = h.tanh()
        h = self.conv3(h, edge_index)
        h = h.tanh()  # Final GNN embedding space.
        
        # Apply a final (linear) classifier.
        out = self.classifier(h)

        return out, h

model = GCN()
print(model)

#GCN(
#  (conv1): GCNConv(34, 4)
#  (conv2): GCNConv(4, 4)
#  (conv3): GCNConv(4, 2)
#  (classifier): Linear(in_features=2, out_features=4, bias=True)
#)

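First, __init__ initializes all the building blocks, and forward defines the computation flow of the network.
The model stacks three graph convolution layers, which corresponds to aggregating information from each node's 3-hop neighborhood (all nodes up to 3 hops away).
In addition, the GCNConv layers reduce the node feature dimension to 2 (34 → 4 → 4 → 2).
Each GCNConv layer is followed by a tanh non-linearity.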
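To illustrate what one graph convolution computes, here is a dense, pure-Python sketch of the GCN propagation rule: add self-loops, normalize the adjacency matrix symmetrically by node degree, then multiply by the features and a weight matrix. It assumes an unweighted undirected graph and omits the bias; the names `matmul` and `gcn_layer` are hypothetical, and GCNConv itself works on sparse edge_index tensors rather than dense lists.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(adj, x, w):
    """Compute D^-1/2 (A + I) D^-1/2 X W for a dense adjacency matrix A."""
    n = len(adj)
    # Add self-loops so that each node also keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization by the degrees of both endpoints.
    a_norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
              for i in range(n)]
    return matmul(a_norm, matmul(x, w))


def tanh_all(h):
    """Apply tanh element-wise, as done after each layer in forward."""
    return [[math.tanh(v) for v in row] for row in h]


# Toy path graph 0 - 1 - 2 with one-hot features and an arbitrary 3 -> 2 weight.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
x = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w = [[0.5, -0.5], [0.25, 0.25], [-0.5, 0.5]]
h = tanh_all(gcn_layer(adj, x, w))
print(len(h), len(h[0]))  # 3 2
```

One call mixes each node only with its direct neighbors, so stacking three such layers is what lets information travel up to three hops.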

After that, a linear transformation (torch.nn.Linear) is applied, which acts as a classifier that maps each node to one of the 4 classes/communities.

Next, let's start training and see how the node embeddings change over time.

import time

def visualize(h, color, epoch=None, loss=None):
    plt.figure(figsize=(7,7))
    plt.xticks([])
    plt.yticks([])

    if torch.is_tensor(h):
        h = h.detach().cpu().numpy()
        plt.scatter(h[:, 0], h[:, 1], s=140, c=color, cmap="Set2")
        if epoch is not None and loss is not None:
            plt.xlabel(f'Epoch: {epoch}, Loss: {loss.item():.4f}', fontsize=16)
    else:
        nx.draw_networkx(G, pos=nx.spring_layout(G, seed=42), with_labels=False,
                         node_color=color, cmap="Set2")
    plt.show()


model = GCN()
criterion = nn.CrossEntropyLoss()  # Define loss criterion.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # Define optimizer.
history = {
        "epoch": [],
        "train_loss": [],
        "test_loss": [],
        "test_acc": []
    }

def train(data):
    optimizer.zero_grad()  # Clear gradients.
    out, h = model(data.x, data.edge_index)  # Perform a single forward pass.
    loss = criterion(out[data.train_mask], data.y[data.train_mask])  # Compute the loss solely based on the training nodes.
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss, h

for epoch in range(401):
    train_loss = 0.0
    loss, h = train(data)
    history["epoch"].append(epoch)
    history["train_loss"].append(loss.item())
    if epoch % 10 == 0:
        visualize(h, color=data.y, epoch=epoch, loss=loss)
        time.sleep(0.3)

# The output shows the loss and a plot of the node embeddings every 10 epochs

The figure at the final epoch, 400, looks as follows.

As you can see, the 3-layer GNN model classifies most of the nodes correctly.
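The training loop above only records the loss, so "most nodes" is judged from the plot. A minimal sketch of how accuracy could be measured is shown below, using plain lists and made-up logits instead of the model output; `argmax` and `accuracy` are hypothetical helpers, not functions from the notebook.

```python
def argmax(row):
    """Index of the largest value in a list of class scores."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, labels, mask=None):
    """Fraction of nodes whose highest-scoring class equals the label.

    If mask is given, only nodes where mask[i] is True are counted.
    """
    idx = [i for i in range(len(labels)) if mask is None or mask[i]]
    if not idx:
        raise ValueError("no nodes selected")
    correct = sum(1 for i in idx if argmax(logits[i]) == labels[i])
    return correct / len(idx)


# Made-up scores for 4 nodes and 4 classes (illustration only).
logits = [
    [2.0, 0.1, 0.0, 0.0],
    [0.0, 1.5, 0.2, 0.0],
    [0.3, 0.2, 0.1, 0.9],
    [1.0, 0.0, 0.0, 0.2],
]
labels = [0, 1, 3, 2]

print(accuracy(logits, labels))                                   # 0.75
print(accuracy(logits, labels, mask=[True, False, False, True]))  # 0.5
```

In the notebook itself, the same idea would presumably be written with tensors, for example by comparing `out.argmax(dim=1)` with `data.y`.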

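#4. Summary
This article described the steps up to implementing a GCN on Google Colab.
Since I hit many errors before getting to the implementation, I also shared what they were and how I dealt with them. I hope this helps someone.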

References
