
11/26/23, 11:36 PM GNN HANDS 002 - Colaboratory

!pip install torch_geometric

Collecting torch_geometric
Downloading torch_geometric-2.3.1.tar.gz (661 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 661.6/661.6 kB 6.6 MB/s eta 0:00:00
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from torch_geometric) (4.66.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from torch_geometric) (1.23.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from torch_geometric) (1.10.1)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch_geometric) (3.1.2)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from torch_geometric) (2.31.0)
Requirement already satisfied: pyparsing in /usr/local/lib/python3.10/dist-packages (from torch_geometric) (3.1.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from torch_geometric) (1.2.2)
Requirement already satisfied: psutil>=5.8.0 in /usr/local/lib/python3.10/dist-packages (from torch_geometric) (5.9.5)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch_geometric) (2
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->torch_geo
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->torch_geometric) (3.4
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->torch_geometric
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->torch_geometric
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->torch_geometric
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->torch_geo
Building wheels for collected packages: torch_geometric
Building wheel for torch_geometric (pyproject.toml) ... done
Created wheel for torch_geometric: filename=torch_geometric-2.3.1-py3-none-any.whl size=910454 sha256=aa6a8a9d47349b89e0af
Stored in directory: /root/.cache/pip/wheels/ac/dc/30/e2874821ff308ee67dcd7a66dbde912411e19e35a1addda028
Successfully built torch_geometric
Installing collected packages: torch_geometric
Successfully installed torch_geometric-2.3.1

Hands-on Graph Neural Networks with PyTorch Geometric (2): Texas Dataset
Social media has become popular in recent years, and its users are connected to each other through follower relationships. Many other
types of data also have a network structure in which items are connected to each other. One class of machine learning models that can be
applied to such data is the graph neural network (GNN), which is attracting a lot of attention. In this article, I will explain the characteristics of
graph data, as well as the usage of libraries and visualization techniques related to graph neural networks.

Through this article, we will learn the following:

- How to handle PyTorch Geometric and networkx
- Characteristics of the Texas dataset
- How to effectively visualize data

import os
import collections
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.sparse as sp
import torch
from torch import nn
from torch import Tensor
import torch_geometric
from torch_geometric.nn import GCNConv
from torch_geometric.utils import to_networkx
from torch_geometric.datasets import WebKB
import networkx as nx
from networkx.algorithms import community

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
data_dir = "./data"
os.makedirs(data_dir, exist_ok=True)

cpu

Texas Dataset

The Texas dataset is part of the WebKB dataset, a collection of web pages gathered from the computer science departments of various
universities by Carnegie Mellon University. We use one of its three sub-datasets (Cornell, Texas, and Wisconsin), in which nodes represent web
pages and edges are hyperlinks between them. Node features are bag-of-words representations of the web pages. The web pages are manually
classified into five categories: student, project, course, staff, and faculty.

First, download the dataset by running the code below. In this article we will work with the data using PyTorch Geometric and networkx.

dataset = WebKB(root=data_dir, name='texas')
data = dataset[0]

Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/new_data/texas/out1_node_feature_label.txt
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/new_data/texas/out1_graph_edges.txt
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_0.npz
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_1.npz
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_2.npz
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_3.npz
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_4.npz
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_5.npz
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_6.npz
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_7.npz
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_8.npz
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_9.npz
Processing...
Done!

Nodes

The Texas dataset contains 183 web pages, which are represented as nodes in the graph.

Edges

The web pages in the Texas dataset are connected by 309 hyperlinks, which are represented as edges in the graph. Edge information is
what sets graph data apart from other kinds of data.

print(f'Number of nodes: {data.num_nodes}')
print(f'Number of edges: {data.num_edges}')

Number of nodes: 183


Number of edges: 325

The reported number of edges is 325. Some papers state that the Texas dataset has 309 edges, so why are there 16 more? The reason
is that self-loops are included in the count.

n_self_loops = 0
for i in range(data.num_edges):
    # A self-loop is an edge whose source and target are the same node
    if data.edge_index[0, i] == data.edge_index[1, i]:
        print(f'Self-loop at {data.edge_index[0, i]}')
        n_self_loops += 1
print(f"\nTotal of self-loops: {n_self_loops}")

Self-loop at 13
Self-loop at 15
Self-loop at 34
Self-loop at 45
Self-loop at 67
Self-loop at 73
Self-loop at 86
Self-loop at 90
Self-loop at 95
Self-loop at 102
Self-loop at 134
Self-loop at 148
Self-loop at 156
Self-loop at 163
Self-loop at 175
Self-loop at 180
16
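The bookkeeping behind the two edge counts can be tried out on a toy edge list. The following is a minimal sketch in plain Python; the edges are made up for illustration, and `count_self_loops` and `remove_self_loops` are hypothetical helper names, not functions from PyTorch Geometric.

```python
# Toy directed edge list as (source, target) pairs. The values are made up
# for illustration and are not taken from the Texas dataset.
edges = [(0, 1), (1, 1), (1, 2), (2, 0), (2, 2), (3, 1)]


def count_self_loops(edge_list):
    """Count edges whose source and target are the same node."""
    return sum(1 for src, dst in edge_list if src == dst)


def remove_self_loops(edge_list):
    """Return a copy of the edge list without self-loops."""
    return [(src, dst) for src, dst in edge_list if src != dst]


n_loops = count_self_loops(edges)
without_loops = remove_self_loops(edges)
print(f"{len(edges)} edges, {n_loops} self-loops, {len(without_loops)} left")

# The same bookkeeping with the numbers reported in the text.
texas_without_loops = 325 - 16
print(texas_without_loops)
```

Dropping the self-loops from the toy list leaves only edges between distinct nodes, which is the sense in which 325 edges with 16 self-loops correspond to the 309 edges quoted in papers.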

print(f'Has isolated nodes: {data.has_isolated_nodes()}')
print(f'Has self-loops: {data.has_self_loops()}')
print(f'Is undirected: {data.is_undirected()}')

Has isolated nodes: False


Has self-loops: True

Is undirected: False
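To make the three checks above concrete, here is a rough sketch of what they mean, written in plain Python on a made-up graph. The function names mirror the checks but are my own; the library's exact definitions may differ in details, for example in how a node that only has a self-loop is treated (this sketch counts it as isolated, which is an assumption).

```python
# Toy directed graph with 5 nodes (made-up data). Node 3 only has a
# self-loop and node 4 has no edges at all.
num_nodes = 5
edges = [(0, 1), (1, 0), (1, 2), (3, 3)]


def has_self_loops(edge_list):
    """True if any edge starts and ends at the same node."""
    return any(src == dst for src, dst in edge_list)


def is_undirected(edge_list):
    """True if every edge (u, v) also appears as (v, u)."""
    edge_set = set(edge_list)
    return all((dst, src) in edge_set for src, dst in edge_set)


def isolated_nodes(n, edge_list):
    """Nodes with no edge to any other node (self-loops are ignored here)."""
    touched = set()
    for src, dst in edge_list:
        if src != dst:
            touched.update((src, dst))
    return [node for node in range(n) if node not in touched]


print(has_self_loops(edges))
print(is_undirected(edges))
print(isolated_nodes(num_nodes, edges))
```

In the toy graph the edge (1, 2) has no reverse edge (2, 1), so the graph is directed, just as the Texas graph is reported to be.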

edge_index = data.edge_index.numpy()
print(edge_index.shape)
index = 15
edge_example_outward = edge_index[:, np.where(edge_index[0]==index)[0]]
edge_example_inward = edge_index[:, np.where(edge_index[1] == index)[0]]
edge_example_outward

(2, 325)
array([[ 15, 15, 15, 15, 15],
[ 15, 16, 22, 65, 165]])
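The output shows the layout of `edge_index`: a 2 x E array whose first row holds the source node of each edge and whose second row holds the target node. The sketch below reproduces the outward/inward selection on a small made-up `edge_index` using plain lists; `outgoing` and `incoming` are hypothetical helper names.

```python
# Toy edge_index in the same 2 x E layout (made-up values):
# row 0 = source of each edge, row 1 = target of each edge.
edge_index = [
    [15, 15, 15, 16, 22, 40],
    [15, 16, 22, 15, 40, 15],
]


def outgoing(edge_index, node):
    """Targets of all edges that start at `node`."""
    return [dst for src, dst in zip(edge_index[0], edge_index[1]) if src == node]


def incoming(edge_index, node):
    """Sources of all edges that end at `node`."""
    return [src for src, dst in zip(edge_index[0], edge_index[1]) if dst == node]


out_15 = outgoing(edge_index, 15)
in_15 = incoming(edge_index, 15)
# Every node that touches node 15, plus node 15 itself.
neighbourhood = sorted(set(out_15) | set(in_15) | {15})
print(out_15, in_15, neighbourhood)
```

Note that the self-loop (15, 15) shows up in both the outgoing and the incoming list.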

node_example = np.unique(np.concatenate(
    [edge_example_outward.flatten(), edge_example_inward.flatten()]))

Now, let’s try to draw a network centered on this node with networkx.

plt.figure(figsize=(10, 6))
G = nx.DiGraph()
G.add_nodes_from(node_example)
G.add_edges_from(list(zip(edge_example_outward[0], edge_example_outward[1])))
G.add_edges_from(list(zip(edge_example_inward[0], edge_example_inward[1])))
nx.draw_networkx(G, with_labels=False)

Node Degree

In graph theory, the degree of a vertex is the number of edges attached to it.

We saw earlier that there are no isolated nodes, so every node has at least one edge. How many edges does each node have on average?

print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')

Average node degree: 1.78

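We found that the average node degree is 1.78, which may seem surprisingly low. Note that this figure counts each directed edge once, so it
is the average out-degree (equivalently, the average in-degree). networkx's G.degree() on a directed graph adds the in-degree and the
out-degree, so the mean it reports is twice this value. We can check the overall distribution by drawing a histogram of the degree.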

G = to_networkx(data, to_undirected=False)
degrees = [val for (node, val) in G.degree()]
display(pd.DataFrame(pd.Series(degrees).describe()).transpose().round(2))
print(len(degrees))
print(sum(degrees))


   count  mean   std  min  25%  50%  75%    max
0  183.0  3.55  7.93  1.0  1.0  2.0  4.0  104.0


183
650
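The degree sum of 650 is exactly twice the 325 edges, because on a directed graph every edge adds one to the out-degree of its source and one to the in-degree of its target. A small sketch on made-up edges, in plain Python, shows the relation between the two averages:

```python
from collections import Counter

# Toy directed edges (made-up data), including one self-loop at node 2.
num_nodes = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 2), (3, 0)]

out_deg = Counter(src for src, _ in edges)   # edges leaving each node
in_deg = Counter(dst for _, dst in edges)    # edges entering each node
total_deg = [out_deg[n] + in_deg[n] for n in range(num_nodes)]

avg_out = len(edges) / num_nodes             # what num_edges / num_nodes measures
avg_total = sum(total_deg) / num_nodes       # in-degree plus out-degree
print(total_deg, avg_out, avg_total)

# The same relation with the Texas numbers printed above.
print(round(325 / 183, 2), round(650 / 183, 2))
```

With the Texas numbers this gives 1.78 for edges per node and 3.55 for the mean of in-degree plus out-degree, matching both outputs above.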

plt.figure(figsize=(10, 6))
plt.hist(degrees, bins=50)
plt.xlabel("node degree")
plt.show()

A high degree means that a node is connected to many other nodes (web pages). In other words, nodes with a high degree are likely to be
important.

Let's plot the graph to see where the top 10 nodes with the highest degree are located.

G = to_networkx(data, to_undirected=False)
pos = nx.spring_layout(G, seed=42)
cent = nx.degree_centrality(G)
node_size = list(map(lambda x: x * 500, cent.values()))
cent_array = np.array(list(cent.values()))
# Note: index 10 is the 11th-largest value, so `>= threshold` below marks
# at least 11 nodes; index 9 would give exactly the top 10 (ties aside).
threshold = sorted(cent_array, reverse=True)[10]
print("threshold", threshold)
cent_bin = np.where(cent_array >= threshold, 1, 0.1)

threshold 0.049450549450549455
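It can help to translate the printed threshold back into a degree. networkx's `degree_centrality` divides each node's degree by n - 1, so with 183 nodes the threshold corresponds to a degree of 9. The sketch below checks this arithmetic and also shows, on made-up values, how the index into the sorted list decides how many nodes pass the threshold; `top_k_threshold` is a hypothetical helper.

```python
import math

n_nodes = 183
threshold = 0.049450549450549455   # value printed above

# Undo the normalisation by (n - 1) to recover the degree.
degree_at_threshold = threshold * (n_nodes - 1)
print(round(degree_at_threshold, 6))


def top_k_threshold(values, k):
    """Smallest value among the k largest entries (k >= 1)."""
    return sorted(values, reverse=True)[k - 1]


# Made-up centrality values to illustrate the indexing.
toy = [0.5, 0.1, 0.4, 0.3, 0.2]
t2 = top_k_threshold(toy, 2)
selected = [v for v in toy if v >= t2]
print(t2, selected)
```

With `k - 1` as the index, `>= threshold` keeps exactly k values when there are no ties; using `k` as the index keeps one more.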

plt.figure(figsize=(12, 12))
nodes = nx.draw_networkx_nodes(G, pos, node_size=node_size,
                               # cmap=plt.cm.plasma,
                               node_color=cent_bin,
                               nodelist=list(cent.keys()),
                               alpha=cent_bin)
edges = nx.draw_networkx_edges(G, pos, width=0.25, alpha=0.3)
plt.show()


/usr/local/lib/python3.10/dist-packages/networkx/drawing/nx_pylab.py:433: UserWarning: No data for colormapping provided via


node_collection = ax.scatter(

The top 10 nodes with the highest degree are drawn as yellow dots, and the other nodes as gray dots. The size of each dot is
proportional to its degree. The yellow node in the center, which has the highest degree, has many arrows pointing outward,
indicating that it is a web page that links to many other pages. Conversely, the other yellow nodes have arrows pointing inward and seem to be
referred to by various web pages.

Features
The web pages in the Texas dataset have 1703 features.

Each of the 1703 features corresponds to one word: the value is 1 if the word appears in the web page and 0 if it does not.
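As an illustration of this kind of binary bag-of-words encoding, here is a minimal sketch in plain Python. The six-word vocabulary, the example page text, and the whitespace tokenisation are all assumptions made for the example; the dataset's actual vocabulary and preprocessing are not described here.

```python
# Hypothetical 6-word vocabulary; the real dataset uses 1703 words.
vocabulary = ["course", "exam", "faculty", "homework", "research", "student"]


def bag_of_words(text, vocab):
    """Return a 0/1 vector: 1.0 if the vocabulary word occurs in the text."""
    tokens = set(text.lower().split())
    return [1.0 if word in tokens else 0.0 for word in vocab]


page = "Homework and exam schedule for the course"
features = bag_of_words(page, vocabulary)
print(features)
```

Each page becomes a fixed-length vector regardless of its length, and word order and word frequency are discarded.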

print(f'Number of features: {data.num_node_features}')

Number of features: 1703

Let's display some of the features; you can see that they consist of 0s and 1s.

print(len(data.x[0]))
data.x[1][:20]

1703
tensor([0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1.,
0., 0.])

Note that we are now looking at the node features. Edges may also have feature values (edge features), but they are not included in the Texas
dataset.

print(f'Number of edge features: {data.num_edge_features}')

Number of edge features: 0

Classes

The web pages in the Texas dataset are labeled with 5 different labels.

print(f'Number of classes: {dataset.num_classes}')

Number of classes: 5

Let's display some of the class labels. We can see that they are integers between 0 and 4. Each number corresponds to a category as
follows.

label_dict = {
    0: "student",
    1: "project",
    2: "course",
    3: "staff",
    4: "faculty"}

data.y[:10]

tensor([3, 0, 2, 3, 4, 3, 0, 0, 3, 0])

Classes often differ in size. Let's count the number of nodes in each class.

counter = collections.Counter(data.y.numpy())
counter = dict(counter)
print(counter)
count = [x[1] for x in sorted(counter.items())]

{3: 101, 0: 33, 2: 18, 4: 30, 1: 1}

plt.figure(figsize=(10, 6))
plt.bar(range(5), count)
plt.xlabel("class", size=20)
plt.show()


The largest class is class 3 with 101 nodes, and the smallest is class 1 with a single node. We need to be careful about this imbalance when training
machine learning models.
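One simple way to keep the imbalance in view is to compute each class's share and the accuracy of always predicting the largest class. This is only a sketch of one possible sanity check, not something the notebook itself does; the counts are the ones printed above.

```python
# Class counts as printed above (class id -> number of nodes).
counts = {3: 101, 0: 33, 2: 18, 4: 30, 1: 1}

total = sum(counts.values())
shares = {cls: n / total for cls, n in sorted(counts.items())}
for cls, share in shares.items():
    print(f"class {cls}: {share:.1%}")

# Accuracy of a trivial model that always predicts the largest class.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total
print(majority_class, round(baseline_accuracy, 3))
```

A classifier evaluated on all nodes would need to beat roughly 55% accuracy before it can be said to have learned anything beyond the class frequencies.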

Next, draw a network diagram to see if the classes are distributed coherently.

G = to_networkx(data, to_undirected=False)
node_color = []
nodelist = [[], [], [], [], []]
colorlist = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00']
labels = data.y
for n, i in enumerate(labels):
    node_color.append(colorlist[i])
    nodelist[i].append(n)
pos = nx.spring_layout(G, seed = 42)

plt.figure(figsize = (10, 10))
labellist = list(label_dict.values())
for num, i in enumerate(zip(nodelist, labellist)):
    n, l = i[0], i[1]
    nx.draw_networkx_nodes(G, pos, nodelist=n, node_size = 20, node_color = colorlist[num], label=l)
nx.draw_networkx_edges(G, pos, width = 0.2, alpha = 0.3)
plt.legend(bbox_to_anchor=(1, 1), loc='upper left')
plt.savefig("texas.png", dpi = 150, bbox_inches="tight")
plt.show()

It is somewhat difficult to see, but nodes of the same class do not seem to be connected to each other very often. We will analyze this point from a
different angle in the next section.

Homophily and Heterophily

Nodes with the same characteristics are often connected. This property is called homophily, and the opposite property is called heterophily. For
the five classes we looked at earlier, we will count how many edges connect nodes of the same class and how many connect nodes of different classes.

labels = data.y.numpy()
# Map both rows of edge_index (source, target) to the class label of each endpoint
connected_labels_set = list(map(lambda x: labels[x], data.edge_index.numpy()))
connected_labels_set = np.array(connected_labels_set)

def add_missing_keys(counter, classes):
    for x in classes:
        if x not in counter.keys():
            counter[x] = 0
    return counter

label_connection_counts = []

for i in range(5):
    print(f"label: {i}")
    # Keep only the edges whose source node has class i
    connected_labels = connected_labels_set[:, np.where(connected_labels_set[0] == i)[0]]
    print(connected_labels.shape[1], "edges")
    counter = collections.Counter(connected_labels[1])
    counter = dict(counter)
    print(counter)
    counter = add_missing_keys(counter, range(5))
    items = sorted(counter.items())
    items = [x[1] for x in items]
    label_connection_counts.append(items)

label_connection_counts = np.array(label_connection_counts)

label: 0
138 edges
{4: 30, 3: 94, 0: 3, 2: 11}
label: 1
2 edges
{2: 2}
label: 2
58 edges
{3: 39, 4: 12, 2: 6, 0: 1}
label: 3
54 edges
{2: 26, 0: 7, 3: 20, 4: 1}
label: 4
73 edges
{0: 24, 3: 25, 4: 6, 2: 18}

plt.figure(figsize=(9, 7))
plt.rcParams["font.size"] = 13
hm = sns.heatmap(label_connection_counts, annot=True, cmap='hot_r', cbar=True, square=True)
plt.xlabel("class",size=20)
plt.ylabel("class",size=20)
plt.tight_layout()
plt.show()


We can see that very few edges connect nodes belonging to the same class.

By dividing the sum of the diagonal components of the matrix by the sum of all components, we calculate the percentage of edges connected
within the same class.

label_connection_counts.diagonal().sum() / label_connection_counts.sum()

0.1076923076923077

About 11% of the edges connect nodes of the same class.
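The same ratio can also be computed directly from an edge list, without building the class-by-class matrix first. The sketch below does this in plain Python on a made-up graph; `edge_homophily` and its `skip_self_loops` flag are hypothetical names. It also shows one subtlety: a self-loop always joins a node to its own class, so the 16 self-loops are among the 35 same-class edges on the diagonal of the matrix above.

```python
# Toy graph (made-up data): one class label per node and directed edges.
labels = [0, 0, 1, 1, 2]
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (4, 0), (4, 4)]


def edge_homophily(edge_list, node_labels, skip_self_loops=False):
    """Fraction of edges whose two endpoints share a class label."""
    kept = [(s, d) for s, d in edge_list if not (skip_self_loops and s == d)]
    same = sum(1 for s, d in kept if node_labels[s] == node_labels[d])
    return same / len(kept)


with_loops = edge_homophily(edges, labels)
without_loops = edge_homophily(edges, labels, skip_self_loops=True)
print(with_loops, without_loops)

# Texas numbers from the text: 35 same-class edges out of 325,
# of which 16 are self-loops.
texas_all = 35 / 325
texas_no_loops = (35 - 16) / (325 - 16)
print(round(texas_all, 4), round(texas_no_loops, 4))
```

If self-loops are left out, the share of same-class edges in the Texas graph drops from about 11% to about 6%.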

Next, we will scale each row so that the sum of each row is equal to 1 for easy comparison.

def scaling(array):
    return array / sum(array)
label_connection_counts_scaled = np.apply_along_axis(scaling, 1, label_connection_counts)
plt.figure(figsize=(9, 7))
plt.rcParams["font.size"] = 13
hm = sns.heatmap(
    label_connection_counts_scaled,
    annot=True,
    cmap='hot_r',
    fmt="1.2f",
    cbar=True,
    square=True)
plt.xlabel("class",size=20)
plt.ylabel("class",size=20)
plt.tight_layout()
plt.show()


We can see that for every class, most edges lead to nodes of a different class. Class 3 is the most homophilous, with about 37% of its edges
staying within the class, yet even class 3 nodes are most often connected to class 2 nodes. In the other classes, roughly 90% or more of the
edges are connected to nodes of different classes.

There is only one node of class 1, and all of its edges are connected to nodes of another class.
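These statements can be checked against the counts printed earlier. The following sketch rebuilds the class-by-class matrix by hand from those printed dictionaries (row = class of the source node, column = class of the target node) and computes, for each class, the share of its edges that stay within the class.

```python
# Class-to-class edge counts as printed above.
# Row = class of the source node, column = class of the target node.
counts = [
    [3, 0, 11, 94, 30],
    [0, 0, 2, 0, 0],
    [1, 0, 6, 39, 12],
    [7, 0, 26, 20, 1],
    [24, 0, 18, 25, 6],
]

same_class_share = []
top_targets = []
for cls, row in enumerate(counts):
    total = sum(row)
    share = row[cls] / total                 # diagonal entry / row sum
    top_target = max(range(len(row)), key=row.__getitem__)
    same_class_share.append(share)
    top_targets.append(top_target)
    print(f"class {cls}: {total} edges, same-class share {share:.3f}, "
          f"most frequent target class {top_target}")
```

Class 3 comes out at about 0.370 and class 2 at about 0.103, while classes 0, 1, and 4 are all below 0.1.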

Train Test Split

Last but not least, we will discuss data splitting. The data object we are working with stores split information as masks, which
divide the nodes into training, validation, and test sets. First, let's see how many nodes each set contains.

print(f'Number of training nodes: {data.train_mask[:, 0].sum()}')
print(f'Number of validation nodes: {data.val_mask[:, 0].sum()}')
print(f'Number of test nodes: {data.test_mask[:, 0].sum()}')

Number of training nodes: 87


Number of validation nodes: 59
Number of test nodes: 37

The data was split into 87 training nodes, 59 validation nodes, and 37 test nodes. Since 87 + 59 + 37 = 183, every node is used.

We encode the split in an array indexed by node: 0 for unused data, 1 for training data, 2 for validation data, and 3 for test data. It is an odd split, but
it appears that the data is split as above.
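The masks have one column per split version, which is why the code above selects column 0 with `[:, 0]`. The sketch below imitates this layout with made-up masks for six nodes and two split versions, stored as plain nested lists, and turns one column into the 0/1/2/3 encoding described above; `encode_split` is a hypothetical helper.

```python
# Toy masks (made-up data) for 6 nodes and 2 split versions:
# mask[node][split] is True if the node belongs to that set in that split.
train_mask = [[True, False], [True, True], [False, True],
              [False, False], [True, False], [False, True]]
val_mask = [[False, True], [False, False], [True, False],
            [False, True], [False, False], [False, False]]
test_mask = [[False, False], [False, False], [False, False],
             [True, False], [False, True], [False, False]]


def encode_split(train, val, test, split):
    """Return one code per node: 0 unused, 1 training, 2 validation, 3 test."""
    codes = []
    for tr, va, te in zip(train, val, test):
        if te[split]:
            codes.append(3)
        elif va[split]:
            codes.append(2)
        elif tr[split]:
            codes.append(1)
        else:
            codes.append(0)
    return codes


codes_0 = encode_split(train_mask, val_mask, test_mask, split=0)
codes_1 = encode_split(train_mask, val_mask, test_mask, split=1)
print(codes_0)
print(codes_1)
```

The same node can land in different sets depending on the split version, which is what changing `split_version` in the notebook explores.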

Finally, let's see what the class proportions are in each set of data.

split_version = 0 # 0 to 9
split_type_array = np.zeros(data.num_nodes)
split_type_array[np.where(data.train_mask[:, split_version] == True)[0]] = 1
split_type_array[np.where(data.val_mask[:, split_version] == True)[0]] = 2
split_type_array[np.where(data.test_mask[:, split_version] == True)[0]] = 3
split_type_array
titles = ["Training", "Validation", "Test"]

fig, axes = plt.subplots(ncols=3, figsize=(21, 6))
for i in range(3):
    # Class labels of the nodes in the i-th set (1 = training, 2 = validation, 3 = test)
    counter = collections.Counter(data.y.numpy()[np.where(split_type_array == i + 1)[0]])
    counter = dict(counter)
    counter = add_missing_keys(counter, range(dataset.num_classes))
    print(titles[i], counter)
    count = [x[1] for x in sorted(counter.items())]
    axes[i].bar(range(dataset.num_classes), count)
    axes[i].set_xlabel("class", size=20)
    axes[i].set_title(titles[i])
plt.show()


Training {3: 46, 2: 7, 4: 20, 0: 14, 1: 0}


Validation {0: 15, 3: 31, 1: 1, 4: 5, 2: 7}
Test {3: 24, 0: 4, 2: 4, 4: 5, 1: 0}

In the Cora dataset, the training data contained an equal number of nodes from each class, but this is not the case for the Texas data, because
there are so few nodes to begin with. Given the class imbalance, the Texas dataset comes with ten different data splits. If you are interested, you
can see how the histogram changes when you change the value of split_version.
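To compare the sets numerically rather than by eye, the printed counts can be turned into class shares. This is a small sketch using only the numbers printed above for split version 0; it also checks that the three sets add up to the overall class counts.

```python
# Per-set class counts as printed above for split_version = 0.
splits = {
    "Training": {3: 46, 2: 7, 4: 20, 0: 14, 1: 0},
    "Validation": {0: 15, 3: 31, 1: 1, 4: 5, 2: 7},
    "Test": {3: 24, 0: 4, 2: 4, 4: 5, 1: 0},
}

totals = {}
shares = {}
for name, counts in splits.items():
    totals[name] = sum(counts.values())
    shares[name] = [counts[c] / totals[name] for c in range(5)]
    print(name, totals[name], [f"{s:.2f}" for s in shares[name]])

# Summing the three sets recovers the overall class counts.
overall = [sum(splits[name][c] for name in splits) for c in range(5)]
print(overall)
```

Class 3 makes up a bit over half of the training and validation nodes and nearly two thirds of the test nodes, and the single class 1 node falls in the validation set, so in this split the model never sees that class during training.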
