图神经网络（03）-图与图学习(二) - 文章 - 开发者社区

在上篇中图与图学习（上），我们简单学习了图论的基本概念，图的表示和存储方式，同构图和异构图的分类，以及几个基础的图论算法。在接下来的前置教程下篇中，我们将会学习图机器学习。

首先我们导入需要的包


            
              
import numpy as np  
import random  
import networkx as nx  
from IPython.display import Image  
import matplotlib.pyplot as plt  
from sklearn.metrics import accuracy_score  
from sklearn.metrics import roc_curve  
from sklearn.metrics import roc_auc_score

图学习的主要任务

picture.image

图学习中包含三种主要的任务：

链接预测（Link prediction）
节点标记预测（Node labeling）
图嵌入（Graph Embedding）

链接预测（Link prediction）

picture.image

在链接预测中，给定图G，我们的目标是预测新边。例如，当图未被完全观察时，或者当新客户加入平台（例如，新的LinkedIn用户）时，预测未来关系或缺失边是很有用的。知乎回答：如何理解链接预测（link prediction）

新LinkedIn用户的链接预测只是给出它可能认识的人的建议。

在链路预测中，我们只是尝试在节点对之间建立相似性度量，并链接最相似的节点。现在的问题是识别和计算正确的相似性分数！

为了说明图中不同链路的相似性差异，让我们通过下面这个图来解释：

picture.image

设是节点的一组邻居。在上图中，节点和的邻居可以表示为：

picture.image

的邻居：

picture.image

1. 相似度分数

我们可以根据它们的邻居为这两个节点建立几个相似度分数。

公共邻居：，即公共邻居的数量。在此示例中，分数将为2，因为它们仅共享2个公共邻居。

picture.image

Jaccard系数：,标准化的共同邻居版本。

交集是共同的邻居，并集是：

picture.image

因此，Jaccard系数由粉红色与黄色的比率计算出：

picture.image

值是。

Adamic-Adar指数：。对于节点i和j的每个公共邻居（common neighbor），我们将1除以该节点的邻居总数。这个概念是，当预测两个节点之间的连接时，与少量节点之间共享的元素相比，具有非常大的邻域的公共元素不太重要。
优先依附(Preferential attachment)：
当社区信息可用时，我们也可以在社区信息中使用它们。

2. 性能指标(Performance metrics)

我们如何进行链接预测的评估？我们必须隐藏节点对的子集，并根据上面定义的规则预测它们的链接。这相当于监督学习中的train/test的划分。然后，我们评估密集图的正确预测的比例，或者使用稀疏图的标准曲线下的面积(AUC)。

这里继续用空手道俱乐部图来举例：


            
              
# 我们导入空手道俱乐部图  
n=34  
m = 78  
G_karate = nx.karate_club_graph()  
  
pos = nx.spring_layout(G_karate)  
nx.draw(G_karate, cmap = plt.get_cmap('rainbow'), with_labels=True, pos=pos)

picture.image


            
              
# 我们首先把有关图的信息打印出来：  
n = G_karate.number_of_nodes()  
m = G_karate.number_of_edges()  
print("Number of nodes : %d" % n)  
print("Number of edges : %d" % m)  
print("Number of connected components : %d" % nx.number_connected_components(G_karate))


              
Number of nodes : 34  
Number of edges : 78  
Number of connected components : 1


            
              
plt.figure(figsize=(12,8))  
nx.draw(G_karate, pos=pos)  
plt.gca().collections[0].set_edgecolor("#000000")

picture.image


            
              
# 现在，让我们删除一些连接，例如25％的边：  
# Take a random sample of edges  
edge_subset = random.sample(G_karate.edges(), int(0.25 * G_karate.number_of_edges()))  
  
# remove some edges  
G_karate_train = G_karate.copy()  
G_karate_train.remove_edges_from(edge_subset)


            
              
# 绘制部分观察到的图，可以对比上图发现，去掉了一些边  
plt.figure(figsize=(12,8))  
nx.draw(G_karate_train, pos=pos)

picture.image


            
              
# 你可以打印我们删除的边数和剩余边数：  
edge_subset_size = len(list(edge_subset))  
print("Deleted : ", str(edge_subset_size))  
print("Remaining : ", str((m - edge_subset_size)))


              
Deleted :  19  
Remaining :  59

计算Jaccard Coefficient


            
              
# 我们可以先使用Jaccard系数进行预测：  
# Make prediction using Jaccard Coefficient  
pred_jaccard = list(nx.jaccard_coefficient(G_karate_train))  
score_jaccard, label_jaccard = zip(*[(s, (u,v) in edge_subset) for (u,v,s) in pred_jaccard])  
# 打印前10组结果  
print(pred_jaccard[0:10])  
# 预测结果如下，其中第一个是节点，第二个是节点，最后一个是Jaccard分数（用来表示两个节点之间边预测的概率）


              
[(0, 2, 0.125), (0, 3, 0.15384615384615385), (0, 4, 0.1), (0, 7, 0.08333333333333333), (0, 9, 0.0), (0, 13, 0.09090909090909091), (0, 14, 0.0), (0, 15, 0.0), (0, 16, 0.1), (0, 18, 0.0)]


            
              
# 然后我们可以使用ROC-AUC标准来比较不同模型的性能，因为我们既有真实的边（label），也有预测边的概率（score）  
# Compute the ROC AUC Score  
# 其中，FPR是False Positive Rate， TPR是True Positive Rate  
fpr_jaccard, tpr_jaccard, _ = roc_curve(label_jaccard, score_jaccard)  
auc_jaccard = roc_auc_score(label_jaccard, score_jaccard)  
print(auc_jaccard)


              
0.6151792524790236

计算Adamic-Adar


            
              
# 我们现在计算Adamic-Adar指数和对应的ROC-AUC分数  
# Prediction using Adamic Adar   
pred_adamic = list(nx.adamic_adar_index(G_karate_train))  
score_adamic, label_adamic = zip(*[(s, (u,v) in edge_subset) for (u,v,s) in pred_adamic])  
print(pred_adamic[0:10])  
# Compute the ROC AUC Score  
fpr_adamic, tpr_adamic, _ = roc_curve(label_adamic, score_adamic)  
auc_adamic = roc_auc_score(label_adamic, score_adamic)  
print(auc_adamic)


              
[(0, 2, 1.9235933878519513), (0, 3, 1.9235933878519513), (0, 4, 0.9102392266268373), (0, 7, 0.48089834696298783), (0, 9, 0), (0, 13, 0.48089834696298783), (0, 14, 0), (0, 15, 0), (0, 16, 0.7213475204444817), (0, 18, 0)]  
0.7230576441102756

计算Preferential Attachment


            
              
# 同样，我们可以计算Preferential Attachment得分和对应的ROC-AUC分数  
# Compute the Preferential Attachment  
pred_pref = list(nx.preferential_attachment(G_karate_train))  
score_pref, label_pref = zip(*[(s, (u,v) in edge_subset) for (u,v,s) in pred_pref])  
print(pred_pref[0:10])  
fpr_pref, tpr_pref, _ = roc_curve(label_pref, score_pref)  
auc_pref = roc_auc_score(label_pref, score_pref)  
print(auc_pref)


              
[(0, 2, 80), (0, 3, 50), (0, 4, 10), (0, 7, 30), (0, 9, 20), (0, 13, 20), (0, 14, 10), (0, 15, 10), (0, 16, 10), (0, 18, 20)]  
0.7427263811703171

绘制ROC-AUC来评价预测的效果


            
              
plt.figure(figsize=(12, 8))  
plt.plot(fpr_jaccard, tpr_jaccard, label='Jaccard Coefficient - AUC %.2f' % auc_jaccard, linewidth=4)  
plt.plot(fpr_adamic, tpr_adamic, label='Adamic-Adar - AUC %.2f' % auc_adamic, linewidth=4)  
plt.plot(fpr_pref, tpr_pref, label='Preferential Attachment - AUC %.2f' % auc_pref, linewidth=4)  
plt.plot([0, 1], [0, 1], 'k--')  
plt.xlim([0.0, 1.0])  
plt.ylim([0.0, 1.0])  
plt.xlabel('False positive rate')  
plt.ylabel('True positive rate')  
plt.title("ROC AUC Curve")  
plt.legend(loc='lower right')  
plt.show()

picture.image

关于更多性能评价介绍，可以阅读博客模型评估指标AUC和ROC

节点标记预测（Node labeling）

picture.image

给定一个未标记某些节点的图，我们希望对这些节点的标签进行预测。这在某种意义上是一种半监督的学习问题。

处理这些问题的一种常见方法是假设图上有一定的平滑度。平滑度假设指出通过数据上的高密度区域的路径连接的点可能具有相似的标签。这是标签传播算法背后的主要假设。

标签传播算法（Label Propagation Algorithm,LPA）是一种快速算法，仅使用网络结构作为指导来发现图中的社区，而无需任何预定义的目标函数或关于社区的先验信息。

picture.image

image

单个标签在密集连接的节点组中迅速占据主导地位，但是在穿过稀疏连接区域时会遇到问题。

半监督标签传播算法是如何工作？

首先，我们有一些数据：,，以及前个点的标签：.

我们定义初始标签矩阵，如果具有标签则，否则为0。

该算法将生成预测矩阵，我们将在下面详述。然后，我们通过查找最可能的标签来预测节点的标签：

预测矩阵是什么？

预测矩阵是矩阵，其最小化平滑度和准确度。因此，我们的结果在平滑性和准确性之间进行权衡。

问题的描述非常复杂，所以我将不会详细介绍。但是，解决方案是：

其中：

参数
是给定的标签
是图的归一化拉普拉斯矩阵（Laplacian matrix）

如果您想进一步了解这个主题，请关注图函数的平滑度和流形正则化的概念。

接下来我们用python来实现节点标签的预测。为了给我们使用到的标签添加更多的特征，我们需要使用来自Facebook的真实数据。你可以再这里下载，然后放到facebook路径下。


            
              
G_fb = nx.read_edgelist("facebook/414.edges")  
n = G_fb.number_of_nodes()  
m = G_fb.number_of_edges()  
  
print("Number of nodes: %d" % n)  
print("Number of edges: %d" % m)  
print("Number of connected components: %d" % nx.number_connected_components(G_fb))


              
Number of nodes: 150  
Number of edges: 1693  
Number of connected components: 2

我们使用到的414号数据，图中有150个节点和1693个边。其中，有2个连通分支构成，这意味着这个图中的两块是分开来的。


            
              
# 我们把图数据显示出来：  
mapping=dict(zip(G_fb.nodes(), range(n)))  
nx.relabel_nodes(G_fb, mapping, copy=False)  
pos = nx.spring_layout(G_fb)  
  
plt.figure(figsize=(12,8))  
nx.draw(G_fb, node_size=200, pos=pos)  
plt.gca().collections[0].set_edgecolor("#000000")

picture.image

这个图中包含了不同的特征。在这些可用的特征中，我们将利用第34组特征来进行实验。这组特征描述了这个某个facebook用户（节点）是否是一所学校的学生。这里我们两种标签（1是红色的，代表改用户是这所学校的学生，反正，0则用蓝色表示）


            
              
with open('facebook/414.featnames') as f:  
    for i, l in enumerate(f):  
        pass  
  
n_feat = i+1  
  
features = np.zeros((n, n_feat))  
f = open('facebook/414.feat', 'r')  
  
for line in f:  
    if line.split()[0] in mapping:  
        node_id = mapping[line.split()[0]]  
        features[node_id, :] = list(map(int, line.split()[1:]))  
  
features = 2*features-1  
feat_id = 43  
labels = features[:, feat_id]  
  
plt.figure(figsize=(12,8))  
nx.draw(G_fb, cmap = plt.get_cmap('bwr'), nodelist=range(n), node_color = labels, node_size=200, pos=pos)  
plt.gca().collections[0].set_edgecolor("#000000")  
plt.show()

picture.image

这个所选择的特征，在图中相对平滑，因此拥有较好的学习传播性能。

为了阐述节点标签预测是如何进行的，我们首先要删掉一些节点的标签，作为要预测的对象。这里我们只保留了30%的节点标签：


            
              
random.seed(5)  
proportion_nodes = 0.3  
labeled_nodes = random.sample(G_fb.nodes(), int(proportion_nodes * G_fb.number_of_nodes()))  
  
known_labels = np.zeros(n)  
known_labels[labeled_nodes] = labels[labeled_nodes]  
  
plt.figure(figsize=(12,8))  
nx.draw(G_fb, cmap = plt.get_cmap('bwr'), nodelist=range(n), node_color = known_labels, node_size=200, pos=pos)  
plt.gca().collections[0].set_edgecolor("#000000") # set node border color to black  
plt.show()

picture.image

然后，我们就可以进行标签的传播预测：


            
              
alpha = 0.7  
L_sym = nx.normalized_laplacian_matrix(G_fb)  
  
Y = np.zeros((n,2))  
Y[known_labels==-1, 0] = 1  
Y[known_labels==1, 1] = 1  
I = np.identity(n)  
  
# Create the F-pred matrix  
F_pred = np.linalg.inv(I*(1-alpha) + L_sym) * Y  
# Identify the prediction as the argmax  
pred = np.array(np.argmax(F_pred, axis=1)*2-1).flatten()  
# Compute the accuracy score  
succ_rate = accuracy_score(labels, pred)


            
              
# 我们把结果打印出来  
plt.figure(figsize=(18, 6))  
f, axarr = plt.subplots(1, 2, num=1)  
  
# Plot true values  
plt.sca(axarr[0])  
nx.draw(G_fb, cmap = plt.get_cmap('bwr'), nodelist=range(n), node_color = labels, node_size=200, pos=pos)  
axarr[0].set_title('True labels', size=16)  
plt.gca().collections[0].set_edgecolor("#000000")  
  
# Plot predicted values  
plt.sca(axarr[1])  
nx.draw(G_fb, cmap = plt.get_cmap('bwr'), nodelist=range(n), node_color = pred, node_size=200, pos=pos)  
axarr[1].set_title('Predicted labels (Success Rate: %.2f)' % succ_rate, size=16)  
plt.gca().collections[0].set_edgecolor("#000000")