CN112308326B

CN112308326B - Biological network link prediction method based on meta-path and bidirectional encoder

Info

Publication number: CN112308326B
Application number: CN202011226195.3A
Authority: CN
Inventors: 彭绍亮; 王小奇; 李非; 辛彬; 肖霞; 王红; 张兴龙
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2020-11-05
Filing date: 2020-11-05
Publication date: 2022-12-13
Anticipated expiration: 2040-11-05
Also published as: CN112308326A

Abstract

The invention belongs to the field of computer science and discloses a biological network link prediction method based on meta-paths and a bidirectional encoder. First, a multi-source heterogeneous drug information network is constructed, and multiple meta-paths are designed for sequence sampling to form a large-scale semantic information base. Second, a deep Transformer encoder is combined with a masked language model to design a deep bidirectional encoding representation model that extracts a low-dimensional representation vector for each node. Finally, inductive matrix completion is used to predict biological links such as disease-protein associations, protein-drug interactions, and drug-side effect associations, thereby completing a disease-target-drug-side effect technology system for drug research and development.

Description

A Link Prediction Method for Biological Networks Based on Meta-Path and Bidirectional Encoder

Technical field

The invention belongs to the field of computer science and relates to the application of artificial intelligence technology, in particular to a biological network link prediction method based on meta-paths and a bidirectional encoder.

Background

Given a set of biomedical entities and their known interactions, predicting other potential interactions (links) between the entities is one of the most important tasks in biomedicine. More and more researchers therefore use computational techniques to predict potential interactions in various biomedical networks.

Traditional approaches in the biomedical field have devoted considerable effort to developing biologically relevant features, such as chemical substructures, gene ontology, and topological similarity. Meanwhile, supervised learning methods and semi-supervised graph inference models have been used to predict potential interactions. These methods rest mainly on the similarity assumption, namely that entities with similar biological or structural features are likely to have similar links. However, prediction methods based on biological features usually face two problems: (1) extracting biological features is costly, and some features are difficult to obtain at all; entities without features can be deleted during preprocessing, but this usually yields a smaller data set and loses important information, so it is impractical in real applications; (2) biological features may not be precise enough to represent biomedical entities, and stable, accurate models may not be obtainable from them.

Network representation methods, which attempt to learn low-dimensional vectors of network nodes automatically, are expected to solve both problems and are widely used in biological link prediction. For example, matrix factorization techniques have been used to predict drug-disease associations; some researchers have proposed manifold-regularized matrix factorization, which incorporates Laplacian regularization to learn better drug representations and thereby improves the prediction of drug-drug interactions. In addition, network representation methods based on random walks and on deep neural networks have been proposed. However, existing methods either focus only on the structural features between network nodes and ignore the semantic information between network entities, or capture only short structures and meta-paths and cannot deeply mine the structural and semantic relationships between network nodes.

Summary of the invention

To overcome the deficiencies of the techniques described above, the invention provides a biological network link prediction method based on meta-paths and a bidirectional encoder. First, a multi-source heterogeneous drug information network is constructed, and multiple meta-paths are designed for sequence sampling to form a large-scale semantic information base. Second, a deep Transformer encoder is combined with a masked language model to design a deep bidirectional encoding representation model that extracts a low-dimensional representation vector for each node. Finally, inductive matrix completion is used to predict biological links such as disease-protein associations, protein-drug interactions, and drug-side effect associations, thereby completing a disease-target-drug-side effect technology system for drug research and development.

The technical scheme adopted by the invention is:

A biological network link prediction method based on meta-paths and a bidirectional encoder, comprising the following steps:

1) Parameter initialization, including: the network sequence length l, the node degree threshold deg, the representation vector dimension dim, the number of Transformer encoder layers n, the mask sequence ratio of the language model k ∈ (0,1), the probability p ∈ (0,1) that a mask sequence is replaced by the special token [MASK], and the probability p′ ∈ (0,1-p) that a mask sequence is replaced by another sequence from the semantic text;

2) Construct the drug information network and the meta-paths;

3) Number all nodes in the network as x_i ∈ {x_i | i = 1, 2, ..., num}, where num is the total number of nodes, and sample from each node x_i in turn according to the meta-paths of step 2);

4) Feed all semantic sequences into the deep bidirectional Transformer encoder for representation learning to obtain low-dimensional representation vectors of the nodes, where every Transformer layer contains the same multi-head self-attention mechanism and fully connected network;

5) Check whether the maximum number of training iterations has been reached; if so, output the representation vector of each node and go to step 6); otherwise, return to step 4);

6) Use the inductive matrix completion method to predict disease-protein associations;

7) In the same way as the disease-protein association prediction of step 6), use the inductive matrix completion method to predict target-drug interactions;

8) In the same way as the disease-protein association prediction of step 6), use the inductive matrix completion method to predict drug-side effect associations.
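The parameters initialized in step 1) and their stated ranges can be collected in a small configuration object. The sketch below is illustrative only: the class name `Params`, the field name `p_prime`, and every default value are assumptions, since the source gives the symbols and their intervals but no concrete values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    """Hyperparameters of step 1); all default values are illustrative assumptions."""
    l: int = 32            # network sequence length
    deg: int = 2           # node degree threshold
    dim: int = 128         # representation vector dimension
    n: int = 4             # number of Transformer encoder layers
    k: float = 0.15        # mask sequence ratio, k in (0, 1)
    p: float = 0.8         # probability of replacement by [MASK], p in (0, 1)
    p_prime: float = 0.1   # probability of replacement by another item, p' in (0, 1 - p)

    def __post_init__(self):
        # Enforce the open intervals stated in step 1).
        if min(self.l, self.dim, self.n) < 1 or self.deg < 0:
            raise ValueError("l, dim and n must be positive and deg non-negative")
        if not 0 < self.k < 1:
            raise ValueError("k must lie in (0, 1)")
        if not 0 < self.p < 1:
            raise ValueError("p must lie in (0, 1)")
        if not 0 < self.p_prime < 1 - self.p:
            raise ValueError("p' must lie in (0, 1 - p)")


params = Params()
```

Validating the intervals at construction time keeps an inconsistent pair such as p = 0.9, p′ = 0.2 from reaching the masking step.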

As a further improvement of the invention, step 2) is carried out as follows:

2.1) A drug information network with 4 node types (drugs, targets, diseases, and side effects) and 7 edge types is built from the public databases DrugBank, UniProt, HPRD, SIDER, CTD, NDFRT, and STRING, and nodes with degree less than deg are deleted. The 7 edge types are drug-drug interactions, drug-protein interactions, drug-disease associations, drug-side effect associations, protein-disease associations, drug-drug structural similarity, and protein-protein sequence similarity;

2.2) 23 meta-paths are constructed according to different biological pathways and drug mechanisms, namely: drug-protein, drug-protein-drug, drug-protein-protein, drug-protein-disease, drug-protein-protein-drug, drug-protein-protein-disease, drug-protein-drug-protein, drug-protein-drug-disease, drug-protein-drug-side effect, drug-protein-disease-protein, drug-protein-disease-drug, protein-drug-drug, protein-drug-protein, protein-drug-disease, protein-drug-side effect, protein-drug-drug-protein, protein-drug-drug-disease, protein-drug-drug-side effect, protein-drug-protein-protein, protein-drug-protein-disease, protein-drug-disease-protein, protein-drug-disease-drug, protein-drug-side effect-drug;
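Network construction with degree filtering (step 2.1) and meta-path-guided sampling (step 3) can be illustrated on a toy graph. The following is a minimal sketch, assuming a single-pass degree filter and a uniform random choice among type-compatible neighbours at each hop; the node names, the helper names `build_network` and `sample_path`, and the sampling rule itself are hypothetical, since the source does not spell out how a path is drawn. The sequence length l is not enforced here.

```python
import random
from collections import defaultdict

# Toy typed nodes and edges; every name here is hypothetical.
NODE_TYPE = {"d1": "drug", "d2": "drug", "p1": "protein", "p2": "protein",
             "s1": "disease", "e1": "side effect"}
EDGES = [("d1", "p1"), ("d2", "p1"), ("p1", "p2"), ("p1", "s1"),
         ("d1", "e1"), ("d2", "s1"), ("d1", "d2")]


def build_network(edges, deg):
    """Build an undirected adjacency map and drop nodes with degree < deg (one pass)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    keep = {u for u, nbrs in adj.items() if len(nbrs) >= deg}
    return {u: {v for v in adj[u] if v in keep} for u in keep}


def sample_path(adj, start, meta_path, rng):
    """Walk from `start`, picking at each hop a random neighbour of the required type."""
    if NODE_TYPE[start] != meta_path[0]:
        return None
    path = [start]
    for wanted in meta_path[1:]:
        candidates = sorted(v for v in adj[path[-1]] if NODE_TYPE[v] == wanted)
        if not candidates:
            return None  # the meta-path cannot be completed from this node
        path.append(rng.choice(candidates))
    return path


adj = build_network(EDGES, deg=2)
rng = random.Random(0)
meta_path = ("drug", "protein", "disease")  # one of the 23 meta-paths
corpus = [p for x in sorted(adj) if (p := sample_path(adj, x, meta_path, rng))]
```

Each sampled path is one "semantic sequence"; repeating the loop over all 23 meta-paths and all nodes would produce the corpus that the encoder is trained on.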

As a further improvement of the invention, step 4) is carried out as follows:

4.1) All semantic sequences are tokenized, which includes removing special and redundant characters and splitting on spaces, and are then processed with the masked language model. Mask sequences are selected at random from all semantic sequences according to the mask ratio k. For each mask sequence, a random number rand ∈ [0,1] is generated. If rand < p, the sequence is replaced by [MASK], where p ∈ (0,1) is the probability that a mask sequence is replaced by [MASK]; if p ≤ rand < p+p′, a sequence chosen at random from the semantic sequences replaces the mask sequence, where p′ ∈ (0,1-p) is the probability that a mask sequence is replaced by another sequence; if p+p′ ≤ rand < 1, the mask sequence is left unchanged;

4.2) The initial representation vector and the position vector of each node are added together and fed into the multi-head attention mechanism to learn a vector, to which a residual connection and normalization are then applied. Next, a fully connected feed-forward network, itself followed by a residual connection and normalization, learns further, finally yielding the low-dimensional representation vector of the node.
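The three-way replacement rule of step 4.1) can be sketched as follows. This is a hedged illustration that assumes masking is applied to individual tokens (node identifiers) within one sampled sequence, in the style of a standard masked language model; the function name `mask_tokens`, the `vocab` argument, and the choice of at least one masked position per sequence are assumptions rather than details from the source.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, k, p, p_prime, rng):
    """Return (corrupted tokens, sorted positions selected for prediction)."""
    n_select = max(1, round(k * len(tokens)))        # mask ratio k
    positions = sorted(rng.sample(range(len(tokens)), n_select))
    out = list(tokens)
    for i in positions:
        r = rng.random()                             # uniform in [0, 1)
        if r < p:
            out[i] = MASK                            # replaced by [MASK]
        elif r < p + p_prime:
            out[i] = rng.choice(vocab)               # replaced by a random item
        # otherwise (p + p' <= r < 1) the token is left unchanged
    return out, positions


tokens = ["d1", "p1", "s1", "p2", "d2", "e1", "d3", "p3", "s2", "p4"]
corrupted, positions = mask_tokens(tokens, vocab=tokens, k=0.3, p=0.8,
                                   p_prime=0.1, rng=random.Random(1))
```

The model is then trained to recover the original token at each selected position, including the positions left unchanged.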

As a further improvement of the invention, step 6) is carried out as follows:

6.1) Count the number Ninter of disease-protein associations in the network, randomly select the same number Ninter of negative samples from the disease-protein association network, mix the positive and negative samples, and perform 10-fold cross-validation;

6.2) Reconstruct the heterogeneous network with the inductive matrix completion model after removing the association information of the test set. Specifically, node link prediction is converted into an optimization problem in which r ranges over the 7 network edge types, P_r is the adjacency matrix of the corresponding single network, Z_r is the low-rank matrix to be solved for that single network, and V_u and V_w are the feature vectors of the nodes in the single network. The 7 network edge types are: drug-drug interactions, drug-protein interactions, drug-disease associations, drug-side effect associations, protein-disease associations, drug-drug structural similarity, and protein-protein sequence similarity;

6.3) Using the low-rank matrix learned for the disease-protein associations, compute the disease-target association scores of the test set.
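Steps 6.1) to 6.3) can be sketched for a single edge type. The example assumes the squared-error objective min over Z of the sum of (P[u,w] − V_u Z V_wᵀ)² on the training pairs, plus a small ridge penalty on Z; this is one common form of inductive matrix completion and is not confirmed by the source. The toy data, the helper names `balanced_samples` and `fit_imc`, and the `ridge` parameter are all hypothetical.

```python
import numpy as np


def balanced_samples(P, rng):
    """Return all positive pairs plus an equal number of random negative pairs."""
    pos = np.argwhere(P == 1)
    neg = np.argwhere(P == 0)
    neg = neg[rng.choice(len(neg), size=len(pos), replace=False)]
    pairs = np.vstack([pos, neg])
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    return pairs, labels


def fit_imc(P, train_pairs, Vu, Vw, ridge=1e-3):
    """Closed-form ridge least squares for Z, using the training pairs only."""
    u, w = train_pairs[:, 0], train_pairs[:, 1]
    # Row i holds the flattened outer product of Vu[u_i] and Vw[w_i].
    feats = np.einsum("ia,ib->iab", Vu[u], Vw[w]).reshape(len(u), -1)
    gram = feats.T @ feats + ridge * np.eye(feats.shape[1])
    z = np.linalg.solve(gram, feats.T @ P[u, w])
    return z.reshape(Vu.shape[1], Vw.shape[1])


rng = np.random.default_rng(0)
Vu = rng.normal(size=(12, 4))                  # toy disease representation vectors
Vw = rng.normal(size=(15, 4))                  # toy protein representation vectors
S = Vu @ rng.normal(size=(4, 4)) @ Vw.T
P = (S > np.quantile(S, 0.75)).astype(float)   # toy association matrix

pairs, labels = balanced_samples(P, rng)                   # step 6.1
folds = np.array_split(rng.permutation(len(pairs)), 10)    # 10-fold split
test_pairs = pairs[folds[0]]
train_pairs = pairs[np.concatenate(folds[1:])]             # test links withheld
Z = fit_imc(P, train_pairs, Vu, Vw)                        # step 6.2
scores = (Vu @ Z @ Vw.T)[test_pairs[:, 0], test_pairs[:, 1]]  # step 6.3
```

Because Z acts on node feature vectors rather than on node indices, a node with no known links can still be scored from its representation, which is the property behind the cold-start claim. In a full run the fit and scoring would be repeated with each of the 10 folds held out in turn.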

Compared with the prior art, the invention has the following beneficial effects:

First, by constructing meta-paths of different types and lengths, the invention combines the structural relationships between network nodes with semantic relationships such as biological pathways and pharmacological mechanisms. Second, the multi-head attention mechanism captures dependencies between network nodes at different distances, balancing local and global information. Third, the masked language model integrates the context of the semantic sequences, which further strengthens the network representation. In addition, using the inductive matrix completion model for link prediction addresses the cold-start problem faced by sparse networks.

Description of the drawings

Fig. 1 is a flowchart of the biological network link prediction method based on meta-paths and a bidirectional encoder;

Fig. 2 shows the prediction results of the biological network link prediction method based on meta-paths and a bidirectional encoder.

Detailed description

The invention is further described below with reference to the drawings.

Fig. 1 shows a flowchart of a biological network link prediction method based on meta-paths and a bidirectional encoder, as proposed by an embodiment of the invention.

Referring to Fig. 1, a biological network link prediction method based on meta-paths and a bidirectional encoder comprises the following steps:

4.1) All semantic sequences are tokenized, which includes removing special and redundant characters and splitting on spaces, and are then processed with the masked language model. Mask sequences are selected at random from all semantic sequences according to the mask ratio k. For each mask sequence, a random number rand ∈ [0,1] is generated. If rand < p, the sequence is replaced by [MASK], where p ∈ (0,1) is the probability that a mask sequence is replaced by [MASK]; if p ≤ rand < p+p′, a sequence chosen at random from the semantic sequences replaces the mask sequence, where p′ ∈ (0,1-p) is the probability that a mask sequence is replaced by another sequence; if p+p′ ≤ rand < 1, the mask sequence is left unchanged;

4.2) The initial representation vector and the position vector of each node are added together and fed into the multi-head attention mechanism to learn a vector, to which a residual connection and normalization are then applied.
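The data flow of step 4.2) corresponds to one standard Transformer encoder layer. The NumPy sketch below is a minimal illustration under stated assumptions: post-layer normalization without learnable gain or bias, a ReLU feed-forward network of width 4·dim, no bias terms, and random untrained weights. None of these details, nor the function names, come from the source.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def multi_head_attention(x, wq, wk, wv, wo, heads):
    seq_len, dim = x.shape
    d_head = dim // heads
    # Project, then split the feature axis into heads: (heads, seq_len, d_head).
    q, k, v = ((x @ w).reshape(seq_len, heads, d_head).transpose(1, 0, 2)
               for w in (wq, wk, wv))
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    merged = (weights @ v).transpose(1, 0, 2).reshape(seq_len, dim)
    return merged @ wo


def encoder_layer(x, w, heads):
    attn = multi_head_attention(x, w["q"], w["k"], w["v"], w["o"], heads)
    h = layer_norm(x + attn)                       # residual connection + normalization
    ffn = np.maximum(h @ w["f1"], 0.0) @ w["f2"]   # fully connected feed-forward network
    return layer_norm(h + ffn)                     # residual connection + normalization


rng = np.random.default_rng(0)
seq_len, dim, heads = 5, 8, 2
w = {name: rng.normal(scale=0.1, size=(dim, dim)) for name in ("q", "k", "v", "o")}
w["f1"] = rng.normal(scale=0.1, size=(dim, 4 * dim))
w["f2"] = rng.normal(scale=0.1, size=(4 * dim, dim))
token_emb = rng.normal(size=(seq_len, dim))   # initial representation vectors
pos_emb = rng.normal(size=(seq_len, dim))     # position vectors
out = encoder_layer(token_emb + pos_emb, w, heads)
```

Stacking n such layers and training them on the masked-sequence objective would give the deep bidirectional encoder; the rows of the final output serve as the low-dimensional node representations.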

The above describes only preferred embodiments of the invention. The protection scope of the invention is not limited to these embodiments, and all technical solutions that follow the idea of the invention fall within its protection scope. It should be noted that improvements and refinements made by those of ordinary skill in the art without departing from the principle of the invention are also to be regarded as falling within the protection scope of the invention.

Claims

1. A biological network link prediction method based on meta-paths and a bidirectional encoder, characterized by comprising the following steps:

1) Initializing parameters, including: the network sequence length l, the node degree threshold deg, the representation vector dimension dim, the number of Transformer encoder layers n, the mask sequence ratio k ∈ (0,1) of the language model, the probability p ∈ (0,1) that a mask sequence is replaced by the special token [MASK], and the probability p′ ∈ (0,1-p) that a mask sequence is replaced by another sequence from the semantic text;

2) Constructing a drug information network and meta-paths, realized by the following steps:

2.1) Constructing, from the public databases DrugBank, UniProt, HPRD, SIDER, CTD, NDFRT, and STRING, a drug information network comprising 4 node types (drugs, targets, diseases, and side effects) and 7 edge types, and deleting nodes with degree less than deg, wherein the 7 edge types comprise drug-drug interactions, drug-protein interactions, drug-disease associations, drug-side effect associations, protein-disease associations, drug-drug structural similarity, and protein-protein sequence similarity;

2.2) Constructing 23 meta-paths according to different biological pathways and drug mechanisms, namely: drug-protein, drug-protein-drug, drug-protein-protein, drug-protein-disease, drug-protein-protein-drug, drug-protein-protein-disease, drug-protein-drug-protein, drug-protein-drug-disease, drug-protein-drug-side effect, drug-protein-disease-protein, drug-protein-disease-drug, protein-drug-drug, protein-drug-protein, protein-drug-disease, protein-drug-side effect, protein-drug-drug-protein, protein-drug-drug-disease, protein-drug-drug-side effect, protein-drug-protein-protein, protein-drug-protein-disease, protein-drug-disease-protein, protein-drug-disease-drug, protein-drug-side effect-drug;

3) Numbering all nodes in the network as x_i ∈ {x_i | i = 1, 2, ..., num}, where num represents the total number of nodes, and sampling from each node x_i in turn according to the meta-paths of step 2);

4) Inputting all semantic sequences into a deep bidirectional Transformer encoder for representation learning to obtain low-dimensional representation vectors of the nodes, wherein each Transformer layer comprises the same multi-head self-attention mechanism and fully connected network;

5) Judging whether the maximum number of training iterations has been reached; if so, outputting the representation vector of each node and going to step 6); otherwise going to step 4);

6) Predicting disease-protein associations by the inductive matrix completion method, realized by the following steps:

6.1) Counting the number Ninter of disease-protein associations in the network, randomly selecting the same number Ninter of negative samples from the disease-protein association network, mixing the positive samples and the negative samples together, and performing 10-fold cross-validation;

6.2) Reconstructing the heterogeneous network based on the inductive matrix completion model after removing the network association information of the test set, wherein node link prediction is converted into an optimization problem in which r ranges over the 7 network edge types, P_r is the adjacency matrix of the corresponding single network, Z_r is the low-rank matrix to be solved for that single network, and V_u and V_w are the feature vectors of the nodes in the single network; the 7 network edge types include: drug-drug interactions, drug-protein interactions, drug-disease associations, drug-side effect associations, protein-disease associations, drug-drug structural similarity, and protein-protein sequence similarity;

6.3) Computing the disease-target association scores of the test set based on the low-rank matrix learned for the disease-protein associations;

7) In the same way as the disease-protein association prediction of step 6), predicting target-drug interactions by the inductive matrix completion method;

8) In the same way as the disease-protein association prediction of step 6), predicting drug-side effect associations by the inductive matrix completion method.

2. The biological network link prediction method based on meta-paths and a bidirectional encoder as claimed in claim 1, wherein step 4) is implemented by the following steps:

4.1) Tokenizing all semantic sequences, including removing special and redundant characters and splitting on spaces, and then processing the semantic sequences with the masked language model: randomly selecting mask sequences from all semantic sequences according to the mask ratio k and generating a random number rand ∈ [0,1] for each mask sequence; if rand < p, replacing the sequence with [MASK], where p ∈ (0,1) is the probability that a mask sequence is replaced by [MASK]; if p ≤ rand < p+p′, randomly selecting a sequence from the semantic sequences to replace the mask sequence, where p′ ∈ (0,1-p) is the probability that a mask sequence is replaced by another sequence; if p+p′ ≤ rand < 1, leaving the mask sequence unchanged;

4.2) Adding together the initial representation vector and the position vector of each node, inputting the result into the multi-head attention mechanism to learn a vector, and applying a residual connection and normalization; next, learning further with a fully connected feed-forward network, which is likewise followed by a residual connection and normalization; and finally obtaining the low-dimensional representation vector of the node.