CN112308326B - Biological network link prediction method based on meta-path and bidirectional encoder - Google Patents
- Publication number
- CN112308326B (application number CN202011226195.3A)
- Authority
- CN
- China
- Prior art keywords
- drug
- protein
- disease
- network
- mask
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Business, Economics & Management (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biophysics (AREA)
- General Physics & Mathematics (AREA)
- Analytical Chemistry (AREA)
- Spectroscopy & Molecular Physics (AREA)
- General Engineering & Computer Science (AREA)
- Strategic Management (AREA)
- Human Resources & Organizations (AREA)
- Molecular Biology (AREA)
- Economics (AREA)
- Artificial Intelligence (AREA)
- Medical Informatics (AREA)
- Chemical & Material Sciences (AREA)
- Computational Linguistics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Data Mining & Analysis (AREA)
- Marketing (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Development Economics (AREA)
- Computing Systems (AREA)
- Game Theory and Decision Science (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Entrepreneurship & Innovation (AREA)
- Operations Research (AREA)
- Quality & Reliability (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
Description
Technical field
The invention belongs to the field of computer science and relates to applications of artificial intelligence, specifically to a biological network link prediction method based on meta-paths and a bidirectional encoder.
Background
Given a set of biomedical entities and their known interactions, predicting other potential interactions (links) between those entities is one of the most important tasks in biomedicine. A growing number of researchers therefore use computational methods to predict potential interactions in various biomedical networks.
Traditional approaches in biomedicine have invested considerable effort in engineering biologically relevant features, such as chemical substructures, gene ontology, and topological similarity. Supervised learning methods and semi-supervised graph inference models have been used alongside them to predict potential interactions. These methods rest mainly on the similarity assumption: entities with similar biological or structural features are likely to have similar associations. Prediction methods based on biological features, however, usually face two problems: (1) extracting biological features is costly, and some features are hard to obtain at all; entities without features can be removed in preprocessing, but this typically shrinks the dataset and discards important information, so it is impractical in real applications; (2) biological features may not be precise enough to represent biomedical entities, so a stable and accurate model may not be obtainable.
Network representation methods, which learn low-dimensional vectors for network nodes automatically, are expected to solve both problems and are widely used in biological link prediction. For example, matrix factorization has been used to predict drug-disease associations, and some researchers have proposed manifold-regularized matrix factorization, which incorporates Laplacian regularization to learn better drug representations and thereby improves drug-drug interaction prediction. Random-walk-based and deep-neural-network-based network representation methods have also been proposed. However, existing methods either attend only to the structural features between network nodes and ignore the semantic information between network entities, or capture only short structures and meta-paths and so cannot mine the structural and semantic relationships between nodes in depth.
Summary of the invention
To overcome the shortcomings of the above techniques, the invention provides a biological network link prediction method based on meta-paths and a bidirectional encoder. First, a multi-source heterogeneous drug information network is constructed and multiple meta-paths are designed for sequence sampling, forming a large-scale semantic corpus. Second, a deep Transformer encoder is combined with a masked language model to build a deep bidirectional encoding representation model that extracts a low-dimensional representation vector for each node. Finally, inductive matrix completion is used for biological link prediction, including disease-protein associations, protein-drug interactions, and drug-side-effect associations, completing a drug development pipeline that runs from disease to target to drug to side effect.
The technical solution adopted by the invention is:
A biological network link prediction method based on meta-paths and a bidirectional encoder, comprising the following steps:
1) Initialize the parameters: the network sequence length l, the node degree threshold deg, the representation vector dimension dim, the number of Transformer encoder layers n, the mask sequence ratio of the language model k ∈ (0,1), the probability p ∈ (0,1) that a mask sequence is replaced by the special token [MASK], and the probability p′ ∈ (0,1-p) that a mask sequence is replaced by another sequence from the semantic text;
2) Construct the drug information network and the meta-paths;
3) Number all nodes in the network as x_i ∈ {x_i | i = 1, 2, ..., num}, where num is the total number of nodes, and sample from each node x_i in turn along the meta-paths of step 2);
4) Feed all semantic sequences into the deep bidirectional Transformer encoder for representation learning to obtain the low-dimensional representation vectors of the nodes, where every Transformer layer contains the same multi-head self-attention mechanism and fully connected network;
5) Check whether the maximum number of training iterations has been reached; if so, output the representation vector of each node and go to step 6); otherwise return to step 4);
6) Predict disease-protein associations using inductive matrix completion;
7) In the same way as the disease-protein association prediction of step 6), predict target-drug interactions using inductive matrix completion;
8) In the same way as the disease-protein association prediction of step 6), predict drug-side-effect associations using inductive matrix completion.
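The parameters of step 1) and their stated ranges can be collected in one validated object. The following is a minimal sketch, not the patent's implementation: the class name, field names, and the example values are hypothetical, and only the symbols (l, deg, dim, n, k, p, p′) and their ranges come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkPredictionConfig:
    """Parameters of step 1); field names are illustrative, symbols are from the text."""
    seq_len: int        # l: length of each sampled network sequence
    min_degree: int     # deg: nodes with a smaller degree are removed
    dim: int            # dim: dimension of the representation vectors
    num_layers: int     # n: number of Transformer encoder layers
    mask_ratio: float   # k, must lie in (0, 1)
    p_mask: float       # p, must lie in (0, 1)
    p_replace: float    # p', must lie in (0, 1 - p)

    def __post_init__(self) -> None:
        if min(self.seq_len, self.dim, self.num_layers) <= 0 or self.min_degree < 0:
            raise ValueError("l, dim and n must be positive; deg must be non-negative")
        if not 0.0 < self.mask_ratio < 1.0:
            raise ValueError("k must lie in (0, 1)")
        if not 0.0 < self.p_mask < 1.0:
            raise ValueError("p must lie in (0, 1)")
        if not 0.0 < self.p_replace < 1.0 - self.p_mask:
            raise ValueError("p' must lie in (0, 1 - p)")

    @property
    def p_keep(self) -> float:
        """Probability that a selected mask sequence is left unchanged: 1 - p - p'."""
        return 1.0 - self.p_mask - self.p_replace


# Example values are assumptions chosen for illustration, not taken from the patent.
cfg = LinkPredictionConfig(seq_len=4, min_degree=2, dim=64, num_layers=2,
                           mask_ratio=0.15, p_mask=0.8, p_replace=0.1)
```

Validating the ranges at construction time keeps an invalid combination such as p + p′ ≥ 1 from reaching the masking step.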
As a further improvement of the invention, step 2) is implemented through the following sub-steps:
2.1) Using the public databases DrugBank, UniProt, HPRD, SIDER, CTD, NDFRT, and STRING, construct a drug information network with 4 node types (drugs, targets, diseases, and side effects) and 7 edge types, and delete every node whose degree is less than deg. The 7 edge types are: drug-drug interaction, drug-protein interaction, drug-disease association, drug-side-effect association, protein-disease association, drug-drug structural similarity, and protein-protein sequence similarity;
2.2) Construct 23 meta-paths according to different biological pathways and drug mechanisms. The 11 meta-paths that start from a drug are: drug-protein, drug-protein-drug, drug-protein-protein, drug-protein-disease, drug-protein-protein-drug, drug-protein-protein-disease, drug-protein-drug-protein, drug-protein-drug-disease, drug-protein-drug-side effect, drug-protein-disease-protein, and drug-protein-disease-drug. The 12 meta-paths that start from a protein are: protein-drug-drug, protein-drug-protein, protein-drug-disease, protein-drug-side effect, protein-drug-drug-protein, protein-drug-drug-disease, protein-drug-drug-side effect, protein-drug-protein-protein, protein-drug-protein-disease, protein-drug-disease-protein, protein-drug-disease-drug, and protein-drug-side effect-drug.
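Sub-steps 2.1) and 2.2), together with the sampling of step 3), can be sketched on a toy network. This is an illustration under stated assumptions: the node names, edges, and function names are hypothetical, the degree filter is applied in a single pass, and the next node is chosen uniformly at random, none of which the text specifies.

```python
import random
from collections import defaultdict

# Toy heterogeneous network (hypothetical data): node -> type, plus undirected edges.
NODE_TYPE = {
    "d1": "drug", "d2": "drug",
    "p1": "protein", "p2": "protein",
    "s1": "disease", "e1": "side_effect",
}
EDGES = [("d1", "p1"), ("d2", "p1"), ("d2", "p2"), ("p1", "p2"),
         ("p2", "s1"), ("d1", "s1"), ("d2", "e1")]


def build_adjacency(edges, min_degree=1):
    """Adjacency lists after dropping nodes with degree < min_degree (one pass only)."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    keep = {node for node, d in degree.items() if d >= min_degree}
    adj = defaultdict(list)
    for u, v in edges:
        if u in keep and v in keep:
            adj[u].append(v)
            adj[v].append(u)
    return adj


def sample_metapath(adj, node_type, start, metapath, rng):
    """Walk from `start` through the node types in `metapath`; None if the walk dead-ends."""
    if node_type[start] != metapath[0]:
        return None
    walk = [start]
    for wanted in metapath[1:]:
        candidates = [v for v in adj.get(walk[-1], []) if node_type[v] == wanted]
        if not candidates:
            return None
        walk.append(rng.choice(candidates))  # uniform choice is an assumption
    return walk


rng = random.Random(0)
adj = build_adjacency(EDGES)
walk = sample_metapath(adj, NODE_TYPE, "d1",
                       ["drug", "protein", "protein", "disease"], rng)
sentence = " ".join(walk)  # one whitespace-separated semantic sequence
```

In this toy graph the drug-protein-protein-disease walk from `d1` has exactly one candidate at each hop, so the sampled sequence does not depend on the random seed.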
As a further improvement of the invention, step 4) is implemented through the following sub-steps:
4.1) Tokenize all semantic sequences, which includes removing special and redundant characters and splitting on whitespace. Then process the semantic sequences with the masked language model: randomly select mask sequences from all semantic sequences at the mask ratio k, and for each mask sequence generate a random number rand ∈ [0,1]. If rand < p, the sequence is replaced with [MASK], where p ∈ (0,1) is the probability that a mask sequence is replaced by [MASK]. If p ≤ rand < p+p′, a sequence chosen at random from the semantic sequences replaces the mask sequence, where p′ ∈ (0,1-p) is the probability that a mask sequence is replaced by another sequence. If p+p′ ≤ rand < 1, the mask sequence is left unchanged;
4.2) Sum the initial representation vector and the position vector of each node and feed the result into the multi-head attention mechanism to learn an attention output vector, then apply a residual connection and normalization to that output. Next, learn further with a fully connected feed-forward network, which is likewise followed by a residual connection and normalization. The final output is the low-dimensional representation vector of the node.
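The three-way replacement rule of sub-step 4.1) can be sketched as follows. The sketch treats each whitespace-separated token as one "mask sequence", rounds k times the sequence length to get the number of selected positions, and selects at least one position; these choices and the function name are assumptions, while the thresholds p and p+p′ follow the text.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, k, p, p_prime, rng):
    """Return (masked copy of tokens, selected positions) using the rule of step 4.1)."""
    if not tokens:
        return [], []
    n_select = max(1, round(k * len(tokens)))      # at least one position (assumption)
    positions = sorted(rng.sample(range(len(tokens)), n_select))
    out = list(tokens)
    for i in positions:
        r = rng.random()                            # uniform in [0, 1)
        if r < p:
            out[i] = MASK                           # rand < p: replace with [MASK]
        elif r < p + p_prime:
            out[i] = rng.choice(vocab)              # p <= rand < p + p': random token
        # otherwise (p + p' <= rand < 1) the token is kept unchanged
    return out, positions


rng = random.Random(42)
tokens = ["d1", "p1", "p2", "s1"] * 5
vocab = sorted(set(tokens))
masked, positions = mask_tokens(tokens, vocab, k=0.5, p=0.8, p_prime=0.1, rng=rng)
```

Returning the selected positions alongside the masked sequence lets a training loop compute the prediction loss only at those positions, which is the usual masked-language-model setup.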
As a further improvement of the invention, step 6) is implemented through the following sub-steps:
6.1) Count the number N_inter of disease-protein associations in the network, randomly select the same number N_inter of negative samples (disease-protein pairs with no known association) from the disease-protein association network, mix the positive and negative samples together, and perform 10-fold cross-validation;
6.2) Reconstruct the heterogeneous network with the inductive matrix completion model after removing the test-set association information from the network. Specifically, node link prediction is formulated as an optimization problem in which r ranges over the 7 types of network edge, P_r is the adjacency matrix of the single network for edge type r, Z_r is the low-rank matrix to be solved for that single network, and V_u and V_w are the feature vectors of the nodes in the single network. The 7 edge types are: drug-drug interaction, drug-protein interaction, drug-disease association, drug-side-effect association, protein-disease association, drug-drug structural similarity, and protein-protein sequence similarity;
6.3) Using the low-rank matrix trained for the disease-protein association, compute the disease-target association scores in the test set.
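Sub-steps 6.1) and 6.3) can be sketched in plain Python. The patent's exact objective for 6.2) is not reproduced in this text; a common inductive matrix completion form, given here only as an assumption, minimizes the sum over r of ||P_r - V_u Z_r V_w^T||_F^2 plus a regularizer on Z_r, and scores a pair by the bilinear form v_u^T Z_r v_w. All function names and the toy data below are hypothetical.

```python
import random


def balanced_samples(positives, rows, cols, rng):
    """Known pairs (label 1) plus an equal number of random unobserved pairs (label 0)."""
    known = set(positives)
    candidates = [(u, w) for u in rows for w in cols if (u, w) not in known]
    negatives = rng.sample(candidates, len(known))   # ValueError if too few candidates
    samples = ([(u, w, 1) for u, w in sorted(known)]
               + [(u, w, 0) for u, w in negatives])
    rng.shuffle(samples)
    return samples


def k_fold(samples, k=10):
    """Yield (train, test) splits; fold i holds every k-th sample starting at index i."""
    for i in range(k):
        test = samples[i::k]
        train = [s for j, s in enumerate(samples) if j % k != i]
        yield train, test


def bilinear_score(v_u, Z, v_w):
    """Association score v_u^T Z v_w for one pair of node feature vectors (assumed form)."""
    return sum(v_u[a] * Z[a][b] * v_w[b]
               for a in range(len(v_u)) for b in range(len(v_w)))


rng = random.Random(0)
diseases = [f"dis{i}" for i in range(5)]
proteins = [f"p{j}" for j in range(6)]
positives = [(diseases[i % 5], proteins[i % 6]) for i in range(10)]
samples = balanced_samples(positives, diseases, proteins, rng)
folds = list(k_fold(samples, k=10))
```

In a full pipeline, the associations in each test fold would be removed from the network before Z_r is fitted, as sub-step 6.2) requires, and `bilinear_score` would then rank the held-out pairs.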
Compared with the prior art, the beneficial effects of the invention are:
First, by constructing meta-paths of different types and lengths, the invention integrates the structural relationships between network nodes with semantic relationships such as biological pathways and pharmacological mechanisms. Second, the multi-head attention mechanism captures dependencies between network nodes at different distances, balancing local and global information. Third, the masked language model incorporates the context of the semantic sequences, which further strengthens the network representation. In addition, using the inductive matrix completion model for link prediction addresses the cold-start problem faced by sparse networks.
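The encoder layer of sub-step 4.2), whose multi-head attention is credited with capturing dependencies at different distances, can be written in the standard Transformer encoder form. The equations below are that standard form, offered as an assumption; the symbols X (initial representation vectors), PE (position vectors), H, A, h, and the projection matrices are introduced here for illustration and are not taken from the patent, while n is the number of layers from step 1).

```latex
\begin{aligned}
H^{(0)} &= X + \mathrm{PE}, \\
A^{(t)} &= \mathrm{LayerNorm}\big(H^{(t-1)} + \mathrm{MultiHead}(H^{(t-1)})\big), \\
H^{(t)} &= \mathrm{LayerNorm}\big(A^{(t)} + \mathrm{FFN}(A^{(t)})\big), \qquad t = 1, \dots, n, \\
\mathrm{MultiHead}(H) &= [\mathrm{head}_1; \dots; \mathrm{head}_h]\, W^{O}, \\
\mathrm{head}_j &= \mathrm{softmax}\!\left(\frac{(H W_j^{Q})(H W_j^{K})^{\top}}{\sqrt{d_k}}\right) H W_j^{V}.
\end{aligned}
```

Because the softmax runs over every position of the sequence, each node in a sampled meta-path sequence can attend to nodes both before and after it, which is what makes the encoder bidirectional.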
Description of the drawings
Figure 1 is a flowchart of the biological network link prediction method based on meta-paths and a bidirectional encoder;
Figure 2 shows the prediction results of the biological network link prediction method based on meta-paths and a bidirectional encoder.
Detailed description
The invention is further described below with reference to the accompanying drawings.
Figure 1 shows a flowchart of the biological network link prediction method based on meta-paths and a bidirectional encoder proposed by an embodiment of the invention.
Referring to Figure 1, the biological network link prediction method of this embodiment comprises steps 1) to 8), with sub-steps 2.1) to 2.2), 4.1) to 4.2), and 6.1) to 6.3), exactly as set out above.
The above is only a preferred embodiment of the invention. The scope of protection of the invention is not limited to this embodiment: all technical solutions that follow the idea of the invention fall within its scope of protection. It should be noted that improvements and refinements made by those of ordinary skill in the art without departing from the principle of the invention are also regarded as falling within the scope of protection of the invention.
Claims (2)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011226195.3A CN112308326B (en) | 2020-11-05 | 2020-11-05 | Biological network link prediction method based on meta-path and bidirectional encoder |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011226195.3A CN112308326B (en) | 2020-11-05 | 2020-11-05 | Biological network link prediction method based on meta-path and bidirectional encoder |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN112308326A (en) | 2021-02-02 |
| CN112308326B (en) | 2022-12-13 |
Family
ID=74326187
Family Applications (1)
| Application Number | Status | Priority Date | Filing Date | Title |
|---|---|---|---|---|
| CN202011226195.3A CN112308326B (en) | Active | 2020-11-05 | 2020-11-05 | Biological network link prediction method based on meta-path and bidirectional encoder |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN112308326B (en) |
Families Citing this family (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113327644B (en) * | 2021-04-09 | 2024-05-14 | 中山大学 | Drug-target interaction prediction method based on deep embedding learning of graph and sequence |
| CN113160894B (en) * | 2021-04-23 | 2023-10-24 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for predicting interaction between medicine and target |
| CN113223655B (en) * | 2021-05-07 | 2023-05-12 | 西安电子科技大学 | Drug-disease association prediction method based on variation self-encoder |
| CN113611356B (en) * | 2021-07-29 | 2023-04-07 | 湖南大学 | Drug relocation prediction method based on self-supervision graph representation learning |
| CN114334038B (en) * | 2021-12-31 | 2024-05-14 | 杭州师范大学 | Disease medicine prediction method based on heterogeneous network embedded model |
| CN114648172A (en) * | 2022-04-09 | 2022-06-21 | 北京工业大学 | Dynamic heterogeneous network link prediction method based on Transformer |
| CN115440297B (en) * | 2022-09-05 | 2025-08-29 | 湖南大学 | A drug and target prediction method based on graph attribute neural network |
| CN115512761A (en) * | 2022-10-08 | 2022-12-23 | 华中师范大学 | Meta-pathway based drug-drug interaction prediction framework |
| CN115512857B (en) * | 2022-10-08 | 2025-07-25 | 华中师范大学 | Medicine side effect prediction model based on meta-path diagram neural network |
| CN116504331B (en) * | 2023-04-28 | 2024-07-26 | 东北林业大学 | Frequency score prediction method for drug side effects based on multimodality and multitask |
| CN116646001B (en) * | 2023-06-05 | 2024-05-24 | 兰州大学 | A method for predicting drug target binding based on a joint cross-domain attention model |
| CN118888165B (en) * | 2024-06-18 | 2026-03-24 | 安徽医科大学 | A Drug Interaction Prediction Method and System Based on Metapath Length and Type |
| CN120874855B (en) * | 2025-09-29 | 2026-01-23 | 中铁建网络信息科技有限公司 | Enterprise information dynamic modeling method based on multidimensional data driving |
Family Cites Families (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7998708B2 (en) * | 2006-03-24 | 2011-08-16 | Handylab, Inc. | Microfluidic system for amplifying and detecting polynucleotides in parallel |
| CN102298674B (en) * | 2010-06-25 | 2014-03-26 | 清华大学 | Method for determining medicament target and/or medicament function based on protein network |
| US20170281784A1 (en) * | 2016-04-05 | 2017-10-05 | Arvinas, Inc. | Protein-protein interaction inducing technology |
| CN109783618B (en) * | 2018-12-11 | 2021-01-19 | 北京大学 | Attention mechanism neural network-based drug entity relationship extraction method and system |
| CN111667884B (en) * | 2020-06-12 | 2022-09-09 | 天津大学 | A Convolutional Neural Network Model for Predicting Protein Interactions Using Protein Primary Sequences Based on Attention Mechanism |
| CN111785320B (en) * | 2020-06-28 | 2024-02-06 | 西安电子科技大学 | Drug target interaction prediction method based on multi-layer network representation learning |
| CN111814460B (en) * | 2020-07-06 | 2021-02-09 | 四川大学 | External knowledge-based drug interaction relation extraction method and system |
Also Published As
| Publication number | Publication date |
|---|---|
| CN112308326A (en) | 2021-02-02 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN112308326B (en) | Biological network link prediction method based on meta-path and bidirectional encoder | |
| Hu et al. | Unsupervised contrastive cross-modal hashing | |
| CN111382272B (en) | Electronic medical record ICD automatic coding method based on knowledge graph | |
| Liang et al. | Symbolic graph reasoning meets convolutions | |
| CN111859935B (en) | A literature-based method for building a database of cancer-related biomedical events | |
| CN111079409B (en) | Emotion classification method utilizing context and aspect memory information | |
| Zhu et al. | Graph structure enhanced pre-training language model for knowledge graph completion | |
| Zhang et al. | Motif-driven contrastive learning of graph representations | |
| CN114841261B (en) | Incremental width and depth learning drug response prediction methods, media and devices | |
| CN116168775B (en) | Methods, storage media, and chips for training and applying molecular multimodal models | |
| CN116128024A (en) | Multi-perspective comparative self-supervised attribute network outlier detection method | |
| CN114842983B (en) | Anticancer drug response prediction method and device based on tumor cell line self-supervision learning | |
| CN111476038A (en) | Long text generation method and device, computer equipment and storage medium | |
| CN115762706A (en) | A drug representation method and storage medium based on deep learning | |
| Chen et al. | Bilinear joint learning of word and entity embeddings for entity linking | |
| CN114138971A (en) | Genetic algorithm-based maximum multi-label classification method | |
| CN115019878B (en) | A drug discovery method based on graph representation and deep learning | |
| CN116226404A (en) | A knowledge map construction method and knowledge map system for the gut-brain axis | |
| Alotaibi et al. | Computational linguistics based text emotion analysis using enhanced beetle antenna search with deep learning during COVID-19 pandemic | |
| CN112148891A (en) | Knowledge graph completion method based on graph perception tensor decomposition | |
| Zhen et al. | Frequent words and syntactic context integrated biomedical discontinuous named entity recognition method: Y. Zhen et al. | |
| CN116486936B (en) | Methods, apparatus, equipment and media for training drug target interaction prediction models | |
| CN113515947B (en) | A training method for cascaded place name entity recognition model | |
| Shao et al. | TBPM-DDIE: Transformer based pretrained method for predicting drug-drug interactions events | |
| Yan et al. | A review and outlook for relation extraction |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||














