CN112579063B - Acceleration method for exploring optimization space in deep learning compiler - Google Patents

Acceleration method for exploring optimization space in deep learning compiler Download PDF

Info

Publication number
CN112579063B
CN112579063B CN202110223874.3A CN202110223874A CN112579063B CN 112579063 B CN112579063 B CN 112579063B CN 202110223874 A CN202110223874 A CN 202110223874A CN 112579063 B CN112579063 B CN 112579063B
Authority
CN
China
Prior art keywords
operator
optimization
graph
space
operators
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110223874.3A
Other languages
Chinese (zh)
Other versions
CN112579063A (en
Inventor
潘秋红
何水兵
陈刚
杨弢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202110223874.3A priority Critical patent/CN112579063B/en
Publication of CN112579063A publication Critical patent/CN112579063A/en
Application granted granted Critical
Publication of CN112579063B publication Critical patent/CN112579063B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/30Creation or generation of source code
    • G06F8/37Compiler construction; Parser generation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an acceleration method for exploring an optimized space in a deep learning compiler, aiming at optimizing a neural network effect through a compiling technology and greatly reducing the time consumption of the compiler for exploring an operator optimized space. The method first abstracts the neural network into the form of a computational graph. And secondly, carrying out graph optimization on the calculation graph, and defining an optimization space for each operator in the optimized calculation graph. And then, based on an operator containing optimized spatial information, providing an optimized spatial similarity calculation method. And finally, providing an operator state space exploration method based on similarity, clustering operators based on the similarity, carrying out full space exploration on the core operator in each cluster, exploring the rest operators of the same class in the optimal scheme of the core operator, and determining the optimal scheme of each operator of the whole neural network.

Description

Acceleration method for exploring optimization space in deep learning compiler
Technical Field
The invention relates to the application field of deep learning, compiling technology and high-performance computing cross technology, in particular to an acceleration method for exploring an optimized space in a deep learning compiler.
Background
Today, Deep Neural Networks (DNNs) have found wide application in the fields of image classification, natural language processing, autopilot, augmented reality, and other AI. Particularly, with the rapid development of computing devices, such as GPU, FPGA and specially designed neural network accelerator, the computing power of DNN is becoming more powerful, and the demand for efficient DNN in the field of artificial intelligence is also becoming stronger, so how to improve the operating efficiency of DNN is a very important research problem in recent years.
Now, there are many deep learning frameworks such as TensorFlow, PyTorch, Caffe, MXNet, etc. which can represent neural networks in the form of computational graphs, perform graph-level optimization on the computational graphs, and then map operators in DNN to third-party acceleration libraries such as CuDNN and MKL-DNN to obtain efficient DNN operation effects. However, the optimization at the graph level is generally independent of hardware, and cannot obtain finer-grained optimization effect according to hardware characteristics. Furthermore, the third party acceleration libraries that are relied upon are generally non-open source, which prevents programmers from having effective control and from easily porting DNNs across hardware devices. In addition, for operators not supported by the third-party library, optimization cannot be achieved, or a programmer is required to spend a lot of work to perform manual tuning.
In the research aiming at DNN acceleration, neural networks under various different frameworks are mapped to various hardware platforms through a compiling technology, the neural networks are accelerated in the mapping process, and the method for generating the optimized target platform codes achieves remarkable effects. A relatively efficient neural network compiler comprises the following execution flows: the neural network under various deep learning frameworks is firstly expressed into a computational graph through a high-level intermediate language, and graph-level optimization is carried out on the computational graph. The optimized computation graph is then converted to a low-level intermediate language representation and operator-level optimized. And finally, generating corresponding optimized codes according to the target hardware platform.
When the neural network is optimized at an operator level, the optimization space of each operator is defined in advance, and then the optimization space is explored by adopting a machine learning method to obtain the best optimization scheme. The optimization space of each operator is very large, for example, hundreds of millions of optimization schemes are possible for a Conv operator, so that the exploration of the optimization space of the operators is time-consuming, for example, a Yolo network needs more than one day to explore the optimization schemes.
Disclosure of Invention
In order to solve the defects of the prior art and achieve the purpose of greatly reducing the time consumption of an operator exploration optimization space of a compiler under the sacrifice of increasing the acceptable deep learning network reasoning time, the invention adopts the following technical scheme:
an acceleration method for exploring an optimized space in a deep learning compiler, comprising the steps of:
s1, abstracting the neural network and representing the neural network in the form of a calculation graph;
s2, carrying out graph optimization on the calculation graph;
s3, defining an optimization space for each operator in the optimized calculation graph, and performing optimization space similarity calculation based on the operator containing optimization space information;
s4, searching operator state space based on similarity, clustering operators based on similarity, searching full space of core operators in each cluster, searching other operators of the same class in the optimal scheme of the core operators, determining the optimal scheme of each operator of the whole neural network, and generating target platform codes according to a hardware platform, comprising the following steps:
s41, calculating a similarity matrix of the operators;
s42, the similarity matrix is used as input to execute AP clustering, the AP clustering algorithm does not need to determine the clustering number in advance, the center of each cluster after clustering is an input operator, and an operator does not need to be selected for each cluster again to serve as a core;
s43, for each clustered core operator, searching the complete optimization space of the operator, and storing n optimization schemes with shortest inference time in the searching process;
s44, for each non-core operator of each cluster, only n optimal schemes searched by traversing the core operators depended on by the non-core operators are needed;
and S45, generating a target platform code for each operator according to the optimization scheme, and deploying the operator codes to hardware to run a neural network according to the sequence in the calculation diagram.
Therefore, the time consumption of the compiler for exploring an operator optimization space is greatly reduced under the sacrifice of the increase of the acceptable deep learning network reasoning time.
Further, the neural network computational graph representation method in the step S1 includes the following steps:
s11, mapping the neural network constructed in the deep learning framework to a well-defined high-level intermediate language HIR;
and S12, analyzing the attribute of each operator based on the high-level intermediate language, and constructing a computation graph according to the data dependence relationship among the operators, wherein the constructed computation graph is a directed acyclic graph, each node in the graph represents one operator in the neural network, and edges in the graph represent the data dependence among the operators. A high-level intermediate language (HIR) is realized, the HIR is a specific domain language (DSL) and can represent a neural network computing and control flow, and a neural network in TensorFlow, PyTorch or ONNX format is mapped onto the HIR and represented by the HIR.
Further, the graph optimization method based on the computation graph in step S2 includes the following steps:
s21, performing operator fusion according to the calculation type of the operator, wherein the operator fusion is to combine a plurality of basic operators into a composite operator without storing intermediate results, thereby reducing unnecessary memory read-write time and improving cache locality;
s22, performing data layout optimization on the calculation graph after operator fusion according to hardware characteristics;
and S23, merging parallel operators for the calculation graph after the data layout optimization.
Further, the specific content of step S21 includes: firstly, constructing a domination tree, then traversing nodes in the domination tree, and merging the operators into a new compound operator if the nodes from one node to the domination node meet a predefined merging rule.
Further, the specific content of step S22 includes: in the calculation graph after operator fusion, whether a specified data layout scheme exists when the calculation graph is input is judged, if yes, the specified scheme is directly applied, and if not, the optimal data layout scheme is selected according to hardware characteristics, wherein the data layout scheme comprises row-first storage or column-first storage.
Further, the specific content of step S23 includes: the method has the advantages that a plurality of operators sharing the same input are combined into a larger operator, a larger kernel module is generated for the GPU, the GPU kernel starting expense is reduced, and the GPU utilization rate is improved.
Further, in step S3, the graph-optimized computation graph is mapped onto the LIR, represented by using the LIR, and an optimization space for each operator is defined. A low-level intermediate language LIR is realized, the LIR is a finer-grained intermediate language form, and operator-level optimization and target platform code generation can be performed based on the LIR.
Further, the operator performs tiling of multiple dimensions and multiple cyclic expansion optimization, the dimensions of the operator are expanded, the original length of the dimension is l, the dimension is tiled into m dimensions, k tiling schemes are provided in total, and by analogy, the number of possible selection schemes for each optimization operation is k, and the optimization space of the whole operator is the product of the number of all optimization operation schemes.
Further, in step S3, the method for calculating the optimized spatial similarity includes the following steps:
s31, defining a hash method for each optimization operation according to the optimization space attribute of the operator, and abstracting the optimization operation into a hash value;
s32, vectorizing each pair of operators, sequentially superposing Hash values as vector values according to an optimization operation sequence, and converting the vectors of the two operators into equal length through 0 filling operation; sequentially splicing the vector values of each optimized operation in the space to form a vector value of the whole space;
and S33, performing similarity calculation on the vector values of the pair of operators after splicing.
Furthermore, the similarity calculation is to take a cosine value as the similarity of the pair of operators for the pair of vector values after the concatenation. The classification result using the cosine value as the similarity measure is optimal.
The invention has the advantages and beneficial effects that:
the neural network generated by various deep learning frames is mapped to a uniform intermediate language, codes of various hardware platforms can be generated, the expenditure of programmers caused by model conversion due to different development frames is saved, and the capability of deploying the neural network across hardware equipment is realized; the front end optimizes the neural network at a graph level, the rear end optimizes the neural network at an operator level, the whole optimization process is automatically carried out, efficient optimized codes can be generated for a hardware platform, and a programmer does not need to spend a large amount of time and energy to carry out manual optimization; and when the back end carries out operator optimization, an optimization space exploration scheme based on clustering is executed, so that the time consumption generated by exploring the optimization space can be greatly reduced.
Drawings
FIG. 1 is a flow chart of an acceleration method for exploring an optimization space in a deep learning compiler according to the present invention.
FIG. 2 is a schematic diagram of the Conv-BN-Relu module calculation in the present invention.
FIG. 3 is a schematic diagram of pre-optimization calculations in the present invention.
FIG. 4 is a schematic diagram of calculation after optimization of an operator in the invention.
FIG. 5 is a schematic diagram of the parallel Conv operator optimized calculation in the invention.
FIG. 6 is a schematic diagram of an operator optimization space in the invention.
FIG. 7 is a schematic diagram of the operator optimization space vectorization in the invention.
FIG. 8 is a flow chart of neural network operator level optimization in the invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
As shown in FIG. 1, an accelerated method for exploring optimization space in deep learning compiler aims to greatly reduce the time consumption of the compiler for exploring the operator optimization space at the sacrifice of acceptable increase of deep learning network inference time. The method first abstracts the neural network into the form of a computational graph. And secondly, carrying out graph optimization on the calculation graph, and defining an optimization space for each operator in the optimized calculation graph. And then, based on an operator containing optimized spatial information, providing an optimized spatial similarity calculation method. And finally, providing an operator state space exploration method based on similarity, clustering operators based on the similarity, carrying out full space exploration on a core operator in each cluster, exploring the rest operators of the same class in an optimal scheme of the core operator, determining an optimal scheme of each operator of the whole neural network, and generating a target platform code according to a hardware platform.
The method provided by the invention comprises a front end and a back end, wherein the front end takes a model generated by a deep learning framework as input, abstracts the model into a calculation graph expressed by a high-level intermediate language and performs graph optimization. The back end takes the calculation graph after the front end optimization as input, the calculation graph is expressed by a low-level intermediate language, then operator optimization is carried out, the exploration process is accelerated in the optimization process, and finally target platform codes are generated according to a hardware platform.
The specific implementation mode of the invention is as follows:
1) the model generated by the deep learning framework is represented as a computational graph.
1.1) the method realizes a high-level intermediate language HIR, which is a specific field language (DSL) and can represent a neural network computing and control flow.
1.2) mapping the neural network in TensorFlow, PyTorch or ONNX format onto the HIR, and expressing by the HIR.
1.3) constructing a computational graph based on the converted HIR, wherein the computational graph is a directed acyclic graph, each node in the graph represents one operator in the neural network, and edges in the graph represent data dependence among the operators. The computational graph establishes the dependency relationship between control flow and operators and data, and provides an interface for graph-level optimization. Fig. 2 shows a calculation graph generated by a simple Conv-BN-Relu module in a neural network, where each rounded rectangle in the graph represents an operator node, in this example, three nodes are included, each edge in the graph represents a data dependency between operators, for example, the data of the Conv operator dependency is input data and weight W1 data, and the BN operator dependency is a calculation result of a preceding Conv operator.
2) And optimizing the neural network at the graph level based on the computational graph.
2.1) carrying out operator fusion optimization on the calculation graph. The operator fusion is an optimization technology which combines a plurality of basic operators into a composite operator, does not need to store intermediate results, reduces unnecessary memory reading and writing and improves the cache locality. For a given computational graph, a domination tree is constructed first, then nodes in the domination tree are traversed, and if the nodes from one node to its domination node satisfy a predefined fusion rule, the operators are fused into a new compound operator. As shown in fig. 3, it is a calculation graph before operator fusion optimization, and a dominance tree is first calculated for it, for example, the dominance node of node 2 is node 1, and the dominance node of node 3 is node 2. And then traversing the dominating tree, performing rule matching, and fusing to form a new node 1 when the dominating node of the node 2 meets the fusion condition to the dominating node 1 thereof, wherein the dominating node of the node 3 becomes the fused node 1 and meets the new fusion condition with the node 1. Based on this, the Conv-BN-Relu three nodes formed by the nodes 1, 2 and 3 can be fused into a new node, which is denoted as CBR, and the rule is applied to the whole calculation graph, and is fused into the form shown in FIG. 4.
2.2) performing data layout optimization on the calculation graph subjected to the operator fusion optimization. Each operator in the computation graph may be stored in the physical device in a number of ways. For the calculation graph after operator fusion optimization, firstly, judging whether a specified data layout scheme exists when the calculation graph is input, and if the calculation graph exists, directly applying the scheme. When the data layout scheme is not specified, an optimal data layout scheme is selected according to hardware characteristics, such as row-first storage or column-first storage. The most basic is whether the exploration data should be stored in the format of NHWC or NCHW.
2.3) carrying out parallel Conv operator combination on the calculation graph subjected to the data layout optimization. If a plurality of Conv operators sharing the same input exist in the calculation graph, the Conv operators are combined into a larger Conv operator, a larger kernel module is generated for the GPU, the GPU kernel starting expense is reduced, and the GPU utilization rate is improved. For example, for 3 CBR operators of 1x1 in fig. 4, they accept the same input and do the same, so they can be merged to form a larger 1x1 CBR operator, as shown in fig. 5.
3) And calculating the optimized spatial similarity of the neural network operator.
3.1) the method realizes a low-level intermediate language LIR, wherein the LIR is a finer-grained intermediate language form, and operator-level optimization and target platform code generation can be carried out based on the LIR.
3.2) mapping the calculation graph subjected to graph optimization onto the LIR, representing by using the LIR, and defining an optimization space of each operator. As shown in FIG. 6, the Conv operator for the NCHW layout is shown on the left side of the diagram and defines the optimization space as shown on the right side of the diagram. The Conv operator can perform 6-dimensional tiling and two cyclic expansion optimizations, and we take the first optimization operation "tile _ f" as an example, which means that the operator "f" dimension is expanded, the original length of the dimension is 32, and the dimension is tiled into 4 dimensions, so that there are 56 tiling schemes in total. By analogy, the number of possible alternatives for each optimization operation is the value in the rightmost line, and the optimization space of the entire Conv operator is the product of the number of all the alternatives, which is about 1.3 hundred million.
The tiling operation is to divide a certain dimension in the original space into m dimensions, for example, tile the dimension x in the original space to [ x1, x2, x3, x4 ]. In the above example, a dimension f with a length of 32 in the original space is selected and tiled into 4 dimensions, where the number of dimensions after tiling is set to 4, and when it is determined that the original space is to be tiled into 4 dimensions, it is equivalent to divide the length 32 into 4 layers of cycles, and there are several possible divisions of [ (1 × 32), (1 × 2 × 16), (1 × 4 × 8), (1 × 2 × 8), (1 × 2 × 4), (2 × 4) ], and there are [4, 12, 12, 12, 4] tiling schemes, respectively, and there are 56 tiling schemes in total when they are added to each other.
3.3) calculating operator to optimize the spatial similarity. Firstly, defining a Hash method for each optimization operation according to the optimization space attribute of an operator, and abstracting the optimization operation into a Hash value. And then vectorizing each pair of operators, sequentially superposing the hash values as vector values according to the optimization operation sequence, and converting the vectors of the two operators into equal length through 0 filling operation. As shown in fig. 7, an example of calculating a pair of operator optimized space vector values is given, for two space 1 and space 2 with the same optimization operation sequence, a set of optimization operations tile _ rc 1 and tile _ rc 2 in the space are sequentially selected, hash values of the optimization operations are selected for vectorization, after vectorization, the vector value length of tile _ rc 2 is smaller than tile _ rc 1, 0 is filled at the end of tile _ rc 2 vector to expand to be as long as the vector value of tile _ rc 1, and the calculated vector values are vec 1 and vec 2 respectively. By analogy, vector values of each optimization operation in the space are sequentially spliced to form vector values of the whole space. And finally, measuring the cosine value of the pair of vectors as the similarity of the pair of operators, wherein the search space similarity measurement is tried to be carried out by using the similarity of Jaccard and the like, but the classification result is not as good as the cosine value, so the cosine value is finally adopted as a similarity measurement index.
4) And performing operator level optimization on the neural network, and generating target platform codes. The execution flow is shown in fig. 8.
4.1) calculating the similarity matrix of the operators.
4.2) performing AP Clustering (Affinity probability Clustering) by taking the similarity matrix as input, wherein the AP Clustering algorithm does not need to determine the number of clusters in advance, the center of each cluster after Clustering is an input operator, and a core operator does not need to be calculated for each cluster again.
4.3) for each core operator of each cluster, exploring the complete optimization space of the operator, and storing n optimization schemes with the shortest inference time in the exploration process. For the Yolo v3-tiny model, for example, the clustering algorithm can divide the original 20 operators into 8 classes. Then we only need to do complete optimization space exploration to the 8 clustered core operators and save the 10 optimization schemes with the minimum inference time of each operator in the exploration process.
4.4) for the non-core operators in each cluster, n optimal schemes need to be explored by traversing the dependent core operators. For example, for a Yolo v3-tiny model, for 12 operators which are not the cluster center, we only need to search in 10 optimal optimization schemes of the core operator of the class where each operator is located.
And 4.5) generating target platform codes for each operator according to the optimization scheme, and deploying the operator codes to hardware to run a neural network according to the sequence in the computational graph. For codes needing to be deployed on a CPU, a third-party tool LLVM is called to generate corresponding C codes, and for an Nvidia GPU, corresponding CUDA codes are generated and then deployed to the GPU to run.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1.一种用于深度学习编译器中探索优化空间的加速方法,其特征在于包括如下步骤:1. an acceleration method for exploring optimization space in a deep learning compiler, it is characterized in that comprising the steps: S1,对神经网络进行抽象,将其表示为计算图的形式;S1, abstract the neural network and express it in the form of a computational graph; S2,对计算图进行图优化;S2, perform graph optimization on the computational graph; S3,为优化后计算图中的每个算子定义优化空间,基于包含优化空间信息的算子,进行优化空间相似度计算;S3, define an optimization space for each operator in the optimized calculation graph, and perform the optimization space similarity calculation based on the operator containing the optimization space information; S4,基于相似度的算子状态空间探索,基于相似度对算子进行聚类,对每一个聚类中的核心算子进行全空间探索,同类的其余算子在核心算子最优方案中进行探索,确定整个神经网络每个算子的优化方案,并根据硬件平台产生目标平台代码,包括如下步骤:S4, operator state space exploration based on similarity, cluster operators based on similarity, and perform full space exploration for the core operators in each cluster, and the rest of the operators of the same type are in the optimal scheme of core operators Carry out exploration, determine the optimization scheme of each operator of the entire neural network, and generate the target platform code according to the hardware platform, including the following steps: S41,计算算子的相似度矩阵;S41, calculate the similarity matrix of the operator; S42,将相似度矩阵作为输入执行AP聚类,聚类后每个类的中心就是输入的一个算子;S42, perform AP clustering with the similarity matrix as input, and the center of each class after clustering is an input operator; S43,对于每个聚类的核心算子,探索该算子的完整优化空间,并在探索过程中保存推理时间最短的n个优化方案;S43, for the core operator of each cluster, explore the complete optimization space of the operator, and save n optimization schemes with the shortest inference time during the exploration process; S44,对于每个聚类的非核心算子,遍历其依赖的核心算子探索出的n个最优方案;S44, for each clustered non-core operator, traverse the n optimal solutions explored by its dependent core operators; S45,根据优化方案为每个算子生成目标平台代码,按照计算图中的顺序将算子代码部署到硬件上运行神经网络。S45, generate target platform code for each operator according to the optimization scheme, and deploy the operator code to the hardware to run the neural network according to the sequence in the calculation diagram. 2.根据权利要求1所述的一种用于深度学习编译器中探索优化空间的加速方法,其特征在于所述步骤S1中神经网络计算图表示方法,包括如下步骤:2. a kind of acceleration method for exploring optimization space in deep learning compiler according to claim 1, is characterized in that in described step S1, neural network computation graph representation method, comprises the steps: S11,将深度学习框架中构造的神经网络映射到定义好的高级中间语言HIR上;S11, map the neural network constructed in the deep learning framework to the defined high-level intermediate language HIR; S12,基于高级中间语言分析每个算子的属性,并根据算子间的数据依赖关系构造计算图,构造的计算图是一个有向无环图,图中每个节点代表神经网络中的一个算子,图中的边表示算子间的数据依赖。S12, analyze the attributes of each operator based on the high-level intermediate language, and construct a calculation graph according to the data dependencies between the operators. The constructed calculation graph is a directed acyclic graph, and each node in the graph represents one of the neural networks. Operators, the edges in the graph represent data dependencies between operators. 3.根据权利要求1所述的一种用于深度学习编译器中探索优化空间的加速方法,其特征在于所述步骤S2中基于计算图的图优化方法,包括如下步骤:3. a kind of acceleration method for exploring optimization space in deep learning compiler according to claim 1, it is characterized in that the graph optimization method based on computation graph in described step S2, comprises the steps: S21,根据算子的计算类型进行算子融合,所述算子融合是将多个基本的算子组合为一个复合算子,不需存储中间结果;S21, perform operator fusion according to the calculation type of the operator, where the operator fusion is to combine a plurality of basic operators into a composite operator without storing intermediate results; S22,对算子融合后的计算图,根据硬件特性,进行数据布局优化;S22, optimize the data layout according to the hardware characteristics of the computation graph after operator fusion; S23,对数据布局优化后的计算图,进行并行算子合并。S23, perform parallel operator merging on the calculation graph after the data layout has been optimized. 4.根据权利要求3所述的一种用于深度学习编译器中探索优化空间的加速方法,其特征在于所述步骤S21的具体内容,包括:首先构造支配树,然后遍历支配树中的节点,如果一个节点到其支配节点的这些节点满足了预先定义好的融合规则,就将这些算子融合为新的复合算子。4. A kind of acceleration method for exploring optimization space in a deep learning compiler according to claim 3, it is characterized in that the specific content of described step S21, comprises: first constructing a domination tree, then traversing the nodes in the domination tree , if these nodes from a node to its dominant node satisfy the pre-defined fusion rules, these operators are fused into a new composite operator. 5.根据权利要求3所述的一种用于深度学习编译器中探索优化空间的加速方法,其特征在于所述步骤S22的具体内容,包括:对于算子融合后的计算图中,首先判断是否输入时有指定的数据布局方案,如果有则直接应用指定方案,如果没有则根据硬件特征选择最优的数据布局方案,包括行优先存储或者列优先存储。5. a kind of acceleration method for exploring optimization space in deep learning compiler according to claim 3, it is characterized in that the concrete content of described step S22, comprises: For the calculation graph after operator fusion, first judge Whether there is a specified data layout scheme when inputting, if so, the specified scheme is directly applied, if not, the optimal data layout scheme is selected according to the hardware characteristics, including row-first storage or column-first storage. 6.根据权利要求3所述的一种用于深度学习编译器中探索优化空间的加速方法,其特征在于所述步骤S23的具体内容,包括:将多个共享同一输入的算子合并为更大的算子。6. The acceleration method for exploring optimization space in a deep learning compiler according to claim 3, wherein the specific content of step S23 comprises: merging multiple operators sharing the same input into more big operator. 7.根据权利要求1所述的一种用于深度学习编译器中探索优化空间的加速方法,其特征在于所述步骤S3中,将图优化后的计算图映射到LIR上,使用LIR进行表示,并定义每个算子的优化空间。7. a kind of acceleration method for exploring optimization space in deep learning compiler according to claim 1, it is characterized in that in described step S3, the computation graph after graph optimization is mapped on LIR, uses LIR to express , and define the optimization space for each operator. 8.根据权利要求7所述的一种用于深度学习编译器中探索优化空间的加速方法,其特征在于所述算子进行多个维度的平铺以及多种循环展开优化,对所述算子的维度进行展开,所述维度原始长度为l,将其平铺为m个维度,则一共有k种平铺方案,以此类推,每一个优化操作选择方案数量为k,整个算子的优化空间就是所有优化操作方案数量的乘积。8. The acceleration method for exploring optimization space in a deep learning compiler according to claim 7, wherein the operator performs multiple dimensional tiling and multiple loop unrolling optimizations, The dimensions of the sub-dimension are expanded, the original length of the dimension is l, and if it is tiled into m dimensions, there are k tiling schemes in total, and so on. The optimization space is the product of the number of all optimization operations. 9.根据权利要求7所述的一种用于深度学习编译器中探索优化空间的加速方法,其特征在于所述步骤S3中,优化空间相似度计算方法,包括如下步骤:9. A kind of acceleration method for exploring optimization space in deep learning compiler according to claim 7, is characterized in that in described step S3, optimizing space similarity calculation method, comprises the steps: S31,根据算子的优化空间属性,为每种优化操作定义哈希化方法,将优化操作抽象为哈希值;S31, according to the optimization space attribute of the operator, define a hashing method for each optimization operation, and abstract the optimization operation into a hash value; S32,将每对算子进行向量化,按照优化操作顺序依次叠加哈希值作为向量值,并通过填0操作将两个算子的向量转换为等长;将空间中每个优化操作的向量值依次拼接,构成整体空间的向量值;S32, vectorize each pair of operators, superimpose the hash values as vector values in the order of optimization operations, and convert the vectors of the two operators into equal lengths by filling 0 operations; convert the vectors of each optimization operation in the space The values are spliced in sequence to form the vector value of the overall space; S33,对拼接后的这对算子的向量值进行相似度计算。S33, perform similarity calculation on the vector values of the pair of operators after splicing. 10.根据权利要求9所述的一种用于深度学习编译器中探索优化空间的加速方法,其特征在于所述相似度计算,是对拼接后的这对向量值取cosine值作为这对算子的相似度。10. A kind of acceleration method for exploring optimization space in deep learning compiler according to claim 9, it is characterized in that described similarity calculation is to take cosine value as this pair of vector values after splicing. child similarity.
CN202110223874.3A 2021-03-01 2021-03-01 Acceleration method for exploring optimization space in deep learning compiler Active CN112579063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110223874.3A CN112579063B (en) 2021-03-01 2021-03-01 Acceleration method for exploring optimization space in deep learning compiler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110223874.3A CN112579063B (en) 2021-03-01 2021-03-01 Acceleration method for exploring optimization space in deep learning compiler

Publications (2)

Publication Number Publication Date
CN112579063A CN112579063A (en) 2021-03-30
CN112579063B true CN112579063B (en) 2021-06-08

Family

ID=75114093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110223874.3A Active CN112579063B (en) 2021-03-01 2021-03-01 Acceleration method for exploring optimization space in deep learning compiler

Country Status (1)

Country Link
CN (1) CN112579063B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115169522B (en) * 2021-04-01 2025-08-29 浙江菜鸟供应链管理有限公司 Method, device and apparatus for determining graph structure of Pytorch neural network
CN113031966B (en) * 2021-05-20 2021-09-21 之江实验室 Deep learning compilation optimization method for intelligently selecting compilation acceleration library
CN113703768B (en) * 2021-07-13 2024-10-11 清华大学 Tensor program optimization method and device
CN113656563B (en) * 2021-07-15 2024-06-28 华为技术有限公司 A neural network search method and related equipment
CN113885871B (en) * 2021-09-13 2025-03-28 清华大学 Dedicated backend code generation method and device supporting machine learning training
CN115983356B (en) * 2021-10-14 2026-04-14 上海交通大学 Tensor calculation unit-oriented convolution operator optimization implementation method
CN113961267B (en) * 2021-10-15 2023-08-25 杭州海康威视数字技术股份有限公司 Service processing method, device and equipment
CN116011468A (en) * 2021-10-20 2023-04-25 珠海金山办公软件有限公司 Reasoning method of deep learning model, machine translation method and device
CN113703741B (en) * 2021-10-29 2022-02-22 深圳思谋信息科技有限公司 Neural network compiler configuration method and device, computer equipment and storage medium
CN114219081A (en) * 2021-12-19 2022-03-22 复旦大学 Neural network precompilation algorithm for dedicated accelerator
CN114418114B (en) * 2021-12-30 2025-06-03 深圳云天励飞技术股份有限公司 Operator fusion method, device, terminal equipment and storage medium
CN114330668B (en) * 2021-12-31 2024-11-26 成都商汤科技有限公司 Model processing method, device, electronic device and computer storage medium
CN114115834B (en) * 2022-01-25 2022-04-26 之江实验室 A kind of software and hardware co-compiling processing method and system
CN114428616A (en) * 2022-04-01 2022-05-03 北京清微智能信息技术有限公司 Method for optimizing replacement cost in neural network compiling stage
CN114841327B (en) * 2022-05-27 2025-02-18 抖音视界有限公司 Computational graph processing method, device, readable medium and electronic device
CN114995822B (en) * 2022-06-07 2024-08-23 重庆大学 Deep learning compiler optimization method specifically for CNN accelerator
CN114995823B (en) * 2022-06-07 2024-08-20 重庆大学 A deep learning compiler optimization method for CNN dedicated accelerator
CN114936099B (en) * 2022-07-25 2022-09-30 之江实验室 Graph optimization method and device for neural network calculation
CN117709403A (en) * 2022-09-07 2024-03-15 华为云计算技术有限公司 Model optimization method and device and computing equipment
CN115509539A (en) * 2022-09-28 2022-12-23 苏州浪潮智能科技有限公司 Data calling method, device, equipment and medium
CN115563581B (en) * 2022-10-17 2026-04-24 上海壁仞科技股份有限公司 Operator fusion method, computing device, computing equipment, and readable storage medium
CN115659281B (en) * 2022-11-16 2023-10-27 之江实验室 Method and device for fusing adaptive acceleration operators
CN116301904B (en) * 2023-05-18 2023-08-22 之江实验室 An operator optimization acceleration method and device for deep learning compiler
CN116976431A (en) * 2023-07-27 2023-10-31 九识(苏州)智能科技有限公司 Compilation optimization methods, devices, storage media and equipment for deep learning compilers
CN119002931B (en) * 2024-10-25 2024-12-27 浪潮电子信息产业股份有限公司 Method, device, equipment and storage medium for circularly expanding kernel codes
CN120430354B (en) * 2025-07-03 2025-08-26 沪渝人工智能研究院 Operator scheduling and high-low scanning lightweight acceleration method for dynamic and static combination
CN121255287B (en) * 2025-12-04 2026-03-17 山东大学 Register dynamic grouping method and system based on vector expansion

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9710238B2 (en) * 2015-03-26 2017-07-18 IfWizard Corporation Automatically optimizing analytics database server
CN110766147B (en) * 2018-07-25 2022-10-11 赛灵思公司 Neural network compiler architecture and compiling method
CN110490309B (en) * 2019-08-14 2022-06-07 中科寒武纪科技股份有限公司 Operator fusion method for neural network and related product thereof

Also Published As

Publication number Publication date
CN112579063A (en) 2021-03-30

Similar Documents

Publication Publication Date Title
CN112579063B (en) Acceleration method for exploring optimization space in deep learning compiler
US11803404B2 (en) Deep learning algorithm compiling method, device, and related product
Bik et al. Compiler support for sparse tensor computations in MLIR
CN113031966B (en) Deep learning compilation optimization method for intelligently selecting compilation acceleration library
US10901715B1 (en) Lazy compilation and kernel fusion in dynamic computation graphs
CN112183735B (en) Method, device and related products for generating operation data
CN115423082B (en) An Automatic Optimization Method for Deep Model Computation Graphs Related to Hardware Characteristics
CN116368494B (en) A neural network compilation optimization method and related device
Le et al. Tflms: Large model support in tensorflow by graph rewriting
CN114691148B (en) Model reasoning acceleration method, device, electronic device and storage medium
CN119512557B (en) Domain-specific language compiler optimization method and system based on multi-layer intermediate representation
CN111563586A (en) Splitting method of neural network model and related product
CN119990265A (en) Large language model performance optimization method, device, computer equipment, readable storage medium and program product
US20240256241A1 (en) Composable kernels
CN111563584B (en) Splitting method of neural network model and related product
Ivanov et al. Sten: Productive and efficient sparsity in pytorch
CN120832935A (en) Neural network model processing method, device, equipment and storage medium
CN120406911A (en) Automatic generation method and device for high-performance sparse tensor program
JP7624056B2 (en) Method, apparatus and electronic device for generating instructions for neural network accelerators
CN111563585B (en) Splitting method of neural network model and related product
Bachiri et al. F.: A Literature review on combining neural architecture search and compiler optimizations for neural network acceleration
CN121209886B (en) Deep learning model-oriented graph level compiling optimization method and device
Qu et al. AutoSparse: A Source-to-Source Format and Schedule Auto-Tuning Framework for Sparse Tensor Program
Yao et al. Hiperti: high performance system for cross-platform code generation of transformer model inference based on MLIR: J. Yao et al.
Long et al. FusionStitching: Boosting execution efficiency of memory intensive computations for DL workloads

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant