CN112579063B - Acceleration method for exploring optimization space in deep learning compiler - Google Patents
Acceleration method for exploring optimization space in deep learning compiler
- Publication number
- CN112579063B
- Authority
- CN
- China
- Prior art keywords
- operator
- optimization
- graph
- space
- operators
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/30—Creation or generation of source code
- G06F8/37—Compiler construction; Parser generation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses an acceleration method for exploring the optimization space in a deep learning compiler. It aims to optimize neural network performance through compilation techniques while greatly reducing the time the compiler spends exploring operator optimization spaces. The method first abstracts the neural network into a computation graph. Second, it performs graph optimization on the computation graph and defines an optimization space for each operator in the optimized graph. Third, it provides a method for calculating the optimization-space similarity of operators, based on the operators' optimization-space information. Finally, it provides a similarity-based method for exploring the operator state space: operators are clustered by similarity, the full optimization space is explored for the core operator of each cluster, the remaining operators of the same cluster are explored only within the best schemes found for the core operator, and the optimal scheme for every operator of the whole neural network is thereby determined.
Description
Technical Field
The invention relates to the intersection of deep learning applications, compilation technology, and high-performance computing, and in particular to an acceleration method for exploring the optimization space in a deep learning compiler.
Background
Deep neural networks (DNNs) are now widely applied in image classification, natural language processing, autonomous driving, augmented reality, and other AI fields. With the rapid development of computing devices such as GPUs, FPGAs, and purpose-built neural network accelerators, the computing power available to DNNs keeps growing, and so does the demand for efficient DNNs in artificial intelligence. Improving the execution efficiency of DNNs has therefore become an important research problem in recent years.
Many deep learning frameworks, such as TensorFlow, PyTorch, Caffe, and MXNet, represent a neural network as a computation graph, perform graph-level optimization on it, and then map the DNN operators to third-party acceleration libraries such as cuDNN and MKL-DNN to run the network efficiently. However, graph-level optimization is generally hardware-independent and cannot deliver finer-grained optimization tailored to hardware characteristics. Furthermore, the third-party acceleration libraries relied upon are generally not open source, which prevents programmers from controlling them effectively and from easily porting DNNs across hardware devices. In addition, operators that the third-party library does not support either cannot be optimized or require a great deal of manual tuning by the programmer.
In DNN acceleration research, one approach has achieved remarkable results: using compilation technology to map neural networks from different frameworks onto various hardware platforms, accelerating the network during the mapping, and generating optimized code for the target platform. A reasonably efficient neural network compiler executes as follows. The neural network from any of the deep learning frameworks is first expressed as a computation graph in a high-level intermediate language, and graph-level optimization is performed on that graph. The optimized computation graph is then converted to a low-level intermediate representation, where operator-level optimization is performed. Finally, optimized code is generated for the target hardware platform.
For operator-level optimization, the optimization space of each operator is defined in advance and then explored with a machine learning method to find the best optimization scheme. Each operator's optimization space is very large: a Conv operator, for example, can have hundreds of millions of possible optimization schemes. Exploring these spaces is therefore time-consuming; for example, exploring the optimization schemes of a Yolo network takes more than one day.
Disclosure of Invention
To overcome the shortcomings of the prior art and to greatly reduce the time the compiler spends exploring operator optimization spaces, at the cost of an acceptable increase in the inference time of the deep learning network, the invention adopts the following technical solution:
An acceleration method for exploring the optimization space in a deep learning compiler, comprising the following steps:
S1, abstracting the neural network and representing it as a computation graph;
S2, performing graph optimization on the computation graph;
S3, defining an optimization space for each operator in the optimized computation graph, and calculating optimization-space similarity based on the operators' optimization-space information;
S4, exploring the operator state space based on similarity: clustering the operators by similarity, exploring the full optimization space of the core operator of each cluster, exploring the other operators of the same cluster only within the best schemes of the core operator, determining the optimal scheme for every operator of the whole neural network, and generating target platform code according to the hardware platform, comprising the following steps:
S41, calculating the similarity matrix of the operators;
S42, performing AP clustering with the similarity matrix as input; the AP clustering algorithm does not need the number of clusters to be fixed in advance, and the center of each resulting cluster is one of the input operators, so no core operator has to be selected for each cluster afterwards;
S43, for the core operator of each cluster, exploring the complete optimization space of that operator and saving the n optimization schemes with the shortest inference time found during the exploration;
S44, for each non-core operator of each cluster, exploring only the n best schemes found for the core operator of its cluster;
S45, generating target platform code for each operator according to its optimization scheme, and deploying the operator code to the hardware in the order given by the computation graph to run the neural network.
In this way, the time the compiler spends exploring operator optimization spaces is greatly reduced, at the cost of an acceptable increase in the inference time of the deep learning network.
Further, the computation-graph representation of the neural network in step S1 comprises the following steps:
S11, mapping the neural network built in a deep learning framework onto a well-defined high-level intermediate language, HIR;
S12, analyzing the attributes of each operator based on the high-level intermediate language and constructing a computation graph from the data dependencies among the operators, wherein the constructed computation graph is a directed acyclic graph, each node represents one operator of the neural network, and each edge represents a data dependency between operators. The method implements a high-level intermediate language (HIR), a domain-specific language (DSL) that can represent the computation and control flow of a neural network; a neural network in TensorFlow, PyTorch, or ONNX format is mapped onto the HIR and represented in it.
Further, the graph optimization of the computation graph in step S2 comprises the following steps:
S21, performing operator fusion according to the computation type of the operators, wherein operator fusion combines several basic operators into one composite operator so that intermediate results need not be stored, which reduces unnecessary memory read and write time and improves cache locality;
S22, performing data layout optimization on the fused computation graph according to hardware characteristics;
S23, merging parallel operators in the computation graph after data layout optimization.
Further, step S21 specifically comprises: first constructing a dominator tree, then traversing the nodes of the dominator tree, and, if the nodes on the path from a node to its dominator satisfy a predefined fusion rule, fusing those operators into a new composite operator.
Further, step S22 specifically comprises: for the computation graph after operator fusion, checking whether a data layout scheme was specified with the input graph; if so, applying the specified scheme directly; if not, selecting the best data layout scheme according to hardware characteristics, wherein the data layout schemes include row-major storage and column-major storage.
Further, step S23 specifically comprises: merging multiple operators that share the same input into one larger operator, so that a larger kernel is generated for the GPU, which reduces GPU kernel launch overhead and improves GPU utilization.
Further, in step S3, the graph-optimized computation graph is mapped onto the LIR and represented in it, and the optimization space of each operator is defined. The method implements a low-level intermediate language (LIR), a finer-grained intermediate representation on which operator-level optimization and target platform code generation can be performed.
Further, an operator can undergo tiling in multiple dimensions and multiple loop-unrolling optimizations. Tiling splits one dimension of the operator: a dimension of original length l is tiled into m dimensions, for which there are k tiling schemes in total. Likewise, every optimization operation has some number k of candidate schemes, and the size of the operator's whole optimization space is the product of the scheme counts of all its optimization operations.
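The counting described here can be written down compactly. The formula below is a hedged formalization, not part of the original text: it assumes that a tiling scheme is an ordered choice of m positive integer factors whose product is l (an assumption that is consistent with the 56 schemes quoted for l = 32 and m = 4 in the detailed description), and the symbols e_i, p_i, J and \Omega are introduced here only for illustration.

```latex
% Number of ways to tile a dimension of length l into m ordered factors,
% where l = \prod_i p_i^{e_i} is the prime factorization of l:
k(l, m) = \prod_i \binom{e_i + m - 1}{m - 1}
% Example: l = 32 = 2^5, m = 4 gives \binom{8}{3} = 56.

% Size of the whole optimization space of an operator with J optimization
% operations, the j-th of which has k_j candidate schemes:
|\Omega| = \prod_{j=1}^{J} k_j
```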
Further, in step S3, the method for calculating the optimization-space similarity comprises the following steps:
S31, defining a hash method for each optimization operation according to the optimization-space attributes of the operator, abstracting the optimization operation into hash values;
S32, vectorizing each pair of operators: appending the hash values as vector values in the order of the optimization operations, bringing the vectors of the two operators to equal length by zero padding, and concatenating the vectors of all optimization operations in the space in order to form the vector of the whole space;
S33, calculating the similarity of the concatenated vectors of the pair of operators.
Furthermore, the similarity calculation takes the cosine of the pair of concatenated vectors as the similarity of the pair of operators. Using the cosine as the similarity measure gives the best classification results.
The advantages and beneficial effects of the invention are:
Neural networks produced by various deep learning frameworks are mapped onto a unified intermediate language from which code for various hardware platforms can be generated; this spares programmers the cost of converting models between development frameworks and makes it possible to deploy neural networks across hardware devices. The front end optimizes the neural network at the graph level and the back end optimizes it at the operator level; the whole optimization process is automatic and generates efficient optimized code for the hardware platform, so programmers do not need to spend large amounts of time and effort on manual tuning. When the back end performs operator optimization, it applies a clustering-based scheme for exploring the optimization space, which greatly reduces the time spent on that exploration.
Drawings
FIG. 1 is a flow chart of an acceleration method for exploring an optimization space in a deep learning compiler according to the present invention.
FIG. 2 is a schematic diagram of the Conv-BN-Relu module calculation in the present invention.
FIG. 3 is a schematic diagram of pre-optimization calculations in the present invention.
FIG. 4 is a schematic diagram of calculation after optimization of an operator in the invention.
FIG. 5 is a schematic diagram of the parallel Conv operator optimized calculation in the invention.
FIG. 6 is a schematic diagram of an operator optimization space in the invention.
FIG. 7 is a schematic diagram of the operator optimization space vectorization in the invention.
FIG. 8 is a flow chart of neural network operator level optimization in the invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
As shown in FIG. 1, the acceleration method for exploring the optimization space in a deep learning compiler aims to greatly reduce the time the compiler spends exploring operator optimization spaces, at the cost of an acceptable increase in the inference time of the deep learning network. The method first abstracts the neural network into a computation graph. Second, it performs graph optimization on the computation graph and defines an optimization space for each operator in the optimized graph. Third, it provides a method for calculating the optimization-space similarity of operators, based on the operators' optimization-space information. Finally, it provides a similarity-based method for exploring the operator state space: operators are clustered by similarity, the full optimization space is explored for the core operator of each cluster, the remaining operators of the same cluster are explored only within the best schemes found for the core operator, the optimal scheme for every operator of the whole neural network is determined, and target platform code is generated according to the hardware platform.
The method comprises a front end and a back end. The front end takes a model generated by a deep learning framework as input, abstracts it into a computation graph expressed in a high-level intermediate language, and performs graph optimization. The back end takes the front-end-optimized computation graph as input, expresses it in a low-level intermediate language, performs operator optimization while accelerating the exploration process, and finally generates target platform code according to the hardware platform.
A specific embodiment of the invention is as follows:
1) Represent the model generated by the deep learning framework as a computation graph.
1.1) The method implements a high-level intermediate language, HIR, a domain-specific language (DSL) that can represent the computation and control flow of a neural network.
1.2) A neural network in TensorFlow, PyTorch, or ONNX format is mapped onto the HIR and represented in it.
1.3) A computation graph is constructed from the converted HIR. The computation graph is a directed acyclic graph in which each node represents one operator of the neural network and each edge represents a data dependency between operators. The computation graph captures the control flow and the dependencies between operators and data, and provides an interface for graph-level optimization. FIG. 2 shows the computation graph generated for a simple Conv-BN-Relu module of a neural network. Each rounded rectangle represents an operator node (three in this example) and each edge represents a data dependency between operators: for example, the Conv operator depends on the input data and the weight W1, and the BN operator depends on the result of the preceding Conv operator.
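The Conv-BN-Relu graph described above can be sketched in a few lines of Python. This is only an illustration of the data structure, not the HIR itself; the class name `ComputationGraph`, its methods, and the node names are hypothetical.

```python
class ComputationGraph:
    """A tiny DAG: nodes are operators (or data), edges are data dependencies."""

    def __init__(self):
        self.op_type = {}   # node name -> operator type
        self.inputs = {}    # node name -> names of the nodes it depends on

    def add_op(self, name, op_type, inputs=()):
        self.op_type[name] = op_type
        self.inputs[name] = list(inputs)

    def topo_order(self):
        """Return the nodes so that every node appears after its dependencies."""
        order, seen = [], set()

        def visit(node):
            if node in seen:
                return
            seen.add(node)
            for dep in self.inputs[node]:
                visit(dep)
            order.append(node)

        for node in self.op_type:
            visit(node)
        return order


# The Conv-BN-Relu module: Conv depends on the input data and the weight W1,
# BN depends on Conv, and Relu depends on BN.
g = ComputationGraph()
g.add_op("input", "data")
g.add_op("W1", "data")
g.add_op("conv", "Conv", ["input", "W1"])
g.add_op("bn", "BN", ["conv"])
g.add_op("relu", "Relu", ["bn"])
order = g.topo_order()
```

A topological order such as the one returned by `topo_order` is also the order in which operator code can later be deployed, since every operator then runs after the operators it depends on.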
2) Optimize the neural network at the graph level based on the computation graph.
2.1) Perform operator fusion on the computation graph. Operator fusion is an optimization technique that combines several basic operators into one composite operator, so that intermediate results need not be stored; this reduces unnecessary memory reads and writes and improves cache locality. For a given computation graph, a dominator tree is constructed first; the nodes of the dominator tree are then traversed, and if the nodes on the path from a node to its dominator satisfy a predefined fusion rule, those operators are fused into a new composite operator. FIG. 3 shows a computation graph before operator fusion. Its dominator tree is computed first: for example, the dominator of node 2 is node 1 and the dominator of node 3 is node 2. The dominator tree is then traversed and the rules are matched. Node 2 and its dominator, node 1, satisfy the fusion condition and are fused into a new node 1; the dominator of node 3 then becomes the fused node 1, and node 3 satisfies the fusion condition with it as well. In this way the three Conv-BN-Relu nodes 1, 2, and 3 are fused into one new node, denoted CBR. Applying the rule to the whole computation graph yields the form shown in FIG. 4.
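The fusion walk can be sketched as follows. This is a deliberately simplified illustration: it assumes every operator consumes exactly one other operator, so that its immediate dominator is simply its producer, and it does not build a general dominator tree. The rule table `FUSION_RULES`, the function `fuse_chain`, and the node names are hypothetical; the source does not list its fusion rules.

```python
# Hypothetical fusion rules: a group made of these op types may absorb the next op type.
FUSION_RULES = {("Conv",): "BN", ("Conv", "BN"): "Relu"}


def fuse_chain(ops, producer):
    """Group operators into fused composite operators.

    ops: dict name -> op type, listed in topological order.
    producer: dict name -> the single operator it consumes (None for graph inputs).
    Returns a dict mapping each group head to the list of operators fused into it.
    """
    consumers = {}
    for name, src in producer.items():
        if src is not None:
            consumers[src] = consumers.get(src, 0) + 1

    group_of = {}  # operator -> head of the group it belongs to
    members = {}   # group head -> operators in the group, in order
    for name in ops:
        dom = producer[name]
        head = group_of.get(dom)
        # Fuse only if the dominator's result is used nowhere else, so the
        # intermediate result really does not need to be stored.
        if head is not None and consumers[dom] == 1:
            types = tuple(ops[m] for m in members[head])
            if FUSION_RULES.get(types) == ops[name]:
                members[head].append(name)
                group_of[name] = head
                continue
        group_of[name] = name
        members[name] = [name]
    return members


ops = {"n1": "Conv", "n2": "BN", "n3": "Relu", "n4": "Pool"}
producer = {"n1": None, "n2": "n1", "n3": "n2", "n4": "n3"}
fused = fuse_chain(ops, producer)
# fused == {"n1": ["n1", "n2", "n3"], "n4": ["n4"]}
```

Nodes n1, n2 and n3 end up in one group, which corresponds to the CBR composite operator, while the Pool operator matches no rule and stays on its own.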
2.2) Perform data layout optimization on the computation graph after operator fusion. The data of each operator in the computation graph can be stored on the physical device in several ways. For the fused computation graph, the method first checks whether a data layout scheme was specified with the input graph; if so, that scheme is applied directly. If not, the best data layout scheme, such as row-major or column-major storage, is selected according to hardware characteristics. The most basic choice explored is whether data should be stored in NHWC or NCHW format.
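To make the NCHW versus NHWC choice concrete, the sketch below computes where one tensor element lands in flat memory under each layout, and mirrors the "use the specified scheme, otherwise fall back to a hardware-dependent choice" rule. The function names and the `hardware_default` parameter are hypothetical; which layout a given device prefers is not stated in the source.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Flat offset of element (n, c, h, w) when the width index varies fastest."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Flat offset of the same element when the channel index varies fastest."""
    return ((n * H + h) * W + w) * C + c


def pick_layout(specified=None, hardware_default="NCHW"):
    """Apply the layout given with the input graph, else a hardware-dependent default."""
    return specified if specified is not None else hardware_default


C, H, W = 3, 2, 2
# Moving to the next channel jumps H*W elements in NCHW but only 1 in NHWC.
step_c_nchw = offset_nchw(0, 1, 0, 0, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W)
step_c_nhwc = offset_nhwc(0, 1, 0, 0, C, H, W) - offset_nhwc(0, 0, 0, 0, C, H, W)
```

The two strides show why the choice matters: code that iterates over channels in its innermost loop touches contiguous memory in NHWC but strided memory in NCHW, and the reverse holds for loops over width.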
2.3) Merge parallel Conv operators in the computation graph after data layout optimization. If several Conv operators in the computation graph share the same input, they are merged into one larger Conv operator, so that a larger kernel is generated for the GPU; this reduces GPU kernel launch overhead and improves GPU utilization. For example, the three 1x1 CBR operators in FIG. 4 accept the same input and perform the same operation, so they can be merged into one larger 1x1 CBR operator, as shown in FIG. 5.
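Why the merge preserves results can be checked on a small case. The sketch below treats a 1x1 convolution as a per-pixel matrix product, stacks the weights of three parallel convolutions along the output-channel axis, runs one larger convolution, and splits the output again. It covers only the Conv part of a CBR operator, uses plain Python lists rather than a GPU kernel, and all function names and values are illustrative.

```python
def conv1x1(x, weight):
    """1x1 convolution. x: [C_in][P] (P pixels per channel), weight: [C_out][C_in]."""
    pixels = len(x[0])
    return [[sum(w_row[c] * x[c][p] for c in range(len(x))) for p in range(pixels)]
            for w_row in weight]


def merge_parallel(weights):
    """Stack the weights of parallel convs along the output-channel axis."""
    merged = [row for w in weights for row in w]
    sizes = [len(w) for w in weights]
    return merged, sizes


def split_outputs(out, sizes):
    """Cut the merged output back into one output per original conv."""
    parts, start = [], 0
    for size in sizes:
        parts.append(out[start:start + size])
        start += size
    return parts


x = [[1, 2], [3, 4]]                      # 2 input channels, 2 pixels
w_a, w_b, w_c = [[1, 0]], [[0, 1], [1, 1]], [[2, -1]]
separate = [conv1x1(x, w) for w in (w_a, w_b, w_c)]   # three small kernels
merged_w, sizes = merge_parallel([w_a, w_b, w_c])
merged = split_outputs(conv1x1(x, merged_w), sizes)   # one larger kernel
```

Here `merged` equals `separate`, so a single launch of the larger operator replaces three launches of the smaller ones without changing the computed values.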
3) Calculate the optimization-space similarity of the neural network operators.
3.1) The method implements a low-level intermediate language, LIR, a finer-grained intermediate representation on which operator-level optimization and target platform code generation can be performed.
3.2) Map the graph-optimized computation graph onto the LIR, represent it in the LIR, and define the optimization space of each operator. In FIG. 6, the left side shows a Conv operator in NCHW layout and the right side shows the optimization space defined for it. The Conv operator admits tiling in six dimensions and two loop-unrolling optimizations. Take the first optimization operation, tile_f, as an example: it tiles the operator's f dimension, whose original length is 32, into 4 dimensions, for which there are 56 tiling schemes in total. Likewise, the number of candidate schemes of every optimization operation is the value in the rightmost column, and the size of the Conv operator's whole optimization space is the product of these counts, about 130 million.
Tiling splits one dimension of the original space into m dimensions, for example tiling dimension x into [x1, x2, x3, x4]. In the example above, the dimension f of length 32 is tiled into 4 dimensions. Once the number of tiled dimensions is fixed at 4, tiling is equivalent to splitting the length 32 into a 4-level loop nest whose extents multiply to 32. The possible factorizations are [(1 × 1 × 1 × 32), (1 × 1 × 2 × 16), (1 × 1 × 4 × 8), (1 × 2 × 2 × 8), (1 × 2 × 4 × 4), (2 × 2 × 2 × 4)]; counting the different orders of the factors, they yield [4, 12, 12, 12, 12, 4] tiling schemes respectively, for 56 tiling schemes in total.
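The count of 56 can be reproduced by brute-force enumeration. The sketch below assumes that a tiling scheme is an ordered tuple of 4 positive integers whose product is 32; the function name `tilings` is hypothetical.

```python
from collections import Counter


def tilings(length, parts):
    """All ordered tuples of `parts` positive integers whose product is `length`."""
    if parts == 1:
        return [(length,)]
    result = []
    for first in range(1, length + 1):
        if length % first == 0:
            result += [(first,) + rest for rest in tilings(length // first, parts - 1)]
    return result


schemes = tilings(32, 4)
# Group the ordered schemes by their unordered factorization.
by_factorization = Counter(tuple(sorted(s)) for s in schemes)
```

`len(schemes)` is 56, and `by_factorization` holds six unordered factorizations, with 4 orderings each for 1 × 1 × 1 × 32 and 2 × 2 × 2 × 4 and 12 orderings for each of the other four.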
3.3) Calculate the optimization-space similarity of the operators. First, a hash method is defined for each optimization operation according to the optimization-space attributes of the operator, abstracting the optimization operation into hash values. Each pair of operators is then vectorized: the hash values are appended as vector values in the order of the optimization operations, and the vectors of the two operators are brought to equal length by zero padding. FIG. 7 shows an example of calculating the optimization-space vectors of a pair of operators. For two spaces, space 1 and space 2, with the same sequence of optimization operations, a pair of corresponding optimization operations, tile_rc 1 and tile_rc 2, is selected in turn and vectorized from their hash values. After vectorization the vector of tile_rc 2 is shorter than that of tile_rc 1, so zeros are appended to the tile_rc 2 vector until it is as long as the tile_rc 1 vector; the resulting vectors are vec 1 and vec 2. In the same way, the vectors of all optimization operations in a space are concatenated in order to form the vector of the whole space. Finally, the cosine of the two vectors is taken as the similarity of the pair of operators. Other measures such as Jaccard similarity were also tried for measuring search-space similarity, but their classification results were not as good as those of the cosine, so the cosine was adopted as the similarity metric.
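The padding, concatenation, and cosine steps can be sketched as below. The source does not spell out its hash method, so `encode_op` is a hypothetical stand-in that simply lists the divisors of a loop extent; only the zero padding, the ordered concatenation, and the cosine follow the description.

```python
import math


def encode_op(extent):
    """Hypothetical stand-in for the per-operation hash: the divisors of the extent."""
    return [d for d in range(1, extent + 1) if extent % d == 0]


def space_vectors(space_a, space_b):
    """Vectorize two spaces given as lists of loop extents, one per optimization
    operation and in the same operation order. Each pair of per-operation vectors
    is zero-padded to equal length before being appended."""
    vec_a, vec_b = [], []
    for ext_a, ext_b in zip(space_a, space_b):
        a, b = encode_op(ext_a), encode_op(ext_b)
        width = max(len(a), len(b))
        vec_a += a + [0] * (width - len(a))
        vec_b += b + [0] * (width - len(b))
    return vec_a, vec_b


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


vec1, vec2 = space_vectors([32, 3], [16, 3])
similarity = cosine(vec1, vec2)
```

With these inputs the first operation encodes to 6 values for extent 32 and 5 values for extent 16, so one zero is appended to the shorter vector before the second operation's values are concatenated.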
4) Perform operator-level optimization on the neural network and generate target platform code. The execution flow is shown in FIG. 8.
4.1) Calculate the similarity matrix of the operators.
4.2) Perform AP clustering (Affinity Propagation clustering) with the similarity matrix as input. The AP clustering algorithm does not need the number of clusters to be fixed in advance, and the center of each resulting cluster is one of the input operators, so no core operator has to be computed for each cluster afterwards.
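For readers unfamiliar with affinity propagation, a minimal pure-Python sketch of its message passing is given below. It is not the implementation used by the invention. The damping factor, the iteration count, the use of the median off-diagonal similarity as the preference, and the fallback when no exemplar emerges are all assumptions made here, and the similarity values are made up for illustration.

```python
def affinity_propagation(similarity, damping=0.5, iterations=200):
    """Cluster items from a similarity matrix; returns (exemplars, labels)."""
    n = len(similarity)
    s = [row[:] for row in similarity]
    # Assumption: the preference (self-similarity) is the median off-diagonal value.
    off_diag = sorted(s[i][k] for i in range(n) for k in range(n) if i != k)
    preference = off_diag[len(off_diag) // 2]
    for k in range(n):
        s[k][k] = preference
    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iterations):
        # Responsibility: how well suited k is as exemplar for i, versus rivals.
        for i in range(n):
            scores = [a[i][k] + s[i][k] for k in range(n)]
            for k in range(n):
                rival = max(v for j, v in enumerate(scores) if j != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - rival)
        # Availability: how much support k has gathered from the other items.
        for k in range(n):
            positive = [max(0.0, r[i][k]) for i in range(n)]
            support = sum(positive) - positive[k]
            for i in range(n):
                new = support if i == k else min(0.0, r[k][k] + support - positive[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * new
    evidence = [r[k][k] + a[k][k] for k in range(n)]
    exemplars = [k for k in range(n) if evidence[k] > 0]
    if not exemplars:  # assumption: fall back to the single strongest candidate
        exemplars = [max(range(n), key=lambda k: evidence[k])]
    labels = [i if i in exemplars else max(exemplars, key=lambda k: s[i][k])
              for i in range(n)]
    return exemplars, labels


# Illustrative similarity matrix for five operators (values are made up).
sim = [
    [1.00, 0.95, 0.90, 0.20, 0.10],
    [0.95, 1.00, 0.92, 0.15, 0.12],
    [0.90, 0.92, 1.00, 0.18, 0.14],
    [0.20, 0.15, 0.18, 1.00, 0.88],
    [0.10, 0.12, 0.14, 0.88, 1.00],
]
exemplars, labels = affinity_propagation(sim)
```

The returned exemplars are themselves input items, which is the property the text relies on: each exemplar can serve directly as the core operator of its cluster, and `labels[i]` names the core operator that operator `i` belongs to.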
4.3) For the core operator of each cluster, explore the complete optimization space of that operator and save the n optimization schemes with the shortest inference time found during the exploration. For the Yolo v3-tiny model, for example, the clustering algorithm divides the original 20 operators into 8 clusters. Complete optimization-space exploration is then needed only for the 8 core operators, and the 10 optimization schemes with the shortest inference time are saved for each of them.
4.4) For the non-core operators of each cluster, only the n best schemes found for the core operator of the cluster need to be explored. For the Yolo v3-tiny model, for example, each of the 12 operators that are not cluster centers is searched only within the 10 best optimization schemes of the core operator of its cluster.
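Steps 4.3 and 4.4 can be sketched as a two-level search. The sketch below exhaustively measures each core operator's space for simplicity, whereas the background describes a machine-learning-guided exploration; the `measure` callback, the toy cost model, and the operator names are hypothetical.

```python
import heapq


def explore(clusters, spaces, measure, n=10):
    """Two-level exploration.

    clusters: core operator -> list of non-core operators in its cluster.
    spaces: core operator -> iterable of candidate schemes (its full space).
    measure(op, scheme): inference time of `op` under `scheme` (hypothetical).
    Returns (best scheme per operator, total number of measurements).
    """
    best, trials = {}, 0
    for core, members in clusters.items():
        timed = []
        for scheme in spaces[core]:          # full exploration for the core operator
            timed.append((measure(core, scheme), scheme))
            trials += 1
        top_n = [s for _, s in heapq.nsmallest(n, timed, key=lambda t: t[0])]
        best[core] = top_n[0]
        for op in members:                   # non-core operators try only the top n
            best[op] = min(top_n, key=lambda s: measure(op, s))
            trials += len(top_n)
    return best, trials


# Toy cost model: each operator is fastest at its own target scheme index.
target = {"conv_a": 40, "conv_b": 43}


def toy_measure(op, scheme):
    return abs(scheme - target[op])


best, trials = explore({"conv_a": ["conv_b"]}, {"conv_a": range(100)}, toy_measure, n=10)
```

In this toy run the core operator costs 100 measurements and the non-core operator only 10, so 110 measurements replace the 200 that exploring both full spaces would need. The saving depends on the non-core operator's good schemes lying among the core operator's top n, which is what the similarity clustering is meant to make likely.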
4.5) Generate target platform code for each operator according to its optimization scheme, and deploy the operator code to the hardware in the order given by the computation graph to run the neural network. For code to be deployed on a CPU, the third-party tool LLVM is called to generate the corresponding C code; for an Nvidia GPU, the corresponding CUDA code is generated and then deployed to the GPU to run.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110223874.3A CN112579063B (en) | 2021-03-01 | 2021-03-01 | Acceleration method for exploring optimization space in deep learning compiler |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110223874.3A CN112579063B (en) | 2021-03-01 | 2021-03-01 | Acceleration method for exploring optimization space in deep learning compiler |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN112579063A (en) | 2021-03-30 |
| CN112579063B (en) | 2021-06-08 |
Family
ID=75114093
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110223874.3A Active CN112579063B (en) | 2021-03-01 | 2021-03-01 | Acceleration method for exploring optimization space in deep learning compiler |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN112579063B (en) |
Families Citing this family (27)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115169522B (en) * | 2021-04-01 | 2025-08-29 | 浙江菜鸟供应链管理有限公司 | Method, device and apparatus for determining graph structure of Pytorch neural network |
| CN113031966B (en) * | 2021-05-20 | 2021-09-21 | 之江实验室 | Deep learning compilation optimization method for intelligently selecting compilation acceleration library |
| CN113703768B (en) * | 2021-07-13 | 2024-10-11 | 清华大学 | Tensor program optimization method and device |
| CN113656563B (en) * | 2021-07-15 | 2024-06-28 | 华为技术有限公司 | A neural network search method and related equipment |
| CN113885871B (en) * | 2021-09-13 | 2025-03-28 | 清华大学 | Dedicated backend code generation method and device supporting machine learning training |
| CN115983356B (en) * | 2021-10-14 | 2026-04-14 | 上海交通大学 | Tensor calculation unit-oriented convolution operator optimization implementation method |
| CN113961267B (en) * | 2021-10-15 | 2023-08-25 | 杭州海康威视数字技术股份有限公司 | Service processing method, device and equipment |
| CN116011468A (en) * | 2021-10-20 | 2023-04-25 | 珠海金山办公软件有限公司 | Reasoning method of deep learning model, machine translation method and device |
| CN113703741B (en) * | 2021-10-29 | 2022-02-22 | 深圳思谋信息科技有限公司 | Neural network compiler configuration method and device, computer equipment and storage medium |
| CN114219081A (en) * | 2021-12-19 | 2022-03-22 | 复旦大学 | Neural network precompilation algorithm for dedicated accelerator |
| CN114418114B (en) * | 2021-12-30 | 2025-06-03 | 深圳云天励飞技术股份有限公司 | Operator fusion method, device, terminal equipment and storage medium |
| CN114330668B (en) * | 2021-12-31 | 2024-11-26 | 成都商汤科技有限公司 | Model processing method, device, electronic device and computer storage medium |
| CN114115834B (en) * | 2022-01-25 | 2022-04-26 | 之江实验室 | A kind of software and hardware co-compiling processing method and system |
| CN114428616A (en) * | 2022-04-01 | 2022-05-03 | 北京清微智能信息技术有限公司 | Method for optimizing replacement cost in neural network compiling stage |
| CN114841327B (en) * | 2022-05-27 | 2025-02-18 | 抖音视界有限公司 | Computational graph processing method, device, readable medium and electronic device |
| CN114995822B (en) * | 2022-06-07 | 2024-08-23 | 重庆大学 | Deep learning compiler optimization method specifically for CNN accelerator |
| CN114995823B (en) * | 2022-06-07 | 2024-08-20 | 重庆大学 | A deep learning compiler optimization method for CNN dedicated accelerator |
| CN114936099B (en) * | 2022-07-25 | 2022-09-30 | 之江实验室 | Graph optimization method and device for neural network calculation |
| CN117709403A (en) * | 2022-09-07 | 2024-03-15 | 华为云计算技术有限公司 | Model optimization method and device and computing equipment |
| CN115509539A (en) * | 2022-09-28 | 2022-12-23 | 苏州浪潮智能科技有限公司 | Data calling method, device, equipment and medium |
| CN115563581B (en) * | 2022-10-17 | 2026-04-24 | 上海壁仞科技股份有限公司 | Operator fusion method, computing device, computing equipment, and readable storage medium |
| CN115659281B (en) * | 2022-11-16 | 2023-10-27 | 之江实验室 | Method and device for fusing adaptive acceleration operators |
| CN116301904B (en) * | 2023-05-18 | 2023-08-22 | 之江实验室 | An operator optimization acceleration method and device for deep learning compiler |
| CN116976431A (en) * | 2023-07-27 | 2023-10-31 | 九识(苏州)智能科技有限公司 | Compilation optimization methods, devices, storage media and equipment for deep learning compilers |
| CN119002931B (en) * | 2024-10-25 | 2024-12-27 | 浪潮电子信息产业股份有限公司 | Method, device, equipment and storage medium for circularly expanding kernel codes |
| CN120430354B (en) * | 2025-07-03 | 2025-08-26 | 沪渝人工智能研究院 | Operator scheduling and high-low scanning lightweight acceleration method for dynamic and static combination |
| CN121255287B (en) * | 2025-12-04 | 2026-03-17 | 山东大学 | Register dynamic grouping method and system based on vector expansion |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9710238B2 (en) * | 2015-03-26 | 2017-07-18 | IfWizard Corporation | Automatically optimizing analytics database server |
| CN110766147B (en) * | 2018-07-25 | 2022-10-11 | 赛灵思公司 | Neural network compiler architecture and compiling method |
| CN110490309B (en) * | 2019-08-14 | 2022-06-07 | 中科寒武纪科技股份有限公司 | Operator fusion method for neural network and related product thereof |
2021
- 2021-03-01: CN application CN202110223874.3A (CN112579063B), status: Active
Also Published As
| Publication number | Publication date |
|---|---|
| CN112579063A (en) | 2021-03-30 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN112579063B (en) | | Acceleration method for exploring optimization space in deep learning compiler |
| US11803404B2 (en) | | Deep learning algorithm compiling method, device, and related product |
| Bik et al. | | Compiler support for sparse tensor computations in MLIR |
| CN113031966B (en) | | Deep learning compilation optimization method for intelligently selecting compilation acceleration library |
| US10901715B1 (en) | | Lazy compilation and kernel fusion in dynamic computation graphs |
| CN112183735B (en) | | Method, device and related products for generating operation data |
| CN115423082B (en) | | An Automatic Optimization Method for Deep Model Computation Graphs Related to Hardware Characteristics |
| CN116368494B (en) | | A neural network compilation optimization method and related device |
| Le et al. | | TFLMS: Large model support in TensorFlow by graph rewriting |
| CN114691148B (en) | | Model reasoning acceleration method, device, electronic device and storage medium |
| CN119512557B (en) | | Domain-specific language compiler optimization method and system based on multi-layer intermediate representation |
| CN111563586A (en) | | Splitting method of neural network model and related product |
| CN119990265A (en) | | Large language model performance optimization method, device, computer equipment, readable storage medium and program product |
| US20240256241A1 (en) | | Composable kernels |
| CN111563584B (en) | | Splitting method of neural network model and related product |
| Ivanov et al. | | STen: Productive and efficient sparsity in PyTorch |
| CN120832935A (en) | | Neural network model processing method, device, equipment and storage medium |
| CN120406911A (en) | | Automatic generation method and device for high-performance sparse tensor program |
| JP7624056B2 (en) | | Method, apparatus and electronic device for generating instructions for neural network accelerators |
| CN111563585B (en) | | Splitting method of neural network model and related product |
| Bachiri et al. | | A literature review on combining neural architecture search and compiler optimizations for neural network acceleration |
| CN121209886B (en) | | Deep learning model-oriented graph level compiling optimization method and device |
| Qu et al. | | AutoSparse: A Source-to-Source Format and Schedule Auto-Tuning Framework for Sparse Tensor Program |
| Yao et al. | | Hiperti: high performance system for cross-platform code generation of transformer model inference based on MLIR |
| Long et al. | | FusionStitching: Boosting execution efficiency of memory intensive computations for DL workloads |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |