Disclosure of Invention
The invention provides a dedicated back-end code generation method and device supporting machine learning training, which address the problem that the operating efficiency of a GPU used for neural network training and inference needs to be optimized.
In a first aspect, the present invention provides a dedicated back-end code generation method supporting machine learning training, including:
acquiring a first computation graph, and parsing and optimizing the first computation graph to obtain a second computation graph, wherein the first computation graph is a neural network model formed by at least one operator arranged in a first order;
acquiring at least one operator of the second computation graph, distributing the at least one operator to an operator mapping module based on the operator type of the at least one operator, and outputting an operator calling code corresponding to each operator;
acquiring memory configuration information of the at least one operator, inputting the memory configuration information of the at least one operator into a memory management module, and outputting a memory management code;
generating a target back-end code based on the operator calling code and the memory management code corresponding to each operator, and compiling the target back-end code based on an operator library to obtain a deployment file, wherein the deployment file is used for executing the computing task of the first computation graph.
Optionally, the operator mapping module includes at least one operator code generator and an operator mapper;
The distributing of the at least one operator to the operator mapping module based on the operator type of the at least one operator, and the outputting of the operator calling code corresponding to each operator, specifically include:
acquiring the operator type of the at least one operator based on the operator mapper;
distributing the at least one operator to the operator code generator corresponding to each operator based on the operator type of the at least one operator;
and outputting the operator calling code corresponding to each operator based on the at least one operator and the code generation template class in the operator code generator.
Optionally, before the distributing the at least one operator to the operator mapping module based on the operator type of the at least one operator, the method further includes:
Acquiring a first operator, wherein the first operator is an operator in the second calculation graph;
generating a code generation template class corresponding to the first operator based on the operator type of the first operator and a preset code generation template base class;
and generating and registering an operator code generator corresponding to the first operator based on the code generation template class corresponding to the first operator.
Optionally, the operator calling code includes at least one of an operator function, an input parameter, and a function calling code;
The input parameters include at least one of:
first tensor data passed to the function in the form of a pointer;
second tensor data obtained from the at least one operator based on an ahead-of-time (AOT) compilation method;
an integer vector obtained based on a vector group mechanism.
Optionally, the integer vector is obtained based on a vector group mechanism, specifically including:
inputting a second operator into an operator code generator corresponding to the second operator, and outputting an integer vector, wherein the second operator is an operator of which the input parameters in the second calculation graph comprise the integer vector;
the integer vector is sent to a vector manager, variable names of the integer vector are generated based on the vector manager, and the variable names of the integer vector are sent to an operator code generator corresponding to the second operator;
and outputting the variable name to an operator calling code corresponding to the second operator based on an operator code generator corresponding to the second operator.
Optionally, the memory management code is configured to:
acquiring a memory pool based on the memory configuration information of the at least one operator;
And acquiring the input and output memory requests of the at least one operator, and distributing the memory pool based on the input and output memory requests of the at least one operator.
Optionally, the memory management code is further configured to:
Acquiring memory information corresponding to an input/output memory request of the at least one operator;
And acquiring weight parameters corresponding to the at least one operator based on the memory information, and updating the weight parameters based on a weight parameter updating mechanism.
Optionally, the compiling processing is performed on the target back-end code based on the operator library, which specifically includes:
acquiring a preset subtask division method corresponding to an operator type of a third operator in an operator library, wherein the third operator is a supplementary operator;
Performing subtask division processing on an operator calling code corresponding to the third operator based on the preset subtask division method to obtain at least one subtask;
calculating the thread number of each subtask based on the at least one subtask, and executing the at least one subtask in parallel based on the thread number of each subtask;
and the target back-end code comprises an operator calling code corresponding to the third operator.
In a second aspect, the present invention provides a special back-end code generating apparatus supporting machine learning training, comprising:
the processing unit is used for acquiring a first calculation graph, analyzing and optimizing the first calculation graph to obtain a second calculation graph, wherein the first calculation graph is a neural network model formed by at least one operator according to a first sequence;
An operator calling code generating unit, configured to obtain at least one operator of the second computation graph, distribute the at least one operator to an operator mapping module based on an operator type of the at least one operator, and output an operator calling code corresponding to each operator;
The memory management code generation unit is used for acquiring the memory configuration information of the at least one operator, inputting the memory configuration information of the at least one operator into the memory management module and outputting a memory management code;
the compiling unit is used for generating a target back-end code based on the operator calling code and the memory management code corresponding to each operator, compiling the target back-end code based on an operator library to obtain a deployment file, and the deployment file is used for executing the computing task of the first computing graph.
In a third aspect, the present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method for generating a dedicated back-end code supporting machine learning training as described in the first aspect when the program is executed.
In a fourth aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the dedicated back-end code generation method supporting machine learning training as described in the first aspect.
In a fifth aspect, the present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the dedicated back-end code generation method supporting machine learning training as described in the first aspect.
According to the dedicated back-end code generation method and device supporting machine learning training provided by the invention, the first computation graph is parsed and optimized to obtain the second computation graph, which improves the computation efficiency of the computation graph. The target back-end code is generated from the operator calling code that the operator mapping module generates for each operator and the memory management code that the memory management module generates, and the target back-end code is compiled based on the operator library to obtain the deployment file.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
To address the problem that the operating efficiency of a GPU used for neural network training and inference needs to be optimized, an embodiment of the invention provides a dedicated back-end code generation method supporting machine learning training. FIG. 1 is a first flowchart of the dedicated back-end code generation method supporting machine learning training provided by an embodiment of the invention. As shown in FIG. 1, the method comprises the following steps:
Step 100, a first computation graph is acquired, and the first computation graph is parsed and optimized to obtain a second computation graph.
It should be noted that the method provided by the embodiment of the invention implements an end-to-end compiler that supports neural network training and targets the GPU, which improves the operating efficiency of the GPU in neural network training and inference and expands the application range of the GPU.
An operator represents a single operation or computation, such as a convolution or a matrix product.
Optionally, the first computational graph is a neural network model composed of at least one operator in a first order.
It should be noted that the first computation graph is a graph-level structure describing the entire computing task, and the computing task consists of executing the operators in the first order of the first computation graph. The first order is a freely chosen topological order; arranging the same operators in different topological orders yields different computation graphs.
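As an illustration of how a topological order fixes the execution sequence of the operators, the following minimal Python sketch derives one such order for a small graph. The graph representation (a dict mapping each operator name to the operators whose outputs it consumes) and the operator names are assumptions made for this example and are not taken from the embodiment.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each operator maps to the set of operators it depends on.
graph = {
    "conv": set(),
    "bias_add": {"conv"},
    "relu": {"bias_add"},
    "matmul": {"relu"},
}


def execution_order(g):
    """Return one topological order of the operators in g."""
    return list(TopologicalSorter(g).static_order())


order = execution_order(graph)
print(order)
```

Every operator appears after all operators it depends on, so a main function that calls the operators in this order never reads a result that has not yet been computed.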
It can be appreciated that the operator provided by the embodiments of the present invention can support various general mathematical operations and prediction and training of various neural network models.
Optionally, the parsing and optimization process includes constant folding or operator fusion.
In one embodiment, the first computational graph is subjected to constant folding or operator fusion processing to obtain a second computational graph.
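The constant folding mentioned above can be sketched in a few lines of Python. The node classes `Const`, `Var` and `BinOp` and the function `fold_constants` are hypothetical names introduced only for this illustration; the embodiment relies on the TVM front end for this processing, and this sketch does not reproduce TVM's implementation.

```python
from dataclasses import dataclass
from typing import Union
import operator


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str  # "add" or "mul" in this sketch
    lhs: "Node"
    rhs: "Node"


Node = Union[Const, Var, BinOp]

_FUNCS = {"add": operator.add, "mul": operator.mul}


def fold_constants(node):
    """Recursively replace operators whose inputs are all constants."""
    if isinstance(node, BinOp):
        lhs = fold_constants(node.lhs)
        rhs = fold_constants(node.rhs)
        if isinstance(lhs, Const) and isinstance(rhs, Const):
            # Both inputs are known at compile time: evaluate now.
            return Const(_FUNCS[node.op](lhs.value, rhs.value))
        return BinOp(node.op, lhs, rhs)
    return node


# x * (2 + 3) is rewritten to x * 5 before any code is generated.
first_graph = BinOp("mul", Var("x"), BinOp("add", Const(2.0), Const(3.0)))
second_graph = fold_constants(first_graph)
```

The folded graph contains one operator instead of two, so the generated back-end code has one fewer operator call to execute at run time.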
Step 101, obtaining at least one operator of a second computational graph, distributing the at least one operator to an operator mapping module based on the operator type of the at least one operator, and outputting an operator calling code corresponding to each operator.
It should be noted that the dedicated back-end code generation method supporting machine learning training provided by the embodiment of the invention is applied to a compilation system built on the TVM framework. The front end of the system is the existing open-source neural network compilation framework TVM, the back end of the system includes a code generation module for generating the target back-end code, and the whole system runs on an electronic device equipped with a GPU.
Optionally, the code generation module comprises a computation graph traversing module, an operator mapping module, a memory management module and a code output module.
The computation graph traversing module is used for traversing the second computation graph to obtain at least one operator of the second computation graph and obtain memory configuration information of the at least one operator.
The operator mapping module is used for generating an operator calling code corresponding to each operator based on at least one operator of the second computational graph, wherein the operator mapping module comprises an operator mapper and a code generator.
The memory management module is used for generating a memory management code based on the memory configuration information of at least one operator of the second computational graph.
The code output module is used for generating and outputting a target back-end code based on the operator calling code and the memory management code corresponding to each operator.
It should be noted that the operator mapping module provided in the embodiment of the invention classifies the at least one operator by operator type, which simplifies the design and maintenance of the code generators. The module adopts an ahead-of-time (AOT) compilation method: before generating the operator calling code, it traverses the computation graph and collects the input and output shapes and the memory information of the at least one operator in the computation graph. This reduces the computation overhead of reshape operators and allows some fixed array parameters to be composed directly into vectors that serve as operator input data.
Optionally, the operator types include an Element-wise operator, a Reducing operator, and a Broadcasting operator.
The Element-wise operator performs the same operation at each position of the input tensor data, for example a unary operator (not or negate), a binary operator (add, subtract, multiply or divide), or a ternary operator (if a then b else c).
The Reducing operator applies a preset function to the elements at specific positions of the input tensor data and outputs the result.
The Broadcasting operator rearranges the input tensor data and stores it at the appropriate positions of the output tensor data, for example broadcast or slice.
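To make the three operator types concrete, the following Python sketch gives one toy example of each, operating on plain lists rather than GPU tensors. The function names are hypothetical and chosen only for this illustration.

```python
from functools import reduce


def elementwise_add(a, b):
    """Element-wise: the same operation is applied at every position."""
    return [x + y for x, y in zip(a, b)]


def reduce_sum(a):
    """Reducing: a preset function combines the selected elements into one result."""
    return reduce(lambda acc, x: acc + x, a, 0)


def broadcast_rows(a, n):
    """Broadcasting: the input is copied to n row positions of the output."""
    return [list(a) for _ in range(n)]


def slice_range(a, begin, end):
    """Broadcasting (slice): a sub-range of the input is placed in the output."""
    return a[begin:end]


print(elementwise_add([1, 2, 3], [10, 20, 30]))
print(reduce_sum([1, 2, 3, 4]))
print(broadcast_rows([1, 2], 3))
print(slice_range([5, 6, 7, 8], 1, 3))
```

Operators of the same type share their loop structure and differ only in the inner operation, which is the property that lets one code generator serve a whole operator type.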
Wherein the operator invoking code is for performing a computational task of the operator.
In one embodiment, the computation graph traversing module traverses the second computation graph to obtain the at least one operator of the second computation graph, the at least one operator is distributed to the operator mapping module based on its operator type, and the operator calling code corresponding to each operator is output, wherein the operator calling code is used for executing the computing task of the operator.
Step 102, obtaining the memory configuration information of the at least one operator, inputting the memory configuration information of the at least one operator into a memory management module, and outputting a memory management code.
It can be understood that the input data and the output result of each operator are stored in the memory of the chip, so executing the computing task of the computation graph incurs a large data transmission overhead between the host and the chip. The embodiment of the invention therefore also provides a memory management module that generates the memory management code based on the memory configuration information of the at least one operator. Based on a memory pool and newly extended operators, this provides a complete and efficient memory management mechanism and greatly reduces the overhead of memory allocation, release and data transmission.
Optionally, the memory configuration information is used for recording the input data memory information and the output data memory information of at least one operator.
The input data of an operator is a variable node, and the output result of an operator is an intermediate result; a weight parameter, for example, may be either the input data or the output result of an operator.
The memory management code is used for realizing memory allocation, memory copying or memory release of the input data memory information and the output result memory information of at least one operator.
In one embodiment, the second computation graph is input into the computation graph traversing module, the computation graph traversing module traverses the second computation graph to obtain the input data memory information and the output data memory information of at least one operator, the input data memory information and the output data memory information are recorded into the memory configuration information, the memory configuration information is input into the memory management module, and the memory management code is output.
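The traversal described above can be sketched as follows. The `OpNode` record, the field names of the resulting configuration, and the 4-byte element size (float32) are assumptions made for this example; the embodiment does not specify the layout of the memory configuration information.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class OpNode:
    name: str
    input_shapes: list      # one shape tuple per input
    output_shape: tuple
    dtype_bytes: int = 4    # assumption: float32 elements


def collect_memory_config(nodes):
    """Traverse the operators and record input/output memory sizes in bytes."""
    config = {}
    for node in nodes:
        config[node.name] = {
            "input_bytes": [prod(s) * node.dtype_bytes for s in node.input_shapes],
            "output_bytes": prod(node.output_shape) * node.dtype_bytes,
        }
    return config


# Hypothetical two-operator graph, already in execution order.
nodes = [
    OpNode("dense", [(1, 4), (4, 8)], (1, 8)),
    OpNode("relu", [(1, 8)], (1, 8)),
]
memory_config = collect_memory_config(nodes)
```

Because all shapes are known before code generation, these sizes can be computed once at compile time and handed to the memory management module.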
Step 103, generating a target back-end code based on the operator calling code and the memory management code corresponding to each operator, and compiling the target back-end code based on an operator library to obtain a deployment file, wherein the deployment file is used for executing the computing task of the first computation graph.
Optionally, the target back-end code includes a computation graph main function, at least one operator function, at least one global variable, header declarations, and a wrapper function. For example, the target back-end code may be implemented in the C++ language.
The computation graph main function directs the second computation graph to execute its computing task in the first order. It contains the input and output memory requests of the at least one operator and the execution flow of the second computation graph, specifically memory allocation, memory copying, input parameters, function call code and memory release.
It can be understood that the memory management code comprises memory allocation, memory copying and memory release, which are all generated by the memory management module, and the operator calling code comprises input parameters, function calling codes and operator functions, which are all generated by the operator mapping module.
Wherein the memory copying includes performing memory copying from the host to the chip and performing memory copying from the chip to the host.
The global variable is mainly used in the computation graph main function; it controls the operating logic of the weight parameter updating mechanism and assists the memory management module in memory allocation, memory copying and memory release.
The operator functions are called in sequence by the computation graph main function through the function interface of each operator type, and execute the computing task of the second computation graph through the corresponding operation logic.
The header declarations and the wrapper function provide underlying support; in particular, the wrapper function calls the computation graph main function.
Optionally, the operator library comprises a compilation method of at least one operator of the second computational graph.
It may be understood that, in the embodiment of the invention, the operator calling code corresponding to each operator is combined with the memory management code to obtain the target back-end code. After the target back-end code is compiled and linked against the operator library, a deployment file is obtained, which can run directly on the chip and is used for executing the computing task of the first computation graph.
According to the dedicated back-end code generation method supporting machine learning training provided by the embodiment of the invention, the first computation graph is parsed and optimized to obtain the second computation graph, which improves the computation efficiency of the computation graph. The target back-end code is generated from the operator calling code that the operator mapping module generates for each operator and the memory management code that the memory management module generates, and is compiled based on the operator library to obtain a deployment file used for executing the computing task of the first computation graph. This implements an end-to-end compiler supporting neural network training, improves the operating efficiency of the GPU in neural network training and inference, and expands the application range of the GPU.
FIG. 2 is a schematic structural diagram of the target back-end code according to an embodiment of the invention. As shown in FIG. 2, the target back-end code includes header declarations, global variables, operator functions, a computation graph main function and a wrapper function. The computation graph main function includes memory allocation, memory copying, input parameters, function call code and memory release, and the memory copying is divided into copying from the host to the electronic device with the GPU and copying from the electronic device with the GPU to the host.
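The layout of the target back-end code can be sketched as a small emitter that concatenates the five sections in order. The function `emit_backend_source`, the emitted C++ identifiers (`graph_main`, `run`, `relu`, `weights_initialized`) and the placeholder comments in the main function body are all hypothetical; the sketch only produces text and does not compile it.

```python
def emit_backend_source(headers, global_vars, operator_functions, main_body, wrapper):
    """Assemble the sections of the target back-end code in a fixed order."""
    main_function = (
        "void graph_main() {\n"
        + "\n".join("  " + line for line in main_body)
        + "\n}"
    )
    sections = [
        "\n".join(headers),                 # header declarations
        "\n".join(global_vars),             # global variables
        "\n\n".join(operator_functions),    # operator functions
        main_function,                      # computation graph main function
        wrapper,                            # wrapper function
    ]
    return "\n\n".join(sections) + "\n"


source = emit_backend_source(
    headers=["#include <cstddef>"],
    global_vars=["static bool weights_initialized = false;"],
    operator_functions=["void relu(const float* in, float* out, std::size_t n);"],
    main_body=[
        "// memory allocation",
        "// memory copy: host to device",
        "// input parameters and function call code",
        "// memory copy: device to host",
        "// memory release",
    ],
    wrapper='extern "C" void run() { graph_main(); }',
)
print(source)
```

Keeping each section as a separate list lets the operator mapping module and the memory management module contribute their parts independently before the code output module joins them.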
Based on the foregoing embodiment, the operator mapping module includes at least one operator code generator and an operator mapper;
The distributing of the at least one operator to the operator mapping module based on the operator type of the at least one operator, and the outputting of the operator calling code corresponding to each operator, specifically include:
acquiring the operator type of the at least one operator based on the operator mapper;
distributing the at least one operator to the operator code generator corresponding to each operator based on the operator type of the at least one operator;
and outputting the operator calling code corresponding to each operator based on the at least one operator and the code generation template class in the operator code generator.
It should be noted that, the operator mapping module provided in the embodiment of the present invention includes at least one operator code generator and one operator mapper, where each operator code generator and each operator form a corresponding relationship.
It can be understood that a computation graph contains many operator types, which lowers design efficiency and raises the maintenance cost of the code generator corresponding to each operator. To address this, the embodiment of the invention provides an operator mapper that classifies operators according to their computational characteristics in the computation graph, for example as Element-wise, Reducing or Broadcasting operators. Using the operator mapper improves the design efficiency of the code generators, reduces their maintenance workload, and improves the extensibility of the operators.
Optionally, the operator mapper is configured to obtain an operator type of at least one operator of the second computational graph, and distribute the at least one operator to an operator code generator corresponding to each operator based on the operator type.
It can be understood that the operator code generator provided by the embodiment of the invention comprises a code generation template class corresponding to an operator, and the code generation template class generates an operator calling code corresponding to the operator according to specific parameters of the operator.
According to the special back-end code generation method supporting machine learning training, at least one operator is distributed to the operator code generator corresponding to each operator based on the operator mapper, the operator calling code corresponding to each operator is generated according to the code generation template class in the operator code generator, the design efficiency of the code generator can be improved, the maintenance workload of the code generator can be reduced, and the expansibility of the operators can be improved.
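The dispatch performed by the operator mapper can be sketched as a lookup from operator type to code generator. The class names, the dict-based operator description and the format of the emitted call strings are assumptions made for this illustration, not the interfaces of the embodiment.

```python
class ElementwiseGenerator:
    """Hypothetical code generator for Element-wise operators."""

    def generate(self, op):
        args = ", ".join(op["inputs"])
        return f"elementwise_{op['name']}({args}, {op['output']}, {op['size']});"


class ReducingGenerator:
    """Hypothetical code generator for Reducing operators."""

    def generate(self, op):
        args = ", ".join(op["inputs"])
        return f"reduce_{op['name']}({args}, {op['output']}, {op['size']});"


class OperatorMapper:
    """Maps each operator to the generator registered for its operator type."""

    def __init__(self):
        self._generators = {}

    def register(self, op_type, generator):
        self._generators[op_type] = generator

    def dispatch(self, op):
        try:
            generator = self._generators[op["type"]]
        except KeyError:
            raise ValueError(f"no code generator for operator type {op['type']!r}")
        return generator.generate(op)


mapper = OperatorMapper()
mapper.register("elementwise", ElementwiseGenerator())
mapper.register("reducing", ReducingGenerator())

ops = [
    {"type": "elementwise", "name": "add", "inputs": ["a", "b"], "output": "c", "size": 1024},
    {"type": "reducing", "name": "sum", "inputs": ["c"], "output": "d", "size": 1024},
]
calls = [mapper.dispatch(op) for op in ops]
```

With this structure, supporting another operator of an existing type needs no new generator, and an unsupported type fails with an explicit error instead of emitting incorrect code.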
Based on the foregoing embodiment, before the distributing the at least one operator to the operator mapping module based on the operator type of the at least one operator, the method further includes:
Acquiring a first operator, wherein the first operator is an operator in the second calculation graph;
generating a code generation template class corresponding to the first operator based on the operator type of the first operator and a preset code generation template base class;
and generating and registering an operator code generator corresponding to the first operator based on the code generation template class corresponding to the first operator.
It should be noted that when a new operator is added, the operator mapping module does not yet include a code generator corresponding to the newly extended operator, so the existing operator mapping module cannot generate the operator calling code for it.
The first operator is a newly extended operator in the second computation graph.
The operator mapping module includes the preset code generation template base class.
In one embodiment, a first operator is obtained, a preset code generation template base class is expanded based on an operator type of the first operator, a code generation template class corresponding to the first operator is generated, an operator code generator is defined through a general tool function or a macro definition method based on the code generation template class corresponding to the first operator, and the operator code generator is registered in an operator mapping module, wherein the operator code generator corresponds to the first operator.
According to the special back-end code generation method supporting machine learning training, the code generation template class corresponding to the first operator is generated according to the operator type of the first operator and the preset code generation template base class, and registration of the operator code generator is achieved based on the code generation template class, expansion of the code generator is achieved, expansibility of the operator is improved, and the application range of the GPU is further expanded.
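The extension flow above (derive a template class from the preset base class, then register a generator for it) can be sketched as follows. The base class `CodeGenTemplateBase`, the registry dict and the `register_generator` decorator are hypothetical; the decorator merely stands in for the "general tool function or macro definition method" mentioned in the embodiment.

```python
class CodeGenTemplateBase:
    """Hypothetical preset code generation template base class."""

    op_type = None

    def function_name(self, op):
        raise NotImplementedError

    def generate(self, op):
        # Shared logic: every generated call lists the inputs, then the output.
        args = ", ".join(op["inputs"] + [op["output"]])
        return f"{self.function_name(op)}({args});"


GENERATOR_REGISTRY = {}


def register_generator(cls):
    """Create a generator from the template class and register it by operator type."""
    GENERATOR_REGISTRY[cls.op_type] = cls()
    return cls


@register_generator
class BroadcastingTemplate(CodeGenTemplateBase):
    """Template class for a newly extended operator type."""

    op_type = "broadcasting"

    def function_name(self, op):
        return "broadcast_" + op["name"]


call = GENERATOR_REGISTRY["broadcasting"].generate(
    {"name": "slice", "inputs": ["x"], "output": "y"}
)
print(call)
```

Only the type-specific part (here the function name) is written for the new operator; argument formatting is inherited from the base class, which is what keeps the cost of each extension low.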
FIG. 3 is a second flowchart of a method for generating a dedicated back-end code supporting machine learning training according to an embodiment of the present invention. As shown in fig. 3, the specific flow of the method includes:
The computation graph is input into the computation graph traversing module, which outputs at least one operator node of the computation graph. The operator mapper maps the at least one operator to the operator code generator corresponding to each operator, and the operator code generator generates the operator calling code corresponding to each operator, wherein the operator calling code comprises input parameters, operator functions and function calling code.
The computation graph is also input into the computation graph traversing module to output the memory information of the variables and intermediate results, from which the memory management module generates the memory management code.
The operator calling code corresponding to each operator and the memory management code are input to the code output module, which outputs the target back-end code.
Based on the content of the above embodiment, the operator calling code includes at least one of an operator function, an input parameter, and a function calling code;
The input parameters include at least one of:
first tensor data passed to the function in the form of a pointer;
second tensor data obtained from the at least one operator based on an ahead-of-time (AOT) compilation method;
an integer vector obtained based on a vector group mechanism.
It can be appreciated that the operator mapping module generates an operator function corresponding to each operator type based on the operator type of at least one operator, generates a function call code corresponding to each operator in the main function for at least one operator of the second computational graph, and generates input parameters in the main function for a fixed array parameter that requires additional definition.
Optionally, the second tensor data is a tensor size, typically represented by a single number.
It should be noted that the embodiment of the invention adopts an AOT compilation method: in the compilation stage, the second tensor data is obtained from the at least one operator and supplied to the function calling code.
Based on the content of the above embodiment, the integer vector is obtained based on a vector group mechanism, and specifically includes:
inputting a second operator into an operator code generator corresponding to the second operator, and outputting an integer vector, wherein the second operator is an operator of which the input parameters in the second calculation graph comprise the integer vector;
the integer vector is sent to a vector manager, variable names of the integer vector are generated based on the vector manager, and the variable names of the integer vector are sent to an operator code generator corresponding to the second operator;
and outputting the variable name to an operator calling code corresponding to the second operator based on an operator code generator corresponding to the second operator.
It should be noted that when the input parameters include an integer vector, the integer vector is obtained in the compilation stage but is used by the function call code in the running stage. The embodiment of the invention therefore provides a vector group mechanism.
In one implementation of the vector group mechanism provided by the embodiment of the invention, a vector manager is maintained in the compilation stage. When the operator code generator corresponding to a second operator processes the second operator and an input parameter of the second operator is an integer vector, the operator code generator generates the integer vector of the second operator and sends it to the vector manager.
Further, the vector manager stores the information of the integer vector of the second operator, assigns a variable name to the integer vector, and returns the variable name to the operator code generator, which outputs the operator calling code corresponding to the second operator based on the variable name of the integer vector.
While the code output module outputs the operator calling code corresponding to the second operator, it reads the information of the integer vector recorded in the vector manager and outputs the integer vector.
According to the special back-end code generation method supporting machine learning training, which is provided by the embodiment of the invention, the information of integer vectors is stored based on the vector manager, the variable names of the integer vectors are generated, and the operator code generator outputs operator calling codes based on the variable names of the integer vectors, so that the functions of acquiring the integer vectors in a compiling stage and using the integer vectors in an operating stage are realized.
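The vector group mechanism can be sketched as a small manager that stores each integer vector, hands back a variable name, and later emits the definitions. The class `VectorManager`, the `int_vec_` naming scheme, the `transpose` call and the emitted C++ array syntax are assumptions made for this illustration.

```python
class VectorManager:
    """Stores integer vectors at compile time and assigns each a variable name."""

    def __init__(self, prefix="int_vec_"):
        self._prefix = prefix
        self._vectors = {}  # variable name -> tuple of ints, in insertion order

    def add(self, values):
        name = f"{self._prefix}{len(self._vectors)}"
        self._vectors[name] = tuple(int(v) for v in values)
        return name

    def emit_definitions(self):
        """Return one array definition line per stored vector."""
        lines = []
        for name, values in self._vectors.items():
            body = ", ".join(str(v) for v in values)
            lines.append(f"const int {name}[{len(values)}] = {{{body}}};")
        return lines


manager = VectorManager()

# Compilation stage: the operator code generator sends the vector and gets a name back.
axes_name = manager.add([0, 2, 1])

# The generator embeds only the name in the operator calling code.
call_code = f"transpose(in, out, {axes_name});"

# The code output module reads the stored vectors and emits their definitions.
definitions = manager.emit_definitions()
```

The operator calling code refers to the vector only by name, so the values collected at compile time become ordinary variables of the generated program and are available in the running stage.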
Based on the content of the above embodiment, the memory management code is configured to:
acquiring a memory pool based on the memory configuration information of the at least one operator;
And acquiring the input and output memory requests of the at least one operator, and distributing the memory pool based on the input and output memory requests of the at least one operator.
It should be noted that, when the calculation task of the calculation graph is executed in the operation stage, an overhead within the calculation graph arises, namely the memory required for executing the calculation graph. In the prior art, executing the calculation task of the calculation graph by having the TVM framework call the chip operator library produces redundant data operations, which reduces the calculation efficiency.
Optionally, the input-output memory request is a request for the memory size required for the input data and the output result.
In the operation stage, the memory management code obtains, based on the memory configuration information of the at least one operator, the total memory required by the input data and the output results of the at least one operator, and applies to the chip for a memory pool of that total size. When an input/output memory request of the at least one operator is received, the memory corresponding to that request is allocated from the memory pool.
According to the special back-end code generation method supporting machine learning training, the memory management codes generated by the memory management module are used for realizing memory allocation, so that redundant operation can be reduced, the operation efficiency of executing a calculation graph is improved, and the overhead in the calculation graph is effectively reduced.
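The memory pool behaviour can be sketched as follows. This is only an illustration under stated assumptions: the pool is modelled as a range of byte offsets, regions are handed out sequentially without reuse or alignment, and the names `MemoryPool`, `build_pool` and the `(name, input bytes, output bytes)` configuration format are hypothetical.

```python
class MemoryPool:
    """One contiguous buffer, requested once, carved up by offset (sketch)."""

    def __init__(self, total_bytes):
        self.total_bytes = total_bytes
        self._next_offset = 0

    def allocate(self, size_bytes):
        """Return the offset of a region of size_bytes inside the pool."""
        if self._next_offset + size_bytes > self.total_bytes:
            raise MemoryError("input/output request exceeds the memory pool")
        offset = self._next_offset
        self._next_offset += size_bytes
        return offset


def build_pool(memory_config):
    """memory_config: list of (operator name, input bytes, output bytes)."""
    # A single request to the chip covers the total of all inputs and outputs.
    total = sum(in_bytes + out_bytes for _, in_bytes, out_bytes in memory_config)
    pool = MemoryPool(total)
    layout = {}
    for name, in_bytes, out_bytes in memory_config:
        layout[name] = {
            "input": pool.allocate(in_bytes),
            "output": pool.allocate(out_bytes),
        }
    return pool, layout


config = [("conv", 1024, 2048), ("relu", 2048, 2048)]
pool, layout = build_pool(config)
```

Because the chip is asked for memory only once, each later input/output request reduces to an offset calculation inside the pool.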
Based on the content of the above embodiment, the memory management code is further configured to:
Acquiring memory information corresponding to an input/output memory request of the at least one operator;
And acquiring weight parameters corresponding to the at least one operator based on the memory information, and updating the weight parameters based on a weight parameter updating mechanism.
It should be noted that executing the computing task of the computation graph in the operation stage also incurs overhead between computation graphs, namely the data overhead between iteratively executed computation graphs. The memory management code generated by the memory management module in the embodiment of the invention can therefore update the weight parameters in the chip memory based on the weight update mechanism, which reduces the chip-to-host and host-to-chip data communication overhead and improves the operation efficiency of the computation graph.
It can be understood that, when executing neural network training (the second calculation graph), the number of training iterations and the iteration interval at which results are output need to be determined in order to improve training efficiency. A command operator is therefore set in the memory management code of the embodiment of the invention; it controls and executes the operation logic of the neural network training and executes the weight parameter updating mechanism, thereby updating the weight parameters and effectively reducing the overhead of the whole training process.
The weight parameter updating mechanism runs on the chip. Specifically, an update operator acquires the memory information corresponding to the weight matrix to be updated, generates a tensor program corresponding to the weight matrix, updates the weight parameters based on the tensor program, and stores the updated weight parameters in the memory corresponding to the weight matrix.
According to the special back-end code generation method supporting machine learning training, the memory management code generated by the memory management module is used for updating the weight parameters, so that memory copying is realized, the overhead between calculation graphs is reduced, and the operation efficiency of the calculation graphs is improved.
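A minimal sketch of the command operator and the in-place weight update is given below. The embodiment does not specify the update rule, so plain gradient descent on a toy loss is assumed here; chip memory is modelled as a Python list, and the names `update_weights_in_place`, `run_training`, `output_interval` and `learning_rate` are hypothetical.

```python
def update_weights_in_place(chip_memory, offset, gradients, learning_rate):
    """Update operator: overwrite the weights stored at `offset` in chip memory."""
    for i, grad in enumerate(gradients):
        chip_memory[offset + i] -= learning_rate * grad


def run_training(chip_memory, weight_offset, weight_count,
                 iterations, output_interval, learning_rate=0.5):
    """Command operator: drive the iterations, copying results out only periodically."""
    host_snapshots = []
    for step in range(1, iterations + 1):
        weights = chip_memory[weight_offset:weight_offset + weight_count]
        # Assumed toy loss sum(w^2) / 2, whose gradient is w itself.
        gradients = list(weights)
        update_weights_in_place(chip_memory, weight_offset, gradients, learning_rate)
        if step % output_interval == 0:
            # Only at the output interval is data copied back to the host.
            host_snapshots.append(
                (step, chip_memory[weight_offset:weight_offset + weight_count]))
    return host_snapshots


chip_memory = [0.0, 8.0, -4.0, 0.0]
snapshots = run_training(chip_memory, weight_offset=1, weight_count=2,
                         iterations=4, output_interval=2)
```

In the sketch, four iterations produce only two transfers to the host, and the weights never leave their region of chip memory between iterations.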
Based on the content of the foregoing embodiment, compiling the target backend code based on the operator library specifically includes:
acquiring a preset subtask division method corresponding to an operator type of a third operator in an operator library, wherein the third operator is a supplementary operator;
Performing subtask division processing on an operator calling code corresponding to the third operator based on the preset subtask division method to obtain at least one subtask;
calculating the thread number of each subtask based on the at least one subtask, and executing the at least one subtask in parallel based on the thread number of each subtask;
and the target back-end code comprises an operator calling code corresponding to the third operator.
It should be noted that the operator library on the chip contains compiling methods only for existing operators, so newly extended operators (i.e. supplementary operators) cannot be compiled. The embodiment of the invention therefore provides a compiling method for supplementary operators, which extends the operator library and improves the efficiency of the operator compiling process.
In one embodiment, a subtask division method is preset for each operator type. For example, if the operator is of the Element-wise type, the preset subtask division method has each thread process one contiguous input data region; if the operator is a Rotate function, each thread processes at least one contiguous matrix; and if the operator is a group convolution function, each thread processes at least one contiguous group.
It may be understood that the compiling method for the third operator provided by the embodiment of the present invention implements an entry kernel function and an external interface function separately. In the external interface function, the operator calling code of the third operator is divided into subtasks to obtain at least one subtask corresponding to the third operator, and the at least one subtask is sent to the entry kernel function. In the entry kernel function, the thread number of each subtask is calculated and the at least one subtask corresponding to the third operator is executed in parallel according to the thread number of each subtask; at least one operation result is obtained and integrated, and the integrated operation result is sent to the external interface function.
According to the special back-end code generation method supporting machine learning training, sub-task division processing is carried out on the operator calling codes corresponding to the third operator based on the preset sub-task division method corresponding to the operator type, at least one sub-task is obtained, at least one sub-task is executed in parallel based on the thread number of each sub-task, compiling processing of the third operator is achieved, expansion of an operator library is achieved, and efficiency of a compiling process of the operator is improved.
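For an Element-wise operator, the split between the external interface function and the entry kernel function can be sketched as follows. This is a simplified host-side illustration, not chip code: Python threads stand in for chip threads, a single thread count is computed for the whole set of subtasks, and the names `divide_elementwise`, `entry_kernel`, `external_interface`, `chunk_size` and `max_threads` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def divide_elementwise(length, chunk_size):
    """Preset division for Element-wise operators: contiguous input regions."""
    return [(start, min(start + chunk_size, length))
            for start in range(0, length, chunk_size)]


def entry_kernel(data, subtasks, elementwise_fn, max_threads):
    """Compute the thread count, run the subtasks in parallel, merge the results."""
    thread_count = max(1, min(max_threads, len(subtasks)))

    def run(bounds):
        lo, hi = bounds
        return [elementwise_fn(x) for x in data[lo:hi]]

    with ThreadPoolExecutor(max_workers=thread_count) as executor:
        # map() preserves subtask order, so the merged result keeps input order.
        parts = list(executor.map(run, subtasks))
    return [value for part in parts for value in part]


def external_interface(data, elementwise_fn, chunk_size=4, max_threads=4):
    """Divide the work into subtasks and hand them to the entry kernel."""
    subtasks = divide_elementwise(len(data), chunk_size)
    return entry_kernel(data, subtasks, elementwise_fn, max_threads)


result = external_interface(list(range(10)), lambda x: x * x)
```

A Rotate or group convolution operator would swap in a different division function (contiguous matrices or contiguous groups) while the entry kernel structure stays the same.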
Fig. 4 is a schematic flow chart of compiling operator calling codes corresponding to supplementary operators according to an embodiment of the present invention. As shown in fig. 4, the method comprises the steps of:
Step 400, the host obtains at least one supplementary operator and determines the operator type of the at least one supplementary operator;
Step 401, the host performs subtask division processing on the operator calling code corresponding to each supplementary operator according to the preset subtask division method corresponding to the operator type of the at least one supplementary operator, to obtain at least one subtask corresponding to each supplementary operator;
Step 402, the electronic device acquires the at least one subtask corresponding to each supplementary operator and calculates the thread number of each subtask;
Step 403, the electronic device executes the at least one subtask corresponding to each supplementary operator in parallel based on the thread number of each subtask, and obtains and integrates the calculation results;
Step 404, the host obtains the integrated calculation result sent by the electronic device.
The special back-end code generating device supporting machine learning training provided by the invention is described below. The device described below and the special back-end code generation method supporting machine learning training described above may be referred to in correspondence with each other.
Fig. 5 is a schematic structural diagram of a special back-end code generating device supporting machine learning training according to an embodiment of the present invention. As shown in fig. 5, the dedicated back-end code generating apparatus supporting machine learning training includes a processing unit 500, an operator calling code generating unit 510, a memory management code generating unit 520, and a compiling unit 530, wherein,
The processing unit 500 is configured to obtain a first computational graph, analyze and optimize the first computational graph to obtain a second computational graph, where the first computational graph is a neural network model formed by at least one operator according to a first order;
An operator calling code generating unit 510, configured to obtain at least one operator of the second computation graph, distribute the at least one operator to the operator mapping module based on an operator type of the at least one operator, and output an operator calling code corresponding to each operator;
The memory management code generating unit 520 is configured to obtain memory configuration information of the at least one operator, input the memory configuration information of the at least one operator into the memory management module, and output a memory management code;
The compiling unit 530 is configured to generate a target back-end code based on the operator calling code and the memory management code corresponding to each operator, and compile the target back-end code based on an operator library to obtain a deployment file, where the deployment file is used to execute the computing task of the first computation graph.
According to the special back-end code generation device supporting machine learning training, the second computation graph is obtained by analyzing and optimizing the first computation graph, which improves the computation efficiency of the computation graph. The target back-end code is generated based on the operator calling code corresponding to each operator generated by the operator mapping module and the memory management code generated by the memory management module, and is then compiled based on the operator library to obtain a deployment file, the deployment file being used for executing the computation task of the first computation graph. An end-to-end compiler supporting neural network training is thereby realized, the computation efficiency of the GPU applied to neural network training and inference is improved, and the application range of the GPU is expanded.
Optionally, the operator mapping module includes at least one operator code generator and an operator mapper;
The operator type based on the at least one operator distributes the at least one operator to an operator mapping module, and outputs an operator calling code corresponding to each operator, which specifically includes:
Acquiring an operator type of at least one operator based on the operator mapper;
distributing the at least one operator to an operator code generator corresponding to each operator based on the operator type of the at least one operator;
And generating a template class based on the at least one operator and codes in the operator code generator, and outputting an operator calling code corresponding to each operator.
Optionally, before the distributing the at least one operator to the operator mapping module based on the operator type of the at least one operator, the method further includes:
Acquiring a first operator, wherein the first operator is an operator in the second calculation graph;
generating a code generation template class corresponding to the first operator based on the operator type of the first operator and a preset code generation template base class;
and generating and registering an operator code generator corresponding to the first operator based on the code generation template class corresponding to the first operator.
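The registration and dispatch described above can be sketched as follows. This is a minimal illustration under stated assumptions: operators are represented as dictionaries, the generated calling code is C-like text, and the names `CodeGenTemplateBase`, `OperatorMapper`, `lib_conv2d` and `lib_relu` are hypothetical and do not refer to a real operator library.

```python
class CodeGenTemplateBase:
    """Preset code generation template base class (hypothetical)."""

    function_name = None

    def generate(self, op):
        """Emit a call that passes input and output tensors in order."""
        args = ", ".join(op["inputs"] + op["outputs"])
        return f"{self.function_name}({args});"


class OperatorMapper:
    """Looks up the operator type and forwards to the registered generator."""

    def __init__(self):
        self._generators = {}

    def register(self, op_type, function_name):
        """Derive a template class for op_type and register a generator built from it."""
        template_cls = type(f"{op_type.title()}CodeGen", (CodeGenTemplateBase,),
                            {"function_name": function_name})
        self._generators[op_type] = template_cls()

    def emit(self, op):
        generator = self._generators.get(op["type"])
        if generator is None:
            raise ValueError(f"no code generator registered for {op['type']!r}")
        return generator.generate(op)


mapper = OperatorMapper()
mapper.register("conv2d", "lib_conv2d")
mapper.register("relu", "lib_relu")

graph = [
    {"type": "conv2d", "inputs": ["x", "w"], "outputs": ["y"]},
    {"type": "relu", "inputs": ["y"], "outputs": ["z"]},
]
calls = [mapper.emit(op) for op in graph]
```

Deriving each generator from a shared base class keeps the per-operator code small: in this sketch only the target function name differs between operator types.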
The operator calling code comprises at least one of an operator function, an input parameter and a function calling code;
The input parameters include at least one of:
first tensor data passed into the function in the form of a pointer;
second tensor data obtained from the at least one operator based on an ahead-of-time (AOT) compilation method;
an integer vector obtained based on a vector group mechanism.
Obtaining the integer vector based on the vector group mechanism specifically includes:
inputting a second operator into an operator code generator corresponding to the second operator, and outputting an integer vector, wherein the second operator is an operator of which the input parameters in the second calculation graph comprise the integer vector;
the integer vector is sent to a vector manager, variable names of the integer vector are generated based on the vector manager, and the variable names of the integer vector are sent to an operator code generator corresponding to the second operator;
and outputting the variable name to an operator calling code corresponding to the second operator based on an operator code generator corresponding to the second operator.
The memory management code is configured to:
acquiring a memory pool based on the memory configuration information of the at least one operator;
And acquiring the input and output memory requests of the at least one operator, and distributing the memory pool based on the input and output memory requests of the at least one operator.
The memory management code is further configured to:
Acquiring memory information corresponding to an input/output memory request of the at least one operator;
And acquiring weight parameters corresponding to the at least one operator based on the memory information, and updating the weight parameters based on a weight parameter updating mechanism.
The compiling processing is carried out on the target back-end code based on the operator library, and the compiling processing specifically comprises the following steps:
acquiring a preset subtask division method corresponding to an operator type of a third operator in an operator library, wherein the third operator is a supplementary operator;
Performing subtask division processing on an operator calling code corresponding to the third operator based on the preset subtask division method to obtain at least one subtask;
calculating the thread number of each subtask based on the at least one subtask, and executing the at least one subtask in parallel based on the thread number of each subtask;
and the target back-end code comprises an operator calling code corresponding to the third operator.
The special back-end code generating device supporting machine learning training provided by the invention can realize each process realized by the method embodiments of fig. 1 to 4 and achieve the same technical effects, and is not repeated here.
Fig. 6 is a schematic structural diagram of an electronic device according to the present invention. As shown in fig. 6, the electronic device may include a processor 610, a communication interface 620, a memory 630, and a communication bus 640, where the processor 610, the communication interface 620, and the memory 630 communicate with each other through the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a dedicated back-end code generation method supporting machine learning training, the method comprising:
Acquiring a first calculation map, and analyzing and optimizing the first calculation map to obtain a second calculation map, wherein the first calculation map is a neural network model formed by at least one operator according to a first sequence;
Acquiring at least one operator of a second computational graph, distributing the at least one operator to an operator mapping module based on the operator type of the at least one operator, and outputting an operator calling code corresponding to each operator;
acquiring memory configuration information of the at least one operator, inputting the memory configuration information of the at least one operator into a memory management module, and outputting a memory management code;
generating a target back-end code based on the operator calling code and the memory management code corresponding to each operator, and compiling the target back-end code based on an operator library to obtain a deployment file, wherein the deployment file is used for executing the computing task of the first computing graph.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method for generating a dedicated back-end code supporting machine learning training provided by the above methods, the method comprising:
Acquiring a first calculation map, and analyzing and optimizing the first calculation map to obtain a second calculation map, wherein the first calculation map is a neural network model formed by at least one operator according to a first sequence;
Acquiring at least one operator of a second computational graph, distributing the at least one operator to an operator mapping module based on the operator type of the at least one operator, and outputting an operator calling code corresponding to each operator;
acquiring memory configuration information of the at least one operator, inputting the memory configuration information of the at least one operator into a memory management module, and outputting a memory management code;
generating a target back-end code based on the operator calling code and the memory management code corresponding to each operator, and compiling the target back-end code based on an operator library to obtain a deployment file, wherein the deployment file is used for executing the computing task of the first computing graph.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the special back-end code generation method supporting machine learning training provided in the above embodiments, the method comprising:
Acquiring a first calculation map, and analyzing and optimizing the first calculation map to obtain a second calculation map, wherein the first calculation map is a neural network model formed by at least one operator according to a first sequence;
Acquiring at least one operator of a second computational graph, distributing the at least one operator to an operator mapping module based on the operator type of the at least one operator, and outputting an operator calling code corresponding to each operator;
acquiring memory configuration information of the at least one operator, inputting the memory configuration information of the at least one operator into a memory management module, and outputting a memory management code;
generating a target back-end code based on the operator calling code and the memory management code corresponding to each operator, and compiling the target back-end code based on an operator library to obtain a deployment file, wherein the deployment file is used for executing the computing task of the first computing graph.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.