CN120408524A - Operator fusion method, device, electronic device and storage medium - Google Patents

Operator fusion method, device, electronic device and storage medium

Info

Publication number
CN120408524A
CN120408524A CN202510570933.2A CN202510570933A CN120408524A CN 120408524 A CN120408524 A CN 120408524A CN 202510570933 A CN202510570933 A CN 202510570933A CN 120408524 A CN120408524 A CN 120408524A
Authority
CN
China
Prior art keywords
tensor
operator
data
dimension
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510570933.2A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Bilin Technology Development Co ltd
Original Assignee
Beijing Bilin Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Bilin Technology Development Co ltd filed Critical Beijing Bilin Technology Development Co ltd
Priority to CN202510570933.2A priority Critical patent/CN120408524A/en
Publication of CN120408524A publication Critical patent/CN120408524A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/251Fusion techniques of input or preprocessed data
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/10Pre-processing; Data cleansing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the technical field of artificial intelligence chips and provides an operator fusion method, device, electronic device, and storage medium. The method comprises: determining the input tensor, output tensor, and coverage range of an operator to be fused, wherein the operator to be fused includes multiple view-class operators; determining the mapping relationship between the dimensions of the input tensor and the output tensor based on the metadata of the input tensor, the metadata of the output tensor, and the logical relationships among the view-class operators within the coverage range; and filling the output tensor with data based on the mapping relationship and the data in the input tensor, and using the filled output tensor as the fusion result of the operator to be fused. By efficiently and quickly fusing multiple consecutive view-class operators, the present invention solves the performance degradation caused by frequent data movement in the model.

Description

Operator fusion method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence chips, and in particular to an operator fusion method, an operator fusion device, an electronic device, and a storage medium.
Background
View class operators play a vital role in deep learning frameworks and large model applications: they assist in computation optimization and ensure that data flows and is converted correctly between different network layers. However, in practical large model applications, some View class operators (e.g., transpose, permute, etc.) may leave tensors non-contiguous in memory. Once this happens, the deep learning framework (e.g., PyTorch) automatically invokes a contiguous operation, which moves data to restore the tensor's contiguity in memory.
However, when these View class operators are used frequently in a model, the contiguous operation is triggered frequently, resulting in a large amount of data movement. Moving tensor data is very time-consuming and, particularly for large-scale computing tasks and high-dimensional tensors, is very likely to become a performance bottleneck in model inference or training.
Disclosure of Invention
The invention provides an operator fusion method, an operator fusion device, electronic equipment and a storage medium, which are used for solving the problem of performance degradation caused by frequent data handling when a View operator is used in a model.
The invention provides an operator fusion method, which comprises the following steps:
determining an input tensor, an output tensor and a coverage range of an operator to be fused, wherein the operator to be fused comprises a plurality of view operators;
determining a mapping relation between each dimension of the input tensor and the output tensor based on the metadata of the input tensor, the metadata of the output tensor and the logic relation among view operators in the coverage range;
based on the mapping relation and each data in the input tensor, carrying out data filling on the output tensor, and taking the filled output tensor as a fusion result of the operator to be fused.
According to the operator fusion method provided by the invention, the coverage range is determined based on user input or detected automatically by the framework, and the step of determining the logic relationship among the view operators in the coverage range comprises the following step:
determining the logic relationship among the view operators based on the connection relationship of the view operators in the coverage range, the operator attributes of the view operators and the attributes of the tensors connecting the view operators.
According to the operator fusion method provided by the invention, carrying out data filling on the output tensor based on the mapping relation and each data in the input tensor comprises the following steps:
traversing the target index on each dimension of the output tensor;
determining, based on the mapping relation, the source index in the input tensor that corresponds to the currently traversed target index;
based on the source index, reading the corresponding data from the input tensor, and filling the data into the output tensor according to the currently traversed target index.
According to the operator fusion method provided by the invention, reading the corresponding data from the input tensor based on the source index and filling the data into the output tensor according to the currently traversed target index comprises the following steps:
based on the source index, reading the corresponding data from the input tensor into a temporary storage area, wherein the temporary storage area is any one of a register, a shared memory and a cache;
writing the data in the temporary storage area into the output tensor based on the currently traversed target index.
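The traversal described above can be sketched in a few lines. This is a minimal illustration rather than the claimed implementation: it assumes a single permute-style fused operator, flat row-major buffers, and strides counted in elements; the function name fill_output and the local variable standing in for the temporary storage area are hypothetical.

```python
from itertools import product

def fill_output(src, src_shape, perm):
    """Fill a flat output buffer from a flat row-major input buffer,
    as if permute(perm) had been applied, by walking every target index."""
    ndim = len(src_shape)
    # Row-major strides of the input, counted in elements (an assumption).
    src_strides = [1] * ndim
    for d in range(ndim - 2, -1, -1):
        src_strides[d] = src_strides[d + 1] * src_shape[d + 1]
    dst_shape = [src_shape[p] for p in perm]
    dst = []
    # product() yields target indices in row-major order (last index fastest),
    # so appending produces the output buffer in its own row-major layout.
    for target in product(*(range(s) for s in dst_shape)):
        # Mapping relation: output dimension d corresponds to input dimension perm[d].
        src_offset = sum(t * src_strides[p] for t, p in zip(target, perm))
        staged = src[src_offset]  # read into a temporary (stand-in for a register)
        dst.append(staged)        # write at the currently traversed target index
    return dst, dst_shape
```

Each element is read once and written once, regardless of how many view operators the mapping relation summarizes.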
According to the operator fusion method provided by the invention, carrying out data filling on the output tensor based on the mapping relation and each data in the input tensor comprises the following steps:
determining, based on the mapping relation, the dimension of the input tensor that corresponds to any dimension of the output tensor;
in the case that the target indexes of said dimension are contiguous and in the same order as the source indexes of the corresponding dimension, reading the data of the corresponding dimension of the input tensor with a data block read instruction, and filling the data into said dimension of the output tensor.
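The block-read condition can be illustrated as follows. This is a sketch under stated assumptions: the fused operator is a pure permutation, buffers are flat and row-major, and a Python slice stands in for a hardware data block read instruction. The name fill_with_block_reads is hypothetical.

```python
from itertools import product

def fill_with_block_reads(src, src_shape, perm):
    """Permute a flat row-major buffer, copying whole rows when possible."""
    ndim = len(src_shape)
    strides = [1] * ndim
    for d in range(ndim - 2, -1, -1):
        strides[d] = strides[d + 1] * src_shape[d + 1]
    dst_shape = [src_shape[p] for p in perm]
    inner = dst_shape[-1]
    # The innermost output dimension is contiguous and in source order only
    # when it maps to the innermost (unit-stride) input dimension.
    block_ok = perm[-1] == ndim - 1
    dst = []
    for outer in product(*(range(s) for s in dst_shape[:-1])):
        base = sum(outer[d] * strides[perm[d]] for d in range(ndim - 1))
        if block_ok:
            # One block read covers the whole innermost dimension.
            dst.extend(src[base:base + inner])
        else:
            # Fallback: element-by-element gather with the mapped stride.
            step = strides[perm[-1]]
            dst.extend(src[base + j * step] for j in range(inner))
    return dst, dst_shape
```

When the condition holds, the number of read operations drops from one per element to one per row of the innermost dimension.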
According to the operator fusion method provided by the invention, there are one or more input tensors and one or more output tensors.
According to the operator fusion method provided by the invention, the metadata comprises the shape, number of dimensions and step size (stride) of the tensor.
The invention also provides an operator fusion device, which comprises:
the range determining unit is used for determining an input tensor, an output tensor and a coverage range of an operator to be fused, wherein the operator to be fused comprises a plurality of view operators;
A relationship determining unit, configured to determine a mapping relationship between each dimension of the input tensor and the output tensor based on metadata of the input tensor, metadata of the output tensor, and a logical relationship between each view operator in the coverage area;
and the operator fusion unit is used for carrying out data filling on the output tensor based on the mapping relation and each data in the input tensor, and taking the filled output tensor as a fusion result of the operator to be fused.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and running on the processor, the processor implementing any one of the operator fusion methods described above when executing the computer program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements an operator fusion method as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements an operator fusion method as described in any one of the above.
According to the operator fusion method, device, electronic equipment and storage medium provided by the invention, the mapping relation between the input tensor and the output tensor of the operator to be fused can be derived accurately from the metadata of the input tensor and the output tensor and the logic relation among the view operators, so that the data in the input tensor can be filled into the corresponding positions of the output tensor according to the mapping relation. Operator fusion is thereby realized efficiently: a plurality of view operators that would otherwise be executed one after another are integrated and optimized into a single fusion operator. This fusion markedly reduces the number of view transformation operations executed and the number of data movements while the model runs, which reduces the extra overhead of data movement and solves the problem of performance degradation caused by frequent data movement in the model.
Drawings
In order to more clearly illustrate the invention or the technical solutions in the related art, the following description will briefly explain the drawings used in the embodiments or the related art description, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for those skilled in the art.
FIG. 1 is a schematic diagram of a graphics processor according to the present invention;
FIG. 2 is a flow diagram of an operator fusion method provided by the present invention;
FIG. 3 is a schematic diagram of the connection of view class operators provided by the present invention;
FIG. 4 is a schematic diagram of a specific structure and coverage of a fusion operator provided by the present invention;
FIG. 5 is a schematic diagram of a mapping relationship between input tensors and output tensors provided by the present invention;
FIG. 6 is a schematic diagram of an operator fusion apparatus provided by the present invention;
Fig. 7 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the training and inference of large models, parallel strategies such as data parallelism, tensor parallelism and sequence parallelism are often adopted to improve efficiency, and all of these strategies require tensors to be partitioned. To ensure that the partitioned tensors meet the interface requirements of the different operators that follow, tensor transformation operations are frequently introduced into the model. Such tensor transformations are also called View class transformations and are mainly implemented by View class operators, such as reshape, transpose, view, permute, repeat, split, etc. The core of these View class operators is to change the shape, arrangement order, or memory layout of tensors to suit different computing requirements.
In large models, view class operations mainly involve the transformation and adjustment of tensor shapes, which aims to optimize the computation flow, strengthen memory management, and ensure smooth adaptation of data between different network layers. The following are several common View class operators and application scenarios thereof:
The reshape operator changes the shape of a tensor but not the memory layout of its data. For example, a tensor of shape (2, 6) is reshaped to (3, 4). In an AI (Artificial Intelligence) model, the input and output shapes of different layers often differ, so the data shape needs to be adjusted by the reshape operator so that the data can be passed smoothly to the next layer. For example, between a convolutional layer and a fully connected layer, the tensor output by the convolutional layer is typically converted to a one-dimensional tensor by the reshape operator to meet the input requirements of the fully connected layer.
The transpose operator exchanges dimensions of a tensor. For example, a tensor of shape (2, 3) is transposed to (3, 2). The operation is widely used in scenarios such as matrix multiplication and convolution; particularly when inputs and outputs of different dimensions are processed, transposition ensures that data is handled in the correct dimension order, avoiding calculation errors caused by dimension mismatch.
The view operator creates a new view by changing the tensor shape, usually without any memory copy. For example, a tensor of shape (2, 6) is converted to (12,). In a deep learning model, the tensor shape needs to be adjusted dynamically according to task requirements, but users often want to avoid additional memory overhead. The view operation adjusts the tensor shape by sharing the same memory without copying data, thereby improving computing efficiency and reducing memory consumption.
The permute operator rearranges the dimensions of a tensor. Some operations impose strict requirements on the order of tensor dimensions; in image processing, for example, it is often necessary to move the channel dimension, such as adjusting a tensor of shape (C, H, W) to (H, W, C), where C represents the channel dimension of the tensor, H represents the height dimension and W represents the width dimension. This operation helps to optimize matrix operation performance because different hardware platforms (e.g., GPUs) have different requirements for data layout, and the permute operation ensures that the data layout meets the computational requirements.
The repeat operator repeats a tensor along a specified dimension to generate a new, larger tensor. For example, in an image processing task, the repeat operator may repeat an image feature map along the channel dimension: a feature map of shape (C, H, W) repeated k times along the channel dimension yields a feature map of shape (kC, H, W), which is used to increase the diversity of training samples.
The split operator is used to split the tensor into multiple sub-tensors along a specified dimension, and can be implemented by specifying the number of splits or the size of each sub-tensor. For example, in a convolutional neural network, a feature map can be split into multiple sub-feature maps along a channel dimension by using split operators, and the sub-feature maps are respectively sent to different branch networks for processing, for example, the feature map with the shape (C, H, W) is split into k sub-feature maps along the channel dimension, and each sub-feature map with the shape (C/k, H, W) is used for realizing multi-branch feature extraction.
By reasonably using the View operator, the large model can process data more efficiently, and the overall calculation performance is improved.
The deep learning framework serves as a bridge connecting large model theory and practical application, and provides developers with a series of convenient APIs (Application Programming Interfaces), modules and tools so that models can be built, trained, optimized and deployed easily. Currently, common deep learning frameworks include TensorFlow, PyTorch, Keras, Caffe and the like. The data handled by a large model often exists in the form of multi-dimensional arrays, which are typically modeled as tensors in a deep learning framework.
In a deep learning framework, some operations (e.g., the view operation of PyTorch) require that the tensor be contiguous in memory, i.e., that the data elements of the tensor be stored sequentially in memory; otherwise an error may be reported. If the tensor is not contiguous, PyTorch automatically invokes a contiguous operation to ensure the tensor's contiguity in memory. Specifically, PyTorch allocates a new memory space and copies the data of the tensor into it in logical order.
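The contiguity condition can be illustrated with metadata alone. The following sketch uses plain Python rather than any framework API; the helper names are hypothetical, strides are counted in elements, and the check is deliberately simplified (it ignores the special handling that real frameworks give to dimensions of size 1).

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a densely packed tensor."""
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    return strides

def permute_metadata(shape, strides, perm):
    """A permute only reorders metadata; no element is moved."""
    return [shape[p] for p in perm], [strides[p] for p in perm]

def is_contiguous(shape, strides):
    """Simplified check: strides must equal the dense row-major strides."""
    return strides == contiguous_strides(shape)
```

For a tensor of shape [2, 3, 4], the dense strides are [12, 4, 1]. After permute_metadata with (0, 2, 1), the shape becomes [2, 4, 3] and the strides become [12, 1, 4], which no longer match the dense strides [12, 3, 1] of that shape, so a copy would be needed to make the tensor contiguous again.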
However, in practical applications, some View class operators (e.g., transpose, permute, etc.) may leave tensors non-contiguous in memory. When such View class operators are used frequently in a model, the contiguous operation is invoked frequently, resulting in a large amount of data movement. Moving tensor data is very time-consuming, especially for large-scale computation and high-dimensional tensors. Moreover, frequent invocation of View class operators increases the complexity of the computational graph, especially for each back propagation step in model training. In addition, each time a View class operator is invoked to perform a view transformation on a tensor, the size and memory layout of the tensor may need to be recalculated in the background, which affects the performance of model training or inference.
In this regard, the invention provides an operator fusion method for View operators in a large model, which solves the problem of performance degradation caused by frequent data handling in the model by carrying out efficient and rapid fusion on a plurality of continuous View operators, thereby overcoming the defects. The technical scheme provided by the invention will be described in detail.
It should be noted that the execution body of the operator fusion method provided by the present invention may be an artificial intelligence chip, such as a GPU (Graphics Processing Unit), a GPGPU (General-Purpose Graphics Processing Unit), a TPU (Tensor Processing Unit), etc. The structure of the execution body is briefly described below, taking a GPU as an example.
Fig. 1 is a schematic diagram of a graphics processor according to the present invention. As shown in fig. 1, the graphics processor 100 includes at least a plurality of streaming processor clusters (SPCs) 101 and a global memory 102, where each streaming processor cluster 101 includes a plurality of computing units 103, and each computing unit 103 includes a plurality of cores. The graphics processor 100 of the present invention may include other structures in addition to those described above, and the present invention is not limited in this respect. It should be appreciated that GPUs possess a large number of computational cores (i.e., the cores within the computing units), are adept at parallel computing, and are extremely efficient at large-scale data computation tasks. In computationally intensive fields such as deep learning, GPUs have become the dominant computing hardware. Many deep learning frameworks are optimized for GPUs so that operations such as operator fusion can be performed efficiently on them.
It will be appreciated that in the related art, when a model frequently uses View class operators, the contiguous operation is frequently invoked. In this operation, the GPU cores read the non-contiguous data from the original memory element by element and write it into new memory; this involves reads and writes of the global memory and generates a large amount of data movement, which degrades the performance of model training and inference. By efficiently and rapidly fusing a plurality of consecutive View operators, the present invention effectively reduces data movement, thereby reducing the number of accesses to the global memory and improving overall performance.
Fig. 2 is a flow chart of an operator fusion method provided by the present invention, as shown in fig. 2, the method includes:
Step 210, determining an input tensor, an output tensor and a coverage range of an operator to be fused, wherein the operator to be fused comprises a plurality of view class operators.
Specifically, the operator to be fused refers to a plurality of View operators which need to be fused together. The input tensor of the operator to be fused refers to a data source for the operator to be fused to operate, which is a multidimensional array containing data required by operator processing. The output tensor of the operator to be fused refers to the result obtained by processing the input tensor by the operator to be fused, and the shape of the output tensor may be changed according to the operation of the operator. The coverage range of operators to be fused refers to the operation ranges related to a plurality of View type operators participating in fusion. By determining the coverage, it is possible to ascertain which operators need to be fused, and the order of operations and logical relationships between them.
It can be understood that the operator fusion method provided by the invention can be applied to the fields of image processing, text processing, voice processing and the like, and the input tensor and the output tensor have different physical meanings according to different application scenes.
Illustratively, in the field of image processing, the input tensor may be a tensor representing an image with the shape [B, C, H, W], where B is the batch dimension (the batch size), C is the channel dimension (the number of channels), and H and W are the height and width dimensions (the height and width of the image). For example, an RGB image may be represented as an input tensor of shape [1, 3, 224, 224], where 1 is the batch size, 3 represents the three RGB channels, and 224 and 224 are the height and width of the image. The shape of the output tensor depends on the processing applied to the image; for example, if the channel and spatial dimensions of the image are rearranged using reshape and permute operators, the resulting tensor of shape [1, 224, 224, 3] is the output tensor.
For another example, in the text processing field, the input tensor may be a tensor representing a sequence of text. The sentence "I love natural language processing" can be encoded as an input tensor of shape [1, 5, 768], where 1 is the batch size, 5 is the length of the sentence (assuming the sentence is split into 5 words), and 768 is the vector dimension of each word. The shape of the output tensor depends on the processing task applied to the text; for example, if the dimensions of the text sequence are transformed using the permute operator, changing the shape from [1, 5, 768] to [5, 1, 768], this tensor can be the output tensor.
In an embodiment, the input tensor, the output tensor and the coverage range of the operator to be fused can be automatically detected through a deep learning framework. Specifically, the computational graph can be analyzed and traversed through the internal graph structure of the deep learning framework, the View operators in the computational graph are identified and detected, and when all operators from one node to the other node in the computational graph are detected to be View operators, it can be determined that the View operators corresponding to the nodes are combined to form an operator to be fused, and the coverage range of the operator to be fused is all View operators from the starting node to the ending node. Further, from the detected initial node, the first non-View operator (such as convolution, full connection, etc.) is found along the computational graph in a backward trace, and the output of the operator is the input tensor of the operator to be fused. If the starting node is itself the input node of the model, the tensor of the node is the input tensor. Similarly, starting from the detected end node of the View class operator combination, forward advancing along the flow direction of the computational graph, and finding the output node of the first non-View class operator or model, wherein the input of the operator is the output tensor of the operator to be fused. If the end node itself is the output node of the model, the tensor of the node is the output tensor.
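The automatic detection can be sketched on a simplified model of the computational graph. The sketch below assumes the graph has been linearized into a list of operator names in execution order, which is a simplification of a real framework graph; the set VIEW_OPS, the function find_fusion_ranges, and the min_len parameter are all hypothetical. Including cat in the set follows the example of fig. 3, where the splicing operator is treated as part of the operator to be fused.

```python
# Hypothetical set of operator names treated as View class operators.
VIEW_OPS = {"reshape", "transpose", "view", "permute", "repeat", "split", "cat"}

def find_fusion_ranges(ops, min_len=2):
    """Return (start, end) index pairs of maximal runs of View class operators.

    Runs shorter than min_len are skipped, since fusing a single
    operator brings no benefit.
    """
    ranges, start = [], None
    # A trailing None acts as a sentinel that closes a run ending at the last op.
    for i, name in enumerate(list(ops) + [None]):
        if name in VIEW_OPS:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                ranges.append((start, i - 1))
            start = None
    return ranges
```

For each returned range, the output of the operator just before start (or the model input) would serve as the input tensor, and the input of the operator just after end (or the model output) would serve as the output tensor.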
In another embodiment, the input tensor, the output tensor and the coverage range of the operator to be fused can also be obtained by user input, that is, when the user calls the operators in the upper layer framework, the operator to be fused can be directly specified to include which View operators, the connection relationship between the operators, the corresponding input tensor, the corresponding output tensor and the like.
Fig. 3 is a schematic diagram of the connection of view operators provided by the present invention. As shown in fig. 3, tensor 1 is processed by a permutation operator to obtain tensor 2; tensor 2 and tensor 3 are each reshaped by a reshape operator to obtain tensor 4 and tensor 5 respectively; and tensor 4 and tensor 5 are concatenated by a concatenation operator to obtain tensor 6. In fig. 3, since the permutation operator, the reshape operators and the concatenation operator are a plurality of consecutive View operators, they form an operator to be fused. The input tensors of the operator to be fused are tensor 1 and tensor 3, the output tensor is tensor 6, and the connection paths between the input tensors and the output tensor are the coverage range of the operator to be fused.
Step 220, determining a mapping relationship between each dimension of the input tensor and the output tensor based on the metadata of the input tensor, the metadata of the output tensor and the logic relationship between each view operator in the coverage area.
It should be noted that metadata of a tensor refers to information describing an attribute or structure of the tensor itself, and is not actual data stored in the tensor. The metadata of the tensor includes, but is not limited to, shape, dimension, step size, etc. of the tensor, where shape is used to describe the dimensional structure and dimension size of the tensor, and dimension is used to describe the number of dimensions of the tensor, for example, the shape of the tensor may be [3, 4], which represents that the tensor is a matrix of 3 rows and 4 columns, and the dimension of the tensor is 2, which represents that it is a two-dimensional tensor. The step size refers to the number of bytes that need to be skipped when accessing the tensor element in the memory and moving one unit along each dimension, for example (4, 1) indicates that adjacent row elements are separated by 4 bytes and adjacent column elements are separated by 1 byte in the memory. It should be appreciated that in the deep learning framework, tensor metadata may be obtained by invoking the corresponding interface function (e.g., shape, stride).
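As a small illustration of this metadata, the sketch below derives the number of dimensions and the step sizes from a shape. The class TensorMeta and its itemsize field are hypothetical. One caveat worth noting: PyTorch's Tensor.stride() reports step sizes in elements, while NumPy's ndarray.strides reports them in bytes, so the sketch exposes both.

```python
from dataclasses import dataclass

@dataclass
class TensorMeta:
    shape: tuple
    itemsize: int = 4  # assumed bytes per element (e.g. a 32-bit float)

    @property
    def ndim(self):
        # Number of dimensions of the tensor.
        return len(self.shape)

    @property
    def strides(self):
        # Row-major step sizes in elements, innermost dimension first.
        acc, out = 1, []
        for size in reversed(self.shape):
            out.append(acc)
            acc *= size
        return tuple(reversed(out))

    @property
    def byte_strides(self):
        # The same step sizes expressed in bytes.
        return tuple(s * self.itemsize for s in self.strides)
```

With shape (3, 4), ndim is 2 and strides is (4, 1); the byte values equal (4, 1) only when each element occupies one byte, and become (16, 4) for 4-byte elements.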
The logical relationship among the View class operators in the coverage range refers to the dependencies and the order of operations among the View class operators. For example, as shown in fig. 3, the logical relationship between the permutation operator and the reshape operator is that the output of the permutation operator is the input of the reshape operator, so the permutation operator must be executed before the reshape operator. The logical relationships between the other View class operators are similar and are not described in detail herein.
Specifically, according to metadata of the input tensor, metadata of the output tensor, and a logic relationship between View operators in the coverage area, a mapping relationship between dimensions of the input tensor and the output tensor can be determined. Here, the mapping relationship refers to a correspondence relationship between each dimension of the input tensor and a dimension of the output tensor, for example, assuming that a shape of the input tensor is [ B, C, H, W ], a dimension 0 thereof is a batch dimension, a dimension 1 thereof is a channel dimension, a dimension 2 thereof is a height dimension, a dimension 3 thereof is a width dimension, a shape of the output tensor is [ B, H, W, C ], a dimension 0 thereof is a batch dimension, a dimension 1 thereof is a height dimension, a dimension 2 thereof is a width dimension, a dimension 3 thereof is a channel dimension, and a mapping relationship between the two dimensions is that the input dimension 0 corresponds to the output dimension 0 (batch dimension is unchanged), the input dimension 1 corresponds to the output dimension 3 (channel dimension is changed to the last dimension), the input dimension 2 corresponds to the output dimension 1 (height dimension is changed to the second dimension), and the input dimension 3 corresponds to the output dimension 2 (width dimension is changed to the third dimension).
It can be understood that, in order to determine the mapping relationship between each dimension of the input tensor and the output tensor, the method can be implemented by performing deduction analysis according to the obtained metadata such as the shape, step length and the like of the input tensor and the output tensor and the logic relationship of each View operator in the coverage area, gradually deducing according to the operation function and the operation sequence of each View operator in the analysis process, and finally obtaining the dimension mapping relationship from the input tensor to the output tensor, and describing the corresponding relationship between the input dimension and the output dimension by using a dimension index mapping table.
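The step-by-step derivation of a dimension index mapping table can be sketched for the simplest case, a chain of pure permutations. The function names are hypothetical, and operators that merge or split dimensions (such as reshape) are outside the scope of this sketch.

```python
def compose_permutes(perms):
    """Apply permutations in order; result[d] is the input dimension
    that ends up at output position d."""
    current = list(range(len(perms[0])))
    for perm in perms:
        current = [current[p] for p in perm]
    return current

def dimension_table(out_to_in):
    """Dimension index mapping table: input dimension -> output dimension."""
    return {src: dst for dst, src in enumerate(out_to_in)}
```

For the [B, C, H, W] to [B, H, W, C] example, compose_permutes([(0, 2, 3, 1)]) gives [0, 2, 3, 1] and the table is {0: 0, 1: 3, 2: 1, 3: 2}, matching the correspondence described above. Two successive dimension swaps, (0, 2, 1, 3) followed by (0, 1, 3, 2), compose to the same table, which is the sense in which several view operators collapse into one mapping.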
As shown in fig. 3, assume that the shape of tensor 1 is [ n, h, w ], and the shape of tensor 3 is [ n, 2 ]W, h ], where n represents the batch dimension, h represents the height dimension, and w represents the width dimension. Tensors 1 to 6 are respectively expressed as tensor1 to tensor, and the transformation relation between the tensors can be expressed as follows:
tensor2 = tensor1.permute(0, 2, 1);
tensor4 = tensor2.reshape(n * w, h);
tensor5 = tensor3.reshape(2 * n * w, h);
tensor6 = torch.cat((tensor4, tensor5), dim=0);
Since the shape of the input tensor tensor1 is [n, h, w] and tensor2 is obtained by rearranging the dimensions of tensor1 with the permute operator, where permute(0, 2, 1) interchanges the second dimension (dim=1) and the third dimension (dim=2) of the tensor, the shape of tensor2 can be determined to be [n, w, h] from the shape of tensor1 and the operation function of the permute operator. tensor4 is obtained by reshaping tensor2 with the reshape operator, where reshape(n*w, h) flattens the tensor into two dimensions, the first being n*w and the second being h. Likewise, from the shape of tensor2 and the operation function of the reshape operator, the shape of tensor4 can be determined to be [n*w, h].
Since the input tensor tensor3 has the shape [n, 2*w, h] and tensor5 is obtained by reshaping tensor3 with the reshape operator, where reshape(2*n*w, h) flattens the tensor into two dimensions, the first being 2*n*w and the second being h, the shape of tensor5 can be determined to be [2*n*w, h] from the shape of tensor3 and the operation function of the reshape operator. Finally, tensor6 is obtained by concatenating tensor4 and tensor5 along the first dimension (dim=0) with the cat operator, so from the shapes of tensor4 and tensor5 and the operation function of the cat operator, the shape of tensor6 can be determined to be [n*w + 2*n*w, h] = [3*n*w, h].
According to the above step-by-step analysis and derivation, the mapping relationship between each dimension of the input tensors and the output tensor can be determined. That is, the first dimension of the output tensor is divided into two parts: the first part, of size n*w, is determined by the batch dimension (n) and the width dimension (w) of the input tensor tensor1, and the second part, of size 2*n*w, is determined by the batch dimension (n) and the width dimension (2*w) of the input tensor tensor3. The second dimension of the output tensor is determined directly by the height dimensions of tensor1 and tensor3.
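The derivation above can be checked with a small sketch that fills tensor6 directly from tensor1 and tensor3 and compares the result with the step-by-step permute, reshape, and cat sequence. Nested Python lists stand in for tensors, and the function names `make_tensor`, `fused_fill`, and `reference` as well as the sizes n=2, h=3, w=4 are assumptions chosen for illustration.

```python
def make_tensor(shape, start):
    """Build a 3-D nested-list tensor filled with consecutive integers."""
    d0, d1, d2 = shape
    return [[[start + (a * d1 + b) * d2 + c for c in range(d2)]
             for b in range(d1)] for a in range(d0)]


def fused_fill(tensor1, tensor3, n, h, w):
    """Fill tensor6 of shape [3*n*w, h] directly from the two inputs."""
    out = [[None] * h for _ in range(3 * n * w)]
    for r in range(3 * n * w):      # target index in output dimension 0
        for c in range(h):          # target index in output dimension 1
            if r < n * w:
                # First part: row r comes from tensor1 via permute + reshape.
                b, x = divmod(r, w)
                out[r][c] = tensor1[b][c][x]
            else:
                # Second part: row r comes from tensor3 via reshape only.
                b, x = divmod(r - n * w, 2 * w)
                out[r][c] = tensor3[b][x][c]
    return out


def reference(tensor1, tensor3, n, h, w):
    """Unfused version: permute, reshape, reshape, cat, one step at a time."""
    tensor2 = [[[tensor1[b][c][x] for c in range(h)] for x in range(w)]
               for b in range(n)]                            # permute(0, 2, 1)
    tensor4 = [row for batch in tensor2 for row in batch]    # reshape(n*w, h)
    tensor5 = [row for batch in tensor3 for row in batch]    # reshape(2*n*w, h)
    return tensor4 + tensor5                                 # cat(dim=0)


n, h, w = 2, 3, 4
t1 = make_tensor((n, h, w), start=0)
t3 = make_tensor((n, 2 * w, h), start=1000)
print(fused_fill(t1, t3, n, h, w) == reference(t1, t3, n, h, w))  # True
```

The fused version never materialises tensor2, tensor4, or tensor5; each output element is read once from an input tensor and written once.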
In addition, after the mapping relation between each dimension of the input tensor and the output tensor is determined, the mapping relation can be verified according to the metadata of the input tensor and the metadata of the output tensor, so as to ensure that the mapping relation accords with the logic definition of an operator.
Step 230, performing data filling on the output tensor based on the mapping relation and each data in the input tensor, and taking the filled output tensor as a fusion result of the operator to be fused.
Specifically, after determining and obtaining the mapping relation between each dimension of the input tensor and the output tensor, each data mapping in the input tensor can be filled into the corresponding position of the output tensor according to the mapping relation. Here, each data in the input tensor refers to a specific value stored in the input tensor, and the tensor is a multidimensional array, and each element of the tensor is a scalar value (such as float32, int64, etc.).
Specifically, the mapping relationship defines a correspondence relationship between the input tensor dimension and the output tensor dimension. According to the mapping relation, each data in the input tensor can be filled into the output tensor by firstly traversing the index of each data point in each dimension in the output tensor, determining the corresponding index of the currently traversed index in the input tensor according to the mapping relation, then reading the corresponding data from the input tensor according to the corresponding index, and storing the data in the corresponding position of the output tensor. And after the traversing is completed, obtaining the filled output tensor. Here, the filled output tensor refers to a tensor obtained by rearranging data in the input tensor according to a mapping relationship, and may have a different shape from the input tensor, the data content is the same as the input tensor, but the arrangement order may be different. It should be appreciated that the data population of the output tensor based on the mapping relationship and the respective data in the input tensor may be performed by the computational core of the GPU. The GPU computing core reads each data from the global memory address corresponding to the input tensor, and then writes the read input tensor data into the global memory corresponding to the output tensor according to the mapping relation determined before.
It can be understood that after the filled output tensor is obtained, the output tensor can be used as a final fusion result of the operator to be fused. Here, the fusion result is an output tensor directly obtained after merging the operations of a plurality of View operators into one operator, which is logically equivalent to the result of the original computational graph, but has higher computational efficiency. For example, as shown in fig. 3, according to the mapping relationship, the data in the input tensors (i.e. tensor 1 and tensor 3) are directly filled into the output tensor (i.e. tensor 6), and the fused operator directly outputs the result, without separately executing the permutation operator, the remodelling operator, the splicing operator, and the like. The data content of the filled output tensor is consistent with the result of the original computational graph, but the computational process is more efficient.
According to the method provided by the embodiment of the invention, the mapping relation between the input tensor and the output tensor of the operator to be fused can be accurately deduced based on the metadata of the input tensor and the output tensor and the logical relation between the view operators, so that each data in the input tensor can be accurately filled into the corresponding position of the output tensor according to the mapping relation. Operator fusion is thereby effectively realized, that is, a plurality of view operators that originally had to be executed one after another are integrated and optimized into one fusion operator. With this fusion mode, the number of view transformation operations executed and the number of data transfers are significantly reduced while the model runs, so that the additional overhead caused by data transfers is effectively reduced and the problem of performance degradation caused by frequent data transfers in the model is solved.
Based on the above embodiments, the coverage range is determined based on user input or automatic detection by the framework.
Specifically, the coverage range of the operator to be fused can be specified by user input or can be automatically detected by the deep learning framework, and the embodiment of the invention is not specifically limited in this respect. It should be noted that, for the specific implementation of determining the coverage range through user input or automatic detection by the framework, reference may be made to the above embodiment, which is not repeated here.
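As a rough illustration of automatic detection, the sketch below scans a linear sequence of operator names and reports runs of consecutive View operators as candidate coverage ranges. This is a simplification under stated assumptions: a real framework works on a computational graph rather than a flat list, and the operator set `VIEW_OPS`, the function `find_view_runs`, and the `min_len` threshold are hypothetical names introduced here.

```python
# Assumed set of operator names treated as View operators in this sketch.
VIEW_OPS = {"permute", "reshape", "view", "transpose", "split", "repeat", "cat"}


def find_view_runs(op_names, min_len=2):
    """Return (start, end) index pairs, end exclusive, of consecutive View ops."""
    runs = []
    start = None
    # A trailing None sentinel closes a run that reaches the end of the list.
    for idx, name in enumerate(list(op_names) + [None]):
        if name in VIEW_OPS:
            if start is None:
                start = idx
        else:
            if start is not None and idx - start >= min_len:
                runs.append((start, idx))
            start = None
    return runs


ops = ["matmul", "permute", "reshape", "cat", "softmax", "reshape", "matmul"]
print(find_view_runs(ops))  # [(1, 4)]
```

The single `reshape` at index 5 is not reported because a run of one operator leaves nothing to fuse.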
Based on any one of the above embodiments, the determining the logical relationship between view class operators in the coverage area includes:
Determining the logic relationship among the view type operators based on the connection relationship of the view type operators in the coverage area, the operator attribute of the view type operators and the tensor attribute connected among the view type operators.
It should be noted that, the connection relationship of each View operator in the coverage range refers to the association of the sequence, the dependency relationship, the data flow direction and the like in the calculation flow among a plurality of View operators in the same coverage range under the operator fusion scene.
Operator attributes of each view-class operator refer to the parameters and features that describe the characteristics and behavior of the operator itself. These attributes reflect the functions, limitations, and features that the operator has in the computation process. The operator attributes may include the operation type, the input and output requirements, and the parameter configuration. The operation type refers to the specific operation performed by the operator on a tensor, such as permutation, reshaping, or splicing. The input and output requirements refer to the requirements of the operator on the shape, data type, and so on of its input and output tensors. The parameter configuration refers to the additional information required when the operator is executed. For example, the parameter of the permute operator is a tuple giving the dimension order, so the parameter (0, 2, 1) indicates that dimension 0 of the input tensor is kept unchanged and that dimensions 1 and 2 exchange positions.
Tensor attributes connected between view class operators refer to various characteristics and parameters of tensors connected with two adjacent view class operators, and the tensor attributes describe information such as the shape, data type, storage layout and the like of the tensors.
Specifically, according to the connection relation of the view operators, the operator attribute of the view operators and the tensor attribute of the connection between the view operators in the coverage range, the logic relation between the view operators can be determined by firstly determining the sequence and the dependency relation between the operators based on the connection relation. For example, if the output tensor of operator a is the input tensor of operator B, it may be determined that a must be performed before B, which depends on the output result of a. Meanwhile, according to the connection relation, the operator from which the data flows to which operator and the transfer mode of the data among operators can be known. Then, for each view class operator in the coverage area, according to the operation type and parameter configuration thereof and tensor attributes connected with adjacent view class operators, a logic operation mode between operators can be determined, for example, if A is permute operator and B is reshape operator, which dimensions of the input tensor are rearranged can be determined according to the parameter configuration of A and the attributes of the input tensor, and how B specifically reshapes the output tensor of A is further determined according to the parameter configuration of B and the attributes of the output tensor of A. The analysis process is repeated until the logic relation among the view operators in the coverage area is obtained, so that an accurate basis is provided for the subsequent operator fusion.
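A minimal sketch of this repeated analysis, assuming the operators in the coverage range form a simple chain and that only permute and reshape occur: each operator's type and parameter configuration is applied to the shape of the tensor produced by its predecessor. The function names `infer_shape` and `propagate` are assumptions for illustration.

```python
from math import prod


def infer_shape(shape, op, params):
    """Output shape of one view-class operator given its input shape."""
    if op == "permute":
        # params is the dimension order, e.g. (0, 2, 1).
        return [shape[d] for d in params]
    if op == "reshape":
        # A reshape must preserve the total number of elements.
        if prod(params) != prod(shape):
            raise ValueError(f"cannot reshape {shape} into {list(params)}")
        return list(params)
    raise ValueError(f"unsupported operator: {op}")


def propagate(shape, ops):
    """Apply operators in dependency order; return every intermediate shape."""
    shapes = [list(shape)]
    for op, params in ops:
        shapes.append(infer_shape(shapes[-1], op, params))
    return shapes


n, h, w = 2, 3, 4
chain = [("permute", (0, 2, 1)), ("reshape", (n * w, h))]
print(propagate([n, h, w], chain))  # [[2, 3, 4], [2, 4, 3], [8, 3]]
```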
Based on any of the foregoing embodiments, in step 230, the data filling the output tensor based on the mapping relationship and each data in the input tensor includes:
Step 231, traversing the target index in each dimension of the output tensor.
Specifically, the target index in each dimension of the output tensor refers to an index combination for uniquely identifying each element position in the multidimensional array structure of the output tensor. For example, for a three-dimensional output tensor, whose shape is [ B, H, W ], the position of an element can be represented by a triplet (i, j, k), where i is the index of the first dimension, j is the index of the second dimension, and k is the index of the third dimension. These indices (i, j, k) are target indices that specify the target locations of the data in the output tensor.
It will be appreciated that traversing the target index across the dimensions of the output tensor may be accomplished by a nested loop, the particular implementation depending on the number of dimensions of the output tensor. For example, for a three-dimensional output tensor, the target index for each dimension may be traversed by the following nested loop:
for i in range(output_tensor.shape[0]):
    for j in range(output_tensor.shape[1]):
        for k in range(output_tensor.shape[2]):
            target_index = (i, j, k)  # the currently traversed target index
In the above example, output_tensor represents the output tensor, shape[0] represents the size of the first dimension of the tensor, shape[1] represents the size of the second dimension, and shape[2] represents the size of the third dimension. Higher-dimensional tensors can be handled in the same way with more nested loops, which is not repeated here.
Step 232, determining a source index corresponding to the currently traversed target index in the input tensor based on the mapping relation.
The currently traversed target index refers to the combination of indexes currently being processed when traversing each dimension of the output tensor in step 231. For example, when traversing a three-dimensional output tensor, the currently traversed target index may be (2, 3, 1), indicating that the element of the output tensor with a first-dimension index of 2, a second-dimension index of 3, and a third-dimension index of 1 is currently being processed.
The source index corresponding to the target index of the current traversal in the input tensor refers to mapping the target index of the current traversal to the corresponding index combination in the input tensor according to the mapping relation. This source index is used to read data from the input tensor to populate the target position currently traversed in the output tensor.
In particular, the mapping relationship characterizes the correspondence between the dimensions of the input tensor and the output tensor, which may be a function, a lookup table or a mathematical formula, the specific form depending on the shape and transformation of the input and output tensors.
For example, if the mapping relationship is expressed by a mathematical formula, the source index may be calculated directly by functions. Assuming that the target index of the output tensor is (i, j, k), that the corresponding source index of the input tensor is (i', j', k'), and that the mapping relationship is i' = f1(i, j, k), j' = f2(i, j, k), k' = f3(i, j, k), the source index may be calculated by calling these functions.
For another example, if the mapping relationship is complex, the mapping relationship between all the target indexes and the source indexes may be calculated in advance and stored in one lookup table. When traversing the output tensor, the corresponding source index is directly looked up from the lookup table.
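The lookup-table variant can be sketched as follows. The table is built once by evaluating an index function for every target index; the function name `build_lookup_table` and the example mapping, which corresponds to output = input.permute(0, 2, 1), are assumptions for illustration.

```python
from itertools import product


def build_lookup_table(out_shape, index_fn):
    """Map every target index of the output tensor to its source index."""
    return {target: index_fn(*target)
            for target in product(*(range(d) for d in out_shape))}


# For output = input.permute(0, 2, 1), target (i, j, k) reads source (i, k, j).
table = build_lookup_table((2, 4, 3), lambda i, j, k: (i, k, j))
print(len(table))        # 24
print(table[(1, 3, 2)])  # (1, 2, 3)
```

A table of this kind trades memory for computation, since it holds one entry per output element.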
Step 233, based on the source index, reading corresponding data from the input tensor, and filling the data into the output tensor according to the currently traversed target index.
Specifically, after the source index corresponding to the target index is determined, the source index may be used to read data from the input tensor. For example, if the source index is (i', j', k'), the data at the corresponding position in the input tensor may be read through input_tensor[i', j', k']. The read data is then filled into the output tensor using the currently traversed target index. For example, if the currently traversed target index is (i, j, k), the data may be written to the corresponding position in the output tensor by output_tensor[i, j, k] = data.
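Steps 231 to 233 can be put together in a short sketch that works for any number of dimensions. Nested Python lists stand in for tensors, and the helpers `get_item`, `set_item`, `zeros`, and `fill_output` are hypothetical names; the mapping is passed in as a function from target index to source index.

```python
from itertools import product


def get_item(tensor, index):
    """Read the element of a nested-list tensor at a multi-dimensional index."""
    for i in index:
        tensor = tensor[i]
    return tensor


def set_item(tensor, index, value):
    """Write the element of a nested-list tensor at a multi-dimensional index."""
    for i in index[:-1]:
        tensor = tensor[i]
    tensor[index[-1]] = value


def zeros(shape):
    """Allocate a nested-list tensor of the given shape."""
    if len(shape) == 1:
        return [0] * shape[0]
    return [zeros(shape[1:]) for _ in range(shape[0])]


def fill_output(input_tensor, out_shape, source_index_of):
    output_tensor = zeros(out_shape)
    for target in product(*(range(d) for d in out_shape)):  # step 231
        source = source_index_of(target)                    # step 232
        data = get_item(input_tensor, source)               # step 233: read
        set_item(output_tensor, target, data)               # step 233: write
    return output_tensor


# A 2 x 3 input transposed into a 3 x 2 output: target (i, j) reads source (j, i).
result = fill_output([[1, 2, 3], [4, 5, 6]], (3, 2), lambda t: (t[1], t[0]))
print(result)  # [[1, 4], [2, 5], [3, 6]]
```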
Based on any of the above embodiments, step 233 specifically includes:
Step 2331, based on the source index, reading corresponding data from the input tensor into a temporary storage area, wherein the temporary storage area is any one of a register, a shared memory and a cache;
Step 2332, writing the data in the temporary storage area into the output tensor based on the currently traversed target index.
It should be noted that, considering that in some hardware devices (such as GPU), the access delay of the global memory is high and the bandwidth is limited, if the data is directly read from the global memory and written back to the global memory, the memory bandwidth is a bottleneck, which reduces the computing efficiency. In this regard, the embodiment of the present invention proposes that, based on the consideration of hardware architecture and performance optimization, in the process of data mapping and filling, data in an input tensor may be read into a temporary storage area first, and then written into an output tensor from the temporary storage area.
Specifically, the temporary storage area is a memory area for temporarily storing data, and can accelerate data access and processing. Here, the temporary storage area may be a register, a shared memory, a cache, or the like, which has access delay far lower than that of the global memory, and is suitable as a data transfer station. The method comprises the steps of reading data from a global memory address corresponding to a source index of an input tensor through a memory access instruction, storing the data in a temporary storage area, and then writing the data in the temporary storage area back to the global memory address corresponding to a target index of an output tensor through the memory access instruction.
Based on any of the foregoing embodiments, in step 230, the data filling the output tensor based on the mapping relationship and each data in the input tensor includes:
determining a corresponding dimension of any dimension of the output tensor in the input tensor based on the mapping relation;
and under the condition that the target index of any dimension is continuous and the sequence is the same as the source index of the corresponding dimension, reading each data of the corresponding dimension of the input tensor by adopting a data block reading instruction, and filling each data into any dimension of the output tensor.
It should be noted that GPUs generally provide dedicated, optimized memory access for tensors of a specific layout. For example, for a tensor of type Matrix3D (three-dimensional matrix), the GPU kernel may read and write multiple data at a time through coalesced accesses. By using such dedicated memory access instructions, the memory access performance of the fusion operator can be further improved.
Specifically, when the output tensor is filled with data based on the mapping relationship and each data in the input tensor, the corresponding relationship between each dimension of the output tensor and each dimension of the input tensor may be determined according to the mapping relationship. Then, a relationship between the target index in each output dimension and the source index in the corresponding input dimension is determined, if the target index in a certain output dimension is continuous and in the same order as the source index in the corresponding input dimension, the data in the output dimension is continuously stored in the corresponding input dimension, and the target index sequence of the output dimension is the same as the source index sequence of the corresponding input dimension. In this case, since the data are continuous and in the same order, the data block reading instruction can be used to read a plurality of data at a time, and the access operation is combined, so that the memory access efficiency is improved, and the occupation of the memory bandwidth is reduced. Here, the data block read instruction refers to an instruction capable of reading a plurality of continuous data at once.
If the target indexes of a certain output dimension are not continuous, or are not in the same order as the source indexes of the corresponding input dimension, the data of that output dimension is not stored continuously in the corresponding dimension of the input tensor. In this case, a data block read instruction cannot be used, and the data must be read element by element, with each data element filled into the output tensor.
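The choice between a block read and element-by-element reads can be sketched for a two-dimensional output as follows. Flat Python lists stand in for global memory, a list slice stands in for the data block read instruction, and strides are counted in elements; the function `fused_copy_2d` and its counters are assumptions for illustration, not a GPU implementation.

```python
def fused_copy_2d(src, out_shape, src_strides):
    """Copy into a row-major output of shape out_shape.

    src_strides[d] is the distance, in elements, moved in src when the
    target index of output dimension d advances by one.
    """
    rows, cols = out_shape
    dst = [None] * (rows * cols)
    block_reads = 0
    element_reads = 0
    for r in range(rows):
        base = r * src_strides[0]
        if src_strides[1] == 1:
            # Innermost output dimension is contiguous and in the same order
            # in the source: one block read covers the whole row.
            dst[r * cols:(r + 1) * cols] = src[base:base + cols]
            block_reads += 1
        else:
            # Not contiguous in the source: fall back to single elements.
            for c in range(cols):
                dst[r * cols + c] = src[base + c * src_strides[1]]
                element_reads += 1
    return dst, block_reads, element_reads


src = list(range(6))  # a 2 x 3 row-major tensor
print(fused_copy_2d(src, (2, 3), (3, 1)))  # ([0, 1, 2, 3, 4, 5], 2, 0)
print(fused_copy_2d(src, (3, 2), (1, 3)))  # ([0, 3, 1, 4, 2, 5], 0, 6)
```

The second call is a transpose, so the innermost output dimension walks the source with a stride of 3 and every element is read individually.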
In the embodiment of the invention, the intelligent analysis and optimization of the tensor memory layout can ensure that continuous view operation is completed in one memory access, avoid unnecessary memory copying and layout conversion and improve the overall performance.
Based on any of the above embodiments, there are one or more input tensors and one or more output tensors.
In particular, in practical applications, operators may need to process different amounts and types of input data. For example, in an image processing task, an operator may need to process multiple input images (e.g., multi-view images or images of different scales) simultaneously, or in a deep learning model, a layer may need to receive output from multiple predecessor layers. The operator fusion method provided by the embodiment of the invention can support the scenes of one or more input tensors and one or more output tensors, so that the fusion operator can flexibly adapt to various complex scenes.
Based on any of the embodiments above, the metadata includes a shape, a dimension, and a step size of the tensor.
In particular, metadata of a tensor refers to information describing the properties or structure of the tensor itself, rather than the actual data stored in the tensor. The metadata of the tensor includes, but is not limited to, shape, dimension, step size, etc. of the tensor, where shape is used to describe the dimensional structure and dimension size of the tensor, and dimension is used to describe the number of dimensions of the tensor, for example, the shape of the tensor may be [3, 4], which represents that the tensor is a matrix of 3 rows and 4 columns, and the dimension of the tensor is 2, which represents that it is a two-dimensional tensor.
The step size refers to the number of bytes that need to be skipped when accessing the tensor element in the memory and moving one unit along each dimension, for example (4, 1) indicates that adjacent row elements are separated by 4 bytes and adjacent column elements are separated by 1 byte in the memory. It should be appreciated that in the deep learning framework, tensor metadata may be obtained by invoking the corresponding interface function (e.g., shape, stride).
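To make the role of the step size concrete, the sketch below computes the strides of a contiguous row-major tensor and the flat memory offset of an element. Strides are counted in elements here, which is how PyTorch's `Tensor.stride()` reports them; NumPy's `ndarray.strides` reports bytes instead, so multiplying by the element size converts between the two. The helper names are assumptions for illustration.

```python
def contiguous_strides(shape):
    """Strides, in elements, of a contiguous row-major tensor."""
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    return tuple(strides)


def flat_offset(index, strides):
    """Offset of an element from the start of the tensor's storage."""
    return sum(i * s for i, s in zip(index, strides))


print(contiguous_strides([3, 4]))     # (4, 1)
print(flat_offset((2, 1), (4, 1)))    # 9
print(contiguous_strides([2, 3, 4]))  # (12, 4, 1)
```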
Based on any one of the embodiments, the embodiment of the invention provides an operator fusion method for View operators in a large model, which solves the problem of performance degradation caused by frequent data handling in the model by carrying out efficient and rapid fusion on a plurality of continuous View operators. The core of the operator fusion optimization strategy is that the operator fusion is realized rapidly, and the position relation of data points in the input tensor and the output tensor in the fusion operator is deduced according to the shape transformation relation of the tensor, so that a plurality of continuous View operators are optimized into one fusion operator. In this way, the model not only reduces the number of view transformations, but also effectively reduces the overhead of data handling when executing.
The application scene suitable for the View class operator fusion optimization strategy provided by the embodiment of the invention is that a plurality of continuous View class operators appear in a model structure. The specific implementation steps of the fusion optimization strategy comprise:
In step S1, the coverage area, the input tensor and the output tensor of the fusion operator are determined, where more than one input tensor or output tensor may be used.
It should be noted that the coverage range of the above fusion operator may be specified by the user, that is, when invoking the operator in the upper-layer framework, the user directly specifies which View operators the fusion operator contains and how they are connected. Alternatively, the coverage range may be determined by the framework automatically detecting combinations of View operators: when all operators between one node and another node are detected to be View operators, they can be automatically replaced by the corresponding fusion operator.
Step S2, obtaining relevant information of the input tensor, wherein the relevant information comprises the shape, dimension, step length and the like of the input tensor, and the relevant information is used as a reference.
Step S3, obtaining relevant information of the output tensor, wherein the relevant information comprises the shape, dimension, step length and the like of the output tensor, and the relevant information is used as a reference.
Step S4, determining the corresponding relation between each dimension of the input tensor and the output tensor according to the logic relation between the View operators used in the coverage range of the fusion operator and the related information of the input tensor and the output tensor;
And S5, traversing the data points of each dimension of the output tensor according to the corresponding relation, directly reading corresponding original data from the input tensor, and storing the corresponding original data into the data points of the output tensor.
The implementation of the optimization strategy is described below by taking a fusion operator in a specific model application (e.g., the DeepSeek model) as an example.
The DeepSeek model architecture uses an MLA (Multi-head Latent Attention) mechanism, an approach to attention in deep learning that is particularly suited to complex inputs with multiple latent spaces. Whereas conventional attention mechanisms, such as self-attention, use query (Q), key (K), and value (V) matrices of the same form in computation, MLA introduces the concept of multiple latent spaces into its attention computation, allowing the model to learn at a finer granularity in different latent subspaces.
In the conventional self-attention mechanism, the Q, K, and V tensors are identical in shape, typically all (batch_size, sequence_length, embedding_dim). Here, the first dimension batch_size is the batch size, representing the number of samples processed simultaneously; the second dimension sequence_length is the sequence length, representing the number of sequence elements in each sample; and the third dimension embedding_dim is the embedding dimension, representing the dimension of the embedding vector of each sequence element. In MLA, however, the K and V tensors have different shapes because they use different latent-space mappings. Before entering the MLA, the K and V tensors undergo a series of transformations that include multiple consecutive View operators. By fusing these consecutive View operators into one fusion operator, data transfers to and from memory can be significantly reduced.
FIG. 4 is a schematic diagram showing a specific structure and coverage range of a fusion operator provided by the present invention. As shown in FIG. 4, the input tensors are KV_B and K_PE, where the shape of KV_B is [batch_size, seqlen, headdim_b] and the shape of K_PE is [batch_size, seqlen, headdim_pe]. The output tensors are K and V, where the shape of K is [batch_size*repeat, seqlen, headdim_k] and the shape of V is [batch_size*repeat, seqlen, headdim_v], with headdim_k = headdim_pe + headdim_v.
In the process of transforming an input tensor to obtain an output tensor, the related operations of the View operator include:
1) KV_B is converted into a tensor of shape [batch_size*repeat, seqlen, 2*headdim_v] through a reshape operator and a permute operator, where headdim_b = repeat*(2*headdim_v); the result is denoted KV (only reshape is shown in fig. 4, although the actual operation is reshape + permute);
2) The KV is split into two tensors through a split operator, namely K_NOPE and V, the split dimension is the innermost dimension (namely dim=2), and the V tensor is used as an output tensor to be directly output;
3) K_PE is repeated repeat times along dim=0 through a repeat operator to obtain the tensor K_PE_repeat;
4) The k_pe_repeat and the k_nope are spliced on the innermost dimension (i.e., dim=2) through a concat operator, combined into K, and directly output as an output tensor.
Assuming that the shape of the input tensor K_PE is [batch_size, seqlen, headdim_pe] = [B, 4096, 64], K_PE is repeated 128 times along dim=0 through the repeat operator, and the resulting tensor K_PE_repeat has the shape [batch_size*repeat, seqlen, headdim_pe] = [B*128, 4096, 64].
Assuming that the shape of the input tensor KV_B is [batch_size, seqlen, headdim_b] = [B, 4096, 32768], the tensor KV obtained after the reshape and permute operators has the shape [batch_size*repeat, seqlen, 2*headdim_v] = [B*128, 4096, 256]. The split operator then splits KV along dim=2 to obtain two tensors K_NOPE and V, each with the shape [batch_size*repeat, seqlen, headdim_v] = [B*128, 4096, 128], and the tensor V is output directly as an output tensor. The concat operator splices K_PE_repeat and K_NOPE along dim=2 to obtain the output tensor K with the shape [batch_size*repeat, seqlen, headdim_pe + headdim_v] = [B*128, 4096, 192].
As described above, since operators that produce non-contiguous memory, such as permute, split, repeat, and concat, appear within the range of the fusion operator (shown by the dashed box in fig. 4), the operations cannot be handled by framework-level view transformations alone, and actual memory movement is necessary. The operators in this range need to perform 4 tensor reads and 4 tensor stores on the global memory, and for the DeepSeek model, whose tensors are large, this memory traffic has a considerable negative effect on performance.
After the operator fusion optimization strategy provided by the embodiment of the invention is applied, the fusion operator shown in fig. 5 can be obtained. Fig. 5 is a schematic diagram of the mapping relationship between the input tensors and the output tensors. As shown in fig. 5, the green tensors are the input tensors, and the blue and orange tensors are the K and V output tensors, respectively. The gray tensors are virtual tensors, for which no actual global memory space is allocated any longer; their data is instead carried through registers inside the GPU kernel. After the fusion operator is determined, the specific processing flow is as follows:
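The shape arithmetic of this example can be reproduced with a few lines. The function `mla_shapes` is a hypothetical helper; it only encodes the relations stated above (headdim_b = repeat*(2*headdim_v) and headdim_k = headdim_pe + headdim_v), and the batch size of 1 in the usage is an arbitrary stand-in for B.

```python
def mla_shapes(batch_size, seqlen, repeat, headdim_pe, headdim_v):
    """Shapes of every tensor in the fig. 4 example, keyed by tensor name."""
    rows = batch_size * repeat
    return {
        "KV_B": [batch_size, seqlen, repeat * 2 * headdim_v],
        "K_PE": [batch_size, seqlen, headdim_pe],
        "KV": [rows, seqlen, 2 * headdim_v],
        "K_NOPE": [rows, seqlen, headdim_v],
        "V": [rows, seqlen, headdim_v],
        "K_PE_repeat": [rows, seqlen, headdim_pe],
        "K": [rows, seqlen, headdim_pe + headdim_v],
    }


shapes = mla_shapes(batch_size=1, seqlen=4096, repeat=128,
                    headdim_pe=64, headdim_v=128)
print(shapes["KV_B"])  # [1, 4096, 32768]
print(shapes["KV"])    # [128, 4096, 256]
print(shapes["K"])     # [128, 4096, 192]
```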
1) Traversing the target indexes on each dimension of the output tensor K and V;
2) Determining corresponding source indexes of the currently traversed target indexes in the input tensors KV_B and K_PE according to the corresponding relation, and reading corresponding data from the input tensors KV_B and K_PE to a register according to the source indexes;
3) And writing the data in the register into K and V according to the currently traversed target index.
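The three steps above might look roughly like the following sketch, in which a local variable plays the role of the register. Two layout details are not specified in the text and are assumed here: output row r is taken to correspond to batch b = r // repeat and head r % repeat, with each head occupying 2*headdim_v consecutive columns of KV_B, and the repeat of K_PE is taken to place the copies of one batch entry on consecutive rows. A real model may order these differently, in which case only the index arithmetic changes. The function name `fused_kv` is hypothetical.

```python
def fused_kv(kv_b, k_pe, repeat, headdim_v):
    """Fill K and V directly from KV_B and K_PE (nested-list tensors)."""
    batch = len(kv_b)
    seqlen = len(kv_b[0])
    headdim_pe = len(k_pe[0][0])
    headdim_k = headdim_pe + headdim_v
    rows = batch * repeat
    k = [[[None] * headdim_k for _ in range(seqlen)] for _ in range(rows)]
    v = [[[None] * headdim_v for _ in range(seqlen)] for _ in range(rows)]
    for r in range(rows):
        b, head = divmod(r, repeat)    # ASSUMED row layout, see lead-in
        base = head * 2 * headdim_v    # first KV_B column of this head
        for s in range(seqlen):
            for c in range(headdim_k):         # target indexes of K
                if c < headdim_pe:
                    value = k_pe[b][s][c]      # K_PE_repeat part of K
                else:
                    value = kv_b[b][s][base + c - headdim_pe]  # K_NOPE part
                k[r][s][c] = value             # write from the "register"
            for c in range(headdim_v):         # target indexes of V
                value = kv_b[b][s][base + headdim_v + c]
                v[r][s][c] = value
    return k, v


# Tiny example: batch=1, seqlen=1, repeat=2, headdim_v=2, headdim_pe=1.
k_out, v_out = fused_kv([[[0, 1, 2, 3, 4, 5, 6, 7]]], [[[100]]],
                        repeat=2, headdim_v=2)
print(k_out)  # [[[100, 0, 1]], [[100, 4, 5]]]
print(v_out)  # [[[2, 3]], [[6, 7]]]
```

The intermediate tensors KV, K_NOPE, and K_PE_repeat never exist in memory in this version; they appear only as index arithmetic.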
Compared with the multiple memory accesses of the original flow, this flow greatly reduces the overhead of data transfers. Furthermore, GPUs generally provide dedicated, optimized memory access for tensors of a specific layout. For example, for a tensor of type Matrix3D, the GPU kernel may read and write multiple data at a time through coalesced accesses. By using such dedicated memory access instructions, the memory access performance of the fusion operator can be further improved.
The operator fusion optimization strategy provided by the embodiment of the invention can analyze and optimize the tensor memory layout, ensuring that consecutive view operations can be completed in one memory access and avoiding unnecessary memory copies and layout conversions. Moreover, the invention also takes into account the requirements that the hardware architecture places on memory access, such as the memory access mode and the cache optimization mechanism of the GPU. When implementing the strategy, the View fusion can be specially scheduled and optimized for the characteristics of different hardware platforms, so as to maximize the memory access efficiency of data transfers. For the large number of frequent View operations in a large model, the invention can effectively reduce the number of memory accesses and the cost of data transfers, thereby alleviating the performance degradation caused by frequent data transformations.
The operator fusion device provided by the invention is described below, and the operator fusion device described below and the operator fusion method described above can be referred to correspondingly.
Based on any of the above embodiments, fig. 6 is a schematic structural diagram of an operator fusion apparatus provided by the present invention, as shown in fig. 6, where the apparatus includes:
A range determining unit 610, configured to determine an input tensor, an output tensor and a coverage range of an operator to be fused, where the operator to be fused includes a plurality of view class operators;
a relationship determining unit 620, configured to determine a mapping relationship between each dimension of the input tensor and the output tensor based on the metadata of the input tensor, the metadata of the output tensor, and the logical relationship between each view operator in the coverage area;
And an operator fusion unit 630, configured to perform data filling on the output tensor based on the mapping relationship and each data in the input tensor, and take the filled output tensor as a fusion result of the operator to be fused.
With the device provided by the embodiment of the invention, the mapping relationship between the input tensor and the output tensor of the operator to be fused can be accurately derived from the metadata of the input tensor, the metadata of the output tensor, and the logical relationship among the view operators, so that each data element in the input tensor can be accurately filled into the corresponding position of the output tensor according to the mapping relationship. Operator fusion is thereby achieved: multiple view operators that originally had to be executed one after another are integrated and optimized into a single fusion operator. This fusion significantly reduces the number of view transformation operations executed and the number of data movements while the model runs, which effectively reduces the extra overhead of data movement and alleviates the performance degradation caused by frequent data movement in the model.
Based on any of the above embodiments, the coverage range is determined based on user input or is automatically detected by the framework, and the apparatus further includes a logical relationship determining unit, where the logical relationship determining unit is configured to:
Determining the logic relationship among the view type operators based on the connection relationship of the view type operators in the coverage area, the operator attribute of the view type operators and the tensor attribute connected among the view type operators.
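As a minimal sketch of how connection relationships and operator attributes could yield a single index mapping, the following Python example walks from the output tensor back to the input tensor through the operators that produce each intermediate tensor, and composes their per-operator index maps. All names (`make_permute`, `make_slice`, `compose_mapping`, the tensor names "A", "B", "C") are hypothetical, and only two kinds of view operator are modeled.

```python
def make_permute(order):
    """Permute attribute: output dimension k is input dimension order[k]."""
    def index_map(idx):
        src = [0] * len(idx)
        for k, dim in enumerate(order):
            src[dim] = idx[k]
        return tuple(src)
    return index_map

def make_slice(dim, start):
    """Slice attribute: the output starts at offset `start` along `dim`."""
    def index_map(idx):
        src = list(idx)
        src[dim] += start
        return tuple(src)
    return index_map

# Connection relationship: each operator names the tensor it reads and writes.
# The list order is deliberately not the execution order.
OPS = [
    {"src": "B", "dst": "C", "index_map": make_slice(dim=0, start=1)},
    {"src": "A", "dst": "B", "index_map": make_permute((1, 0))},
]

def compose_mapping(ops, input_name, output_name):
    """Return a function mapping an output index to the matching input index."""
    producer = {op["dst"]: op for op in ops}
    chain = []
    name = output_name
    while name != input_name:          # follow the connections backwards
        op = producer[name]
        chain.append(op["index_map"])
        name = op["src"]

    def mapping(idx):
        for index_map in chain:        # apply the output-side operator first
            idx = index_map(idx)
        return idx
    return mapping
```

Here C = B[1:, :] and B is the transpose of A, so `compose_mapping(OPS, "A", "C")((i, j))` gives `(j, i + 1)`, which is the element of A that a fused kernel would read for C[i, j].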
Based on any of the above embodiments, the operator fusion unit 630 includes:
A traversing subunit, configured to traverse the target index on each dimension of the output tensor;
An index determining subunit, configured to determine, based on the mapping relationship, a source index corresponding to a currently traversed target index in the input tensor;
And the data filling subunit is used for reading corresponding data from the input tensor based on the source index and filling the data into the output tensor according to the currently traversed target index.
Based on any of the above embodiments, the data filling subunit is specifically configured to:
based on the source index, reading corresponding data from the input tensor to a temporary storage area, wherein the temporary storage area is any one of a register, a shared memory and a cache;
Writing data in the temporary storage area into the output tensor based on the currently traversed target index.
Based on any of the above embodiments, the operator fusion unit 630 is specifically configured to:
determining a corresponding dimension of any dimension of the output tensor in the input tensor based on the mapping relation;
and under the condition that the target index of any dimension is continuous and the sequence is the same as the source index of the corresponding dimension, reading each data of the corresponding dimension of the input tensor by adopting a data block reading instruction, and filling each data into any dimension of the output tensor.
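The condition above can be illustrated with a small NumPy sketch. It assumes a two-dimensional case in which output row r is taken from input row `row_map[r]`; when the last dimension is contiguous and in the same order in both tensors, a whole-row slice copy stands in for the data block reading instruction, and otherwise the code falls back to element-by-element copying. The function name `fill_rows` and the returned counter are hypothetical and exist only to make the chosen path observable.

```python
import numpy as np

def fill_rows(src, dst, row_map):
    """Fill dst[r, :] from src[row_map[r], :]; return the number of block copies."""
    # Contiguous and same order: the stride of the last dimension is exactly
    # one element in both the source and the destination.
    block_ok = (src.strides[-1] == src.itemsize
                and dst.strides[-1] == dst.itemsize)
    block_copies = 0
    for r, s in enumerate(row_map):
        if block_ok:
            dst[r, :] = src[s, :]      # stand-in for one block read/write
            block_copies += 1
        else:
            for c in range(dst.shape[1]):
                dst[r, c] = src[s, c]  # element-wise fallback
    return block_copies
```

For a reversed view such as `src[:, ::-1]` the stride of the last dimension is negative, so the check fails and the element-wise path is used, while the result remains correct.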
Based on any of the above embodiments, there are one or more input tensors and one or more output tensors.
Based on any of the above embodiments, the metadata includes the shape, dimensions, and stride (step size) of the tensor.
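To illustrate why this metadata is sufficient to locate data without copying it, the sketch below computes the position of an element in the underlying storage from its index and the strides. It uses NumPy, where strides are stored in bytes; the helper name `element_offset` and the parameter `storage_offset` are hypothetical.

```python
import numpy as np

def element_offset(index, strides_in_elements, storage_offset=0):
    """Flat position in the underlying storage of the element at `index`."""
    return storage_offset + sum(i * s for i, s in zip(index, strides_in_elements))

base = np.arange(24)                    # underlying one-dimensional storage
t = base.reshape(2, 3, 4)               # view with element strides (12, 4, 1)
view = t.transpose(2, 0, 1)             # view with shape (4, 2, 3), no data copied

# NumPy reports strides in bytes; convert to element counts.
strides = tuple(s // view.itemsize for s in view.strides)   # (1, 12, 4)
offset = element_offset((3, 1, 2), strides)                 # 3*1 + 1*12 + 2*4 = 23
```

The transposed view changes only the shape and the strides, so `base[offset]` and `view[3, 1, 2]` refer to the same stored element.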
Fig. 7 is a schematic diagram of the physical structure of an electronic device. As shown in Fig. 7, the electronic device may include a processor 710, a communication interface 720, a memory 730, and a communication bus 740, where the processor 710, the communication interface 720, and the memory 730 communicate with each other via the communication bus 740. The processor 710 may invoke logic instructions in the memory 730 to perform an operator fusion method, the method including: determining an input tensor, an output tensor and a coverage range of an operator to be fused, where the operator to be fused includes a plurality of view operators; determining a mapping relationship between each dimension of the input tensor and the output tensor based on the metadata of the input tensor, the metadata of the output tensor, and the logical relationship between the view operators in the coverage range; and performing data filling on the output tensor based on the mapping relationship and each data element in the input tensor, and taking the filled output tensor as the fusion result of the operator to be fused.
Further, the logic instructions in the memory 730 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the related art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
In another aspect, the invention further provides a computer program product. The computer program product includes a computer program that can be stored on a non-transitory computer readable storage medium and that, when executed by a processor, performs the operator fusion method provided by the above methods, the method including: determining an input tensor, an output tensor and a coverage range of an operator to be fused, where the operator to be fused includes a plurality of view operators; determining a mapping relationship between each dimension of the input tensor and the output tensor based on the metadata of the input tensor, the metadata of the output tensor, and the logical relationship between the view operators in the coverage range; and performing data filling on the output tensor based on the mapping relationship and each data element in the input tensor, and taking the filled output tensor as the fusion result of the operator to be fused.
In yet another aspect, the present invention further provides a non-transitory computer readable storage medium on which a computer program is stored. When executed by a processor, the computer program performs the operator fusion method provided by the above methods, the method including: determining an input tensor, an output tensor and a coverage range of an operator to be fused, where the operator to be fused includes a plurality of view operators; determining a mapping relationship between each dimension of the input tensor and the output tensor based on the metadata of the input tensor, the metadata of the output tensor, and the logical relationship between the view operators in the coverage range; and performing data filling on the output tensor based on the mapping relationship and each data element in the input tensor, and taking the filled output tensor as the fusion result of the operator to be fused.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on such understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the related art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.

Claims (10)

1. An operator fusion method, comprising:
determining an input tensor, an output tensor and a coverage range of an operator to be fused, wherein the operator to be fused comprises a plurality of view operators;
determining a mapping relation between each dimension of the input tensor and the output tensor based on the metadata of the input tensor, the metadata of the output tensor and the logic relation among view operators in the coverage range;
performing data filling on the output tensor based on the mapping relation and each data in the input tensor, and taking the filled output tensor as a fusion result of the operator to be fused.
2. The operator fusion method according to claim 1, wherein the coverage range is determined based on user input or is automatically detected by a framework, and the step of determining the logical relationship between view operators in the coverage range includes:
Determining the logic relationship among the view type operators based on the connection relationship of the view type operators in the coverage area, the operator attribute of the view type operators and the tensor attribute connected among the view type operators.
3. The operator fusion method according to claim 1, wherein the data filling the output tensor based on the mapping relation and each data in the input tensor includes:
Traversing the target index on each dimension of the output tensor;
Determining a source index corresponding to the currently traversed target index in the input tensor based on the mapping relation;
Based on the source index, corresponding data is read from the input tensor, and the data is filled into the output tensor according to the currently traversed target index.
4. The operator fusion method of claim 3, wherein the reading corresponding data from the input tensor based on the source index and populating the data into the output tensor according to the currently traversed target index comprises:
based on the source index, reading corresponding data from the input tensor to a temporary storage area, wherein the temporary storage area is any one of a register, a shared memory and a cache;
Writing data in the temporary storage area into the output tensor based on the currently traversed target index.
5. The operator fusion method according to claim 1, wherein the data filling the output tensor based on the mapping relation and each data in the input tensor includes:
determining a corresponding dimension of any dimension of the output tensor in the input tensor based on the mapping relation;
and under the condition that the target index of any dimension is continuous and the sequence is the same as the source index of the corresponding dimension, reading each data of the corresponding dimension of the input tensor by adopting a data block reading instruction, and filling each data into any dimension of the output tensor.
6. The operator fusion method of any one of claims 1 to 5, wherein there are one or more input tensors and one or more output tensors.
7. The operator fusion method of any one of claims 1 to 5, wherein the metadata includes the shape, dimensions and stride (step size) of the tensor.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and running on the processor, wherein the processor implements the operator fusion method of any one of claims 1 to 7 when executing the computer program.
9. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the operator fusion method according to any of claims 1 to 7.
10. A computer program product comprising a computer program which, when executed by a processor, implements the operator fusion method of any one of claims 1 to 7.
CN202510570933.2A 2025-04-30 2025-04-30 Operator fusion method, device, electronic device and storage medium Pending CN120408524A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510570933.2A CN120408524A (en) 2025-04-30 2025-04-30 Operator fusion method, device, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510570933.2A CN120408524A (en) 2025-04-30 2025-04-30 Operator fusion method, device, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN120408524A true CN120408524A (en) 2025-08-01

Family

ID=96505844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510570933.2A Pending CN120408524A (en) 2025-04-30 2025-04-30 Operator fusion method, device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN120408524A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120653883A (en) * 2025-08-08 2025-09-16 上海壁仞科技股份有限公司 Data processing method, electronic device and storage medium
CN121029432A (en) * 2025-10-29 2025-11-28 上海壁仞科技股份有限公司 Fusion operator execution methods, electronic devices, storage media and program products
CN121257615A (en) * 2025-12-03 2026-01-02 上海壁仞科技股份有限公司 Tensor splicing method, device, equipment, storage medium and program product
CN121277563A (en) * 2025-12-08 2026-01-06 上海东方算芯科技有限公司 Method for operating attention mechanism in chip, electronic device, storage medium, and program product

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120653883A (en) * 2025-08-08 2025-09-16 上海壁仞科技股份有限公司 Data processing method, electronic device and storage medium
CN121029432A (en) * 2025-10-29 2025-11-28 上海壁仞科技股份有限公司 Fusion operator execution methods, electronic devices, storage media and program products
CN121029432B (en) * 2025-10-29 2026-02-27 上海壁仞科技股份有限公司 Fusion operator execution method, electronic device, storage medium and program product
CN121257615A (en) * 2025-12-03 2026-01-02 上海壁仞科技股份有限公司 Tensor splicing method, device, equipment, storage medium and program product
CN121257615B (en) * 2025-12-03 2026-03-06 上海壁仞科技股份有限公司 Tensor splicing method, device, equipment, storage medium and program product
CN121277563A (en) * 2025-12-08 2026-01-06 上海东方算芯科技有限公司 Method for operating attention mechanism in chip, electronic device, storage medium, and program product

Similar Documents

Publication Publication Date Title
US11960934B2 (en) Systems and methods for improved neural network execution
CN110659728B (en) Neural network optimization method, device, computer equipment and storage medium
CN120408524A (en) Operator fusion method, device, electronic device and storage medium
US12159223B2 (en) Processing data of a neural network
US20200020069A1 (en) Compiler techniques for mapping program code to a high performance, power efficient, programmable image processing hardware platform
CN111008040B (en) Cache device and cache method, computing device and computing method
WO2022007265A1 (en) Dilated convolution acceleration calculation method and apparatus
CN120508407B (en) Optimization method for data type conversion operator, computer device, readable storage medium and computer program product
US20230259780A1 (en) Neural network sparsification apparatus and method and related product
CN112463160A (en) Compiling method, compiling device, electronic equipment and storage medium
CN120235254B (en) Data loading method, data storage method, processor, electronic device, and medium
CN118114737A (en) Method and equipment for optimizing back propagation of Attention operator
KR102372869B1 (en) Matrix operator and matrix operation method for artificial neural network
CN120338052A (en) Model performance optimization method, electronic device, storage medium and program product
CN112200310A (en) Intelligent processor, data processing method and storage medium
CN115186796A (en) Automatic convolutional neural network deployment method based on FPGA
CN120892670A (en) Optimization methods, apparatus, computer devices, and readable storage media for matrix storage operators
CN120277011B (en) Data loading method, data storage method, electronic device and storage medium
CN115374912A (en) Fusion operator design method for heterogeneous computing and heterogeneous computing system
CN112257859B (en) Feature data processing method, device, equipment, and storage medium
WO2022179075A1 (en) Data processing method and apparatus, computer device and storage medium
CN118674072A (en) Tensor processing method, tensor processing device, electronic device, storage medium, and program product
WO2023087698A1 (en) Computing apparatus and method for executing convolution operation, and related products
CN115373598A (en) Neural network data storage format conversion method suitable for general hardware circuit
CN121210159B (en) Data processing methods, apparatus, computer equipment, readable storage media and program products

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination