CN111563584A - A method for splitting a neural network model and related products

Info

Publication number: CN111563584A
Application number: CN201910114927.0A
Authority: CN
Inventors: Not disclosed (inventor name withheld from publication)
Original assignee: Shanghai Cambricon Information Technology Co Ltd
Current assignee: Shanghai Cambricon Information Technology Co Ltd
Priority date: 2019-02-14
Filing date: 2019-02-14
Publication date: 2020-08-21
Anticipated expiration: 2039-02-14
Also published as: CN111563584B

Abstract

The present disclosure provides a method for splitting a neural network model and related products. In this solution, an operator is split into multiple smaller sub-operators, so that the computing library of the single-core architecture can be called directly, avoiding the extra work of re-implementation.

Description

A method for splitting a neural network model and related products

Technical Field

Embodiments of the present disclosure relate to a method for splitting a neural network model and to related products.

Background

In recent years, deep learning accelerators have been proposed one after another and, like general-purpose processors, are expanding from single-core to multi-core designs. The resulting multi-core structure can support data parallelism in the training phase to raise data throughput and speed up training. In the inference phase, however, deep neural networks place higher demands on end-to-end latency than on throughput, and latency often determines whether an accelerator is usable in a given scenario. Traditional data-parallel schemes cannot meet the small-data, low-latency requirements that inference places on accelerators.

Summary of the Invention

To solve the technical problems described above, the present disclosure proposes a method for splitting a neural network model and related products.

To achieve the above object, the present disclosure provides a method for splitting a neural network model, the method including:

determining, according to the operator of a target layer in the neural network model, the split state sets of the tensor data associated with the operator of the target layer, where the target layer is at least one layer in the neural network model;

traversing the split state sets according to the directed acyclic graph of the neural network model, and determining the state paths between adjacent split state sets and the weights of the state paths, where a state path represents a splitting method of the operator, each state in a split state set represents a set of sub-tensor data, and the union of all sub-tensor data of a state is the tensor data;

determining the target splitting path of the target layer according to the weights of the state paths; and

splitting the operator of the target layer of the neural network model using the target splitting path.
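The notion of a split state, a set of sub-tensors whose union is the whole tensor, can be illustrated with a minimal sketch. The encoding used here (a state as a tuple of per-dimension piece counts, and a sub-tensor as a tuple of half-open index ranges) is an assumption made for illustration and is not taken from the disclosure; the function names are hypothetical.

```python
from itertools import product

def split_ranges(extent, parts):
    """Split range(extent) into `parts` contiguous, near-equal half-open intervals."""
    base, extra = divmod(extent, parts)
    ranges, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def state_subtensors(shape, splits):
    """Return the sub-tensor blocks of one split state.

    shape  -- tensor extents, e.g. (N, C, H, W)
    splits -- pieces per dimension (hypothetical encoding of a state)
    """
    per_dim = [split_ranges(e, p) for e, p in zip(shape, splits)]
    return list(product(*per_dim))

def covers_tensor(shape, blocks):
    """Check that the union of the blocks is exactly the full tensor."""
    covered = set()
    for block in blocks:
        covered.update(product(*(range(lo, hi) for lo, hi in block)))
    full = set(product(*(range(e) for e in shape)))
    return covered == full

# One state of a (1, 4, 6, 6) tensor: halve the channel and height dimensions.
blocks = state_subtensors((1, 4, 6, 6), (1, 2, 2, 1))
```

Here `blocks` holds four sub-tensors, and `covers_tensor((1, 4, 6, 6), blocks)` confirms that their union reproduces the original tensor, which is the property the method requires of every state.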

Preferably, the step of determining the target splitting path of the target layer includes:

traversing all split state sets of the target layer and, for the current split state set, traversing each state to obtain all state paths pointing to the current state, together with the splitting paths from the starting states of those state paths to the starting state of the input tensor data of the target layer;

determining the splitting path from the current state to the starting state of the input tensor data of the target layer according to the weights of the state paths and the weights of the splitting paths, where the weight of a splitting path is determined from the weights of all state paths corresponding to that splitting path; and

after all split state sets of the target layer have been traversed, obtaining the target splitting path between the split state set of the input tensor data of the target layer and the split state set of the output tensor data of the target layer.
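The forward traversal above can be sketched as a layer-by-layer dynamic program over a serial chain of split state sets. Two points in this sketch are assumptions that the passage does not state explicitly: the weight of a splitting path is taken to be the sum of its state-path weights, and the target splitting path is taken to be the one with the smallest total weight. The function name, state names and edge weights are all hypothetical.

```python
def best_split_path(state_sets, edges):
    """Forward pass over a serial chain of split state sets.

    state_sets -- list of lists; state_sets[i] holds the states of tensor i
    edges      -- dict mapping (i, src_state, dst_state) to the weight of the
                  state path from src_state in set i to dst_state in set i + 1
    Returns (total_weight, [state chosen in each set]).
    """
    # best[s] = (weight, path) of the cheapest splitting path reaching state s
    best = {s: (0.0, [s]) for s in state_sets[0]}
    for i in range(len(state_sets) - 1):
        reached = {}
        for dst in state_sets[i + 1]:
            for src in state_sets[i]:
                weight = edges.get((i, src, dst))
                if weight is None or src not in best:
                    continue  # no state path, or src itself is unreachable
                candidate = best[src][0] + weight
                if dst not in reached or candidate < reached[dst][0]:
                    reached[dst] = (candidate, best[src][1] + [dst])
        best = reached
    # Assumed selection rule: smallest accumulated weight among final states.
    return min(best.values(), key=lambda entry: entry[0])

state_sets = [["in1", "in2"], ["mid1", "mid2"], ["out1"]]
edges = {
    (0, "in1", "mid1"): 4, (0, "in1", "mid2"): 1, (0, "in2", "mid1"): 2,
    (1, "mid1", "out1"): 1, (1, "mid2", "out1"): 5,
}
weight, path = best_split_path(state_sets, edges)
```

With these made-up weights the sketch selects the path `in2`, `mid1`, `out1` with total weight 3. Because only the best partial path per state is kept, the work grows with the number of state paths rather than with the number of complete paths through the chain.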

traversing all split state sets of the target layer and, for the current split state set, traversing each state to obtain all state paths starting from the current state, together with the splitting paths from the end states of those state paths to the terminal state of the output tensor data of the target layer;

determining the splitting path from the current state to the terminal state of the output tensor data of the target layer according to the weights of the state paths and the weights of the splitting paths, where the weight of a splitting path is determined from the weights of all state paths corresponding to that splitting path;

Preferably, the number of sub-operators obtained by splitting the operator of the target layer of the neural network model is an integer power of 2.
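As a small illustration of the power-of-two preference, the sketch below enumerates candidate split states for a tensor under the assumption that each dimension is itself cut into a power-of-two number of pieces, so that the total number of pieces is also a power of two. The per-dimension encoding, the function name and the `max_parts` bound (for instance, the number of cores) are illustrative assumptions, not details from the disclosure.

```python
from itertools import product

def power_of_two_splits(ndim, max_parts):
    """List per-dimension piece counts whose product is a power of two <= max_parts."""
    options = []
    p = 1
    while p <= max_parts:
        options.append(p)
        p *= 2
    states = []
    for combo in product(options, repeat=ndim):
        total = 1
        for count in combo:
            total *= count
        # A product of powers of two is again a power of two.
        if total <= max_parts:
            states.append(combo)
    return states

candidates = power_of_two_splits(ndim=2, max_parts=4)
```

For a two-dimensional tensor and at most four pieces, `candidates` contains six states: `(1, 1)`, `(1, 2)`, `(1, 4)`, `(2, 1)`, `(2, 2)` and `(4, 1)`.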

Preferably, the states in the split state set of the input tensor data of the operator of the target layer of the neural network model are determined according to the computation logic of the operator and the states in the split state set of the corresponding output tensor data.

Preferably, the states in the split state set of the output tensor data of the operator of the target layer of the neural network model are determined according to the computation logic of the operator and the states in the split state set of the corresponding input tensor data.

Preferably, the method further includes:

in the forward traversal stage, when the output tensor data of the operator is used as input tensor data by at least two operators, or the operator has at least two output tensor data, retaining one split state in the split state set of the output tensor data of the operator, the retained split state being determined via the same state path of the operator.

Preferably, the method further includes:

in the reverse traversal stage, when the operator has at least two input tensor data, retaining one split state in the split state set of the input tensor data of the operator, the retained split state being determined via the same state path of the operator.

Preferably, the weight of a state path is determined according to the type and scale of the operator and the hardware parameters of the multi-core processor.

To achieve the above object, the present disclosure provides a device for splitting a neural network model, including:

a split state set module, configured to determine, according to the operator of a target layer in the neural network model, the split state sets of the tensor data associated with the operator of the target layer, where the target layer is at least one layer in the neural network model;

a state path module, configured to traverse the split state sets according to the directed acyclic graph of the neural network model and to determine the state paths between adjacent split state sets and the weights of the state paths, where a state path represents a splitting method of the operator, each state in a split state set represents a set of sub-tensor data, and the union of all sub-tensor data of a state is the tensor data;

a target splitting path module, configured to determine the target splitting path of the target layer according to the weights of the state paths; and

a splitting module, configured to split the operator of the target layer of the neural network model using the target splitting path.

Preferably, the target splitting path module includes:

a first traversal unit, configured to traverse all split state sets of the target layer and, for the current split state set, to traverse each state to obtain all state paths pointing to the current state, together with the splitting paths from the starting states of those state paths to the starting state of the input tensor data of the target layer;

a first splitting path determination unit, configured to determine the splitting path from the current state to the starting state of the input tensor data of the target layer according to the weights of the state paths and the weights of the splitting paths, where the weight of a splitting path is determined from the weights of all state paths corresponding to that splitting path;

a first target splitting path selection unit, configured to obtain, after all split state sets of the target layer have been traversed, the target splitting path between the split state set of the input tensor data of the target layer and the split state set of the output tensor data of the target layer;

a second traversal unit, configured to traverse all split state sets of the target layer and, for the current split state set, to traverse each state to obtain all state paths starting from the current state, together with the splitting paths from the end states of those state paths to the terminal state of the output tensor data of the target layer;

a second splitting path determination unit, configured to determine the splitting path from the current state to the terminal state of the output tensor data of the target layer according to the weights of the state paths and the weights of the splitting paths, where the weight of a splitting path is determined from the weights of all state paths corresponding to that splitting path; and

a second target splitting path selection unit, configured to obtain, after all split state sets of the target layer have been traversed, the target splitting path between the split state set of the input tensor data of the target layer and the split state set of the output tensor data of the target layer.

Preferably, the device further includes:

a first split state set optimization module, configured to retain, in the forward traversal stage, one split state in the split state set of the output tensor data of the operator when the output tensor data of the operator is used as input tensor data by at least two operators, or when the operator has at least two output tensor data, the retained split state being determined via the same state path of the operator.

Preferably, the device further includes:

a second split state set optimization module, configured to retain, in the reverse traversal stage, one split state in the split state set of the input tensor data of the operator when the operator has at least two input tensor data, the retained split state being determined via the same state path of the operator.

To achieve the above object, the present disclosure provides a hardware device for splitting a neural network model, including a memory and a processor, the memory storing a computer program that can run on the processor, where the processor implements the steps of the method described above when executing the computer program.

To achieve the above object, the present disclosure provides a computer-readable storage medium on which a computer program is stored, where the computer program implements the steps of the method described above when executed by a processor.

The technical solution of the present disclosure extends a deep learning accelerator from a single-core to a multi-core structure at low cost, and produces an efficient splitting scheme for a given network and the characteristics of the underlying accelerator. The scheme can effectively reduce the end-to-end latency of various networks on multi-core accelerators.

Description of the Drawings

To explain the technical solutions of the embodiments of the present disclosure more clearly, the accompanying drawings of the embodiments are briefly introduced below. The drawings described below relate only to some embodiments of the present disclosure and do not limit the present disclosure.

FIG. 1 is a schematic diagram of a shared-memory multi-core structure;

FIG. 2 is the first flowchart of a method for splitting a neural network model provided by an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of splitting a serial neural network model;

FIG. 4 is the second flowchart of a method for splitting a neural network model provided by an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of a split in which a glue operator is introduced between an operator and its input tensor data;

FIG. 6 is a schematic diagram of compensation;

FIG. 7 is the third flowchart of a method for splitting a neural network model provided by an embodiment of the present disclosure;

FIG. 8 is a schematic diagram of pyramid splitting;

FIG. 9 is the fourth flowchart of a method for splitting a neural network model provided by an embodiment of the present disclosure;

FIG. 10 is a schematic diagram of a hardware device for splitting a neural network model provided by an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. The example embodiments of the present disclosure, together with their various features and advantageous details, are explained more fully with reference to the non-limiting example embodiments shown in the drawings and detailed in the following description. Note that the features shown in the figures are not necessarily drawn to scale. The present disclosure omits descriptions of known materials, components and process technologies so as not to obscure the example embodiments. The examples given are intended only to aid understanding of how the example embodiments are implemented and to further enable those skilled in the art to practice them. Accordingly, these examples should not be construed as limiting the scope of the embodiments of the present disclosure.

Unless otherwise specifically defined, technical or scientific terms used in the present disclosure have the ordinary meaning understood by a person of ordinary skill in the art to which this disclosure belongs. The terms "first", "second" and similar words used in this disclosure do not denote any order, quantity or importance; they are used only to distinguish different components. In addition, in the various embodiments of the present disclosure, the same or similar reference numerals denote the same or similar components.

Specific implementations of the method for splitting a neural network model and the related products provided by the embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.

In recent years, thanks to the great success of deep learning in many fields, deep learning accelerators have become a rapidly developing area. These new accelerators often have a considerable advantage over GPUs in performance per watt. Like general-purpose processors before them, deep learning accelerators can be extended from single-core to multi-core architectures, and this extension suits the data-parallel training used in deep learning very well. Data parallelism means dividing the training data set into several parts and using multiple processing cores, each handling one sub-data set, to speed up training. In a multi-core structure, each core processes a different part of the training data in parallel, which raises the throughput of the whole system and speeds up training. A multi-core accelerator architecture can therefore readily scale the training-phase computing throughput of the whole system while preserving the good performance per watt of each core.

For a chip with a multi-core processor structure, the shared-memory multi-core structure shown in FIG. 1 is a classic design. It is well suited to data-parallel neural network training. Each core acts as one processor in the data-parallel scheme, reading different data and completing the forward and backward computation of the neural network model in parallel. During computation each core retains the good performance per watt it had under the earlier single-core architecture, while the throughput of the whole system grows with the number of cores. The problem with data parallelism is that its scalability depends on the size of the data batch being processed. This is usually not an issue during training, but the premise is hard to guarantee during inference. In general, for neural network models used in real-time services (including video surveillance, autonomous driving and so on), data arrives serially as a stream, so each processing step handles a very small amount of data, often a single picture. In that case data parallelism provides no parallelism at all: the whole task is concentrated on a single core, and the computing resources of the multi-core design cannot be turned into faster task processing.
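The dependence of data parallelism on batch size can be made concrete with a minimal sketch. It assumes that whole samples are assigned to cores round-robin, which is one simple way to model data parallelism; the function names are hypothetical.

```python
def data_parallel_assignment(batch_size, num_cores):
    """Assign whole samples to cores round-robin; return the sample count per core."""
    per_core = [0] * num_cores
    for sample in range(batch_size):
        per_core[sample % num_cores] += 1
    return per_core

def busy_cores(batch_size, num_cores):
    """Count the cores that receive at least one sample."""
    return sum(1 for n in data_parallel_assignment(batch_size, num_cores) if n > 0)

training_like = busy_cores(batch_size=64, num_cores=4)   # every core has work
inference_like = busy_cores(batch_size=1, num_cores=4)   # a single picture
```

With a batch of 64 all four cores are busy, whereas with a single picture only one core receives work, which is the situation the paragraph above describes for streaming inference.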

Once a neural network model has been trained offline on a data set, it is deployed to a cloud server to process data sent from outside, and the application scenario changes from offline training to online inference. In online inference a very important metric is latency, that is, the time from when the server receives the data to be processed until it returns the result, or more specifically, the time spent processing the data with the neural network model. Low latency ensures that the cloud server can respond to data from the client in the shortest possible time; in some more sensitive scenarios it directly determines whether the solution is usable at all. The requirements on accelerators in online inference therefore shift from processing large batches with high throughput to processing small batches with low latency.

In this situation, traditional data parallelism or model parallelism can hardly reduce the latency of inference tasks. Data parallelism presupposes large batches of data, which contradicts the small-batch nature of online inference. Model parallelism is usually adopted when a very large neural network model exceeds the memory limit of a single device; assigning operators to different cores does not reduce the latency of the network. To really reduce inference latency on a multi-core accelerator, a method is needed that reasonably distributes the inference computation for a small batch, or even a single sample, across the cores of the multi-core architecture, so that as many cores as possible take part in the computation at every moment and the resources of the multi-core architecture are fully used. One such method is to split the computation of each operator in the neural network across multiple cores. Even when processing the inference task for a single picture, this keeps multiple cores busy at every moment, and thus uses multi-core resources to reduce latency.

For multi-core accelerators, however, many problems remain. First, a deep learning accelerator customizes its hardware design to match the data-parallel characteristics of deep learning algorithms and thereby raises computing throughput. It often needs a sufficiently large data size to reach high computational efficiency, whereas further splitting inside an operator reduces the amount of computation on each core. Once splitting reaches a certain granularity, the loss of computational efficiency on each core exceeds the gain from the added parallelism. A balance must therefore be struck between split parallelism and computational efficiency, providing enough parallelism while keeping computational efficiency high enough.
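The trade-off between parallelism and per-core efficiency can be illustrated with a toy latency model. Everything in it is an assumption made for illustration: a fixed per-core start-up cost that makes small pieces inefficient, a coordination cost that grows with the number of pieces, and made-up numbers in arbitrary time units. It is not the cost model of the disclosure.

```python
def estimated_latency(work, parts, startup, sync_per_part):
    """Toy latency of one operator split into `parts` equal pieces.

    work          -- total compute time on one core at full efficiency
    startup       -- fixed per-core cost, which dominates when pieces are small
    sync_per_part -- hypothetical coordination cost per piece
    """
    return work / parts + startup + sync_per_part * parts

def best_parts(work, candidates, startup, sync_per_part):
    """Pick the candidate piece count with the lowest toy latency."""
    return min(
        candidates,
        key=lambda p: estimated_latency(work, p, startup, sync_per_part),
    )

candidates = [1, 2, 4, 8, 16, 32]
latencies = [estimated_latency(1024, p, 16, 16) for p in candidates]
choice = best_parts(1024, candidates, 16, 16)
```

Under these made-up parameters the latency falls from 1056 units at one piece to 272 units at eight pieces, then rises again to 336 at sixteen and 560 at thirty-two, so `choice` is 8. The shape of the curve, not the specific numbers, is the point: beyond some granularity further splitting makes the operator slower.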

On the other hand, a neural network model can be viewed as a complex computation graph made up of operators, usually hundreds or even thousands of them. Different kinds of operators have different algorithmic logic, so the ways of splitting them also differ. When splitting each operator, one must not only balance its own computational efficiency and parallelism but also consider how it fits with the preceding and following operators, and even its effect on the whole graph. The rapid development of deep learning has brought more and more large, complex networks, and finding a good parallel scheme by hand is unrealistic. An automated method is therefore needed that yields a good split-parallel strategy for different networks.

Portability to the underlying accelerator must also be considered. For accelerators that are not sufficiently programmable, the software-stack changes needed to go from a single core to multiple cores and to implement split parallelism inside operators involve a very large workload. Traditional data parallelism and model parallelism still have one processing core complete the computation of one operator, so they bring little extra work. Cross-core parallelism of a single operator, by contrast, requires modifying the operator implementation itself, and the difficulty of that modification depends on the programmability of the accelerator and the complexity of the original operator implementation. How to reduce the extra overhead of achieving low-latency inference on a multi-core architecture, and how to lessen the dependence of that workload on the programmability of the accelerator so that the method remains reasonably general across different multi-core accelerators in the future, is also a question to consider.

Based on the above analysis, the present solution automatically produces an end-to-end splitting scheme for large-scale neural network models. It splits an operator into multiple smaller sub-operators, so that the computing library of the single-core architecture can be called directly, avoiding the extra work of re-implementation. For example, an activation operator can be split into many smaller activation operators, which means that each subtask is completed simply by calling the original single-core activation function on multiple cores, without modifying it or re-implementing a multi-core version. In this process, both the post-split computational efficiency and parallelism of each operator and the coordination of splits between neighboring operators must be taken into account. The ultimate goal is a split-parallel scheme that effectively reduces the end-to-end inference latency of the whole neural network model.
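The activation example can be sketched as follows. In this illustration a plain Python function stands in for the existing single-core activation kernel, and worker threads stand in for accelerator cores; both are assumptions chosen to keep the sketch self-contained, and the function names are hypothetical. The point is only that the single-core function is reused unchanged on each piece.

```python
from concurrent.futures import ThreadPoolExecutor

def relu_single_core(chunk):
    """Stand-in for an existing single-core activation kernel (ReLU)."""
    return [x if x > 0 else 0 for x in chunk]

def split_activation(data, num_cores):
    """Run the unmodified single-core kernel on up to num_cores contiguous chunks."""
    # Ceiling division, at least 1 so that empty input does not give a zero step.
    size = max(1, -(-len(data) // num_cores))
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        parts = list(pool.map(relu_single_core, chunks))  # map keeps chunk order
    # Concatenating the sub-outputs rebuilds the full output tensor.
    return [y for part in parts for y in part]

values = [-2, -1, 0, 1, 2, 3, -4, 5]
result = split_activation(values, num_cores=4)
```

The split version returns the same values as calling `relu_single_core(values)` directly. An element-wise activation is the simplest case because each output element depends on a single input element; operators with other computation logic constrain how their inputs and outputs may be split.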

Take autonomous driving as an example. While driving automatically, a vehicle must analyze and process external information such as images, video and voice from its on-board sensors. To ensure safety, the vehicle must obtain the processing results in the shortest possible time in order to make decisions. A vehicle equipped with a multi-core processor chip can use this solution to distribute the computational load of the neural network model for small batches of external information evenly across multiple processor cores, complete the processing within the specified response time, and return the results to assist automatic driving. This technical solution extends a deep learning accelerator from a single-core to a multi-core structure at low cost, and it can effectively reduce the end-to-end latency of various networks on multi-core accelerators.

In the above application scenario, the multi-core processor chip is installed in the vehicle. In practice, it can instead be installed on a cloud server, and the vehicle can send external information such as images, video and voice from its on-board sensors to the cloud server over networks such as 3G/4G or WIFI. The cloud server uses this solution to distribute the computational load of the neural network model for small batches of external information evenly across multiple processing cores. Within the response time specified for vehicle operation, the cloud server returns the processing results to the vehicle over networks such as 3G/4G or WIFI. In practice, the external information collected by on-board sensors varies in size. Before deployment, the on-board processor uses this solution to determine the operator splitting path for each size of external information. The operator splitting schemes corresponding to external information of different sizes are stored in corresponding areas. After obtaining external information, the multi-core processor chip retrieves the matching operator splitting path, splits the operators in the neural network model accordingly, and distributes the computational load of the external information evenly across multiple processor cores.

通常，上层框架需要调用计算库来得到神经网络模型中每个算子在处理器上的指令实现。具体地，框架把每个算子的类型、参数告知计算库，计算库返回每个算子在处理器上执行所需要的机器指令。框架通过驱动把数据和所述机器指令加载到处理器上，启动处理器并完成所述算子的计算。Usually, the upper-layer framework needs to call the computing library to obtain the instruction implementation of each operator in the neural network model on the processor. Specifically, the framework informs the computing library of the type and parameters of each operator, and the computing library returns the machine instructions required for each operator to execute on the processor. The framework loads the data and the machine instructions onto the processor through the driver, starts the processor and completes the calculation of the operator.

If the computing platform for the operator changes from a single-core accelerator to a multi-core accelerator with a similar or even identical core structure, the computing library must be redesigned so that it can generate machine instructions that run on multiple cores. Specifically, because the cores must read different parts of the same input tensor data and write their outputs back to different parts of the same output tensor data, the computing library would have to modify every load and store instruction in the computation instructions of each operator.

The neural network splitting method provided by the embodiments of the present disclosure avoids modifying the single-core computing library as far as possible, while still allowing the neural network model to execute in parallel on a multi-core processor. Specifically, the upper-layer framework splits each operator in the neural network model into several sub-operators that can execute in parallel. For each sub-operator, the framework calls the computing library to generate the machine instructions for executing that sub-operator on a single core, and then loads the machine instructions of the sub-operators onto different cores, which realizes parallel computation of the operator on the multi-core processor. Because the framework uses the single-core computing library to generate the instructions of the sub-operators, the input and output tensor data of the operator are split into corresponding sub-tensor data as the operator is split into sub-operators.
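As an illustration of this idea, the following minimal sketch splits an element-wise operator into sub-operators that each call an unmodified single-core kernel on their own sub-tensor. The names `relu_single_core`, `split_ranges`, and `run_split` are hypothetical and do not come from the disclosure; the loop runs the sub-operators one after another, whereas on real hardware each would be dispatched to a different core.

```python
def relu_single_core(block):
    """Stand-in for a kernel from a single-core computing library (hypothetical)."""
    return [max(0.0, x) for x in block]


def split_ranges(length, parts):
    """Split the index range [0, length) into `parts` contiguous, near-equal ranges."""
    base, extra = divmod(length, parts)
    ranges, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def run_split(data, num_cores):
    """Run the operator as `num_cores` sub-operators, one per sub-tensor."""
    out = [0.0] * len(data)
    for start, stop in split_ranges(len(data), num_cores):
        # Each sub-operator sees only its own sub-tensor, so the kernel
        # itself needs no knowledge of the other cores.
        out[start:stop] = relu_single_core(data[start:stop])
    return out
```

The kernel is reused unchanged; only the framework-level code knows how the input and output tensors are partitioned.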

Based on the above description, FIG. 2 shows a flowchart of a method for splitting a neural network model provided by an embodiment of the present disclosure.

The method includes:

Step 201): According to the operator of the target layer in the neural network model, determine the split state sets of the tensor data associated with the operator of the target layer.

In this embodiment, a neural network model can generally be regarded as a directed acyclic graph composed of operators and multi-dimensional tensor data. Operators and tensor data are connected to each other by directed edges, and the direction of an edge indicates whether the data is an input or an output of the operator. Here op denotes an operator and tensor denotes tensor data. To unify how the splitting of different operators is expressed, the framework describes the splitting of an operator through the splitting of the tensor data associated with it. For simplicity, assume that all tensor data in the network is 4-dimensional. The input or output data of the final fully connected layer and softmax layer of an image classification network actually has fewer than 4 dimensions, but it is still represented as a 4-dimensional tensor. The four dimensions are denoted N, C, H, and W, where N is the batch size, C is the number of feature maps, H is the height of a feature map, and W is the width of a feature map. This assumption is made only for convenience of explanation; the framework itself can process neural network models whose tensor data has any number of dimensions. Even so, 4 dimensions are sufficient for a large proportion of neural network structures.

When this technical solution splits the operators in a neural network model, operators of different types support different computation logic and therefore have different splitting strategies. To express the splitting strategies of different operators uniformly, this technical solution uses the split states of an operator's input tensor data and output tensor data to represent the splitting of the operator's own computation logic.

With this technical solution, all operators in the neural network model can be split, or only some of them. Moreover, newly emerging network structures and algorithms in deep learning have gradually blurred the physical meaning of the individual data dimensions and the boundaries between them, and this technical solution can be extended to operator splitting over more dimensions.

Any one way of splitting a piece of tensor data is called a state s of that tensor data. Splitting the tensor data yields a set of sub-tensor data, and the state s is characterized by that set. All possible splits {s_0, s_1, s_2, …} make up the split state set S of the tensor data. In general this is a very large state space, which means that the space of possible operator splits represented by the split states of the tensor data is also very large.

The state set of the tensor data can be pruned under some reasonable assumptions. First, the latency with which a multi-core accelerator completes an operator is determined by the core whose subtask takes longest. Because the cores of a multi-core architecture are equivalent to one another in hardware, the time each core takes depends on the workload assigned to it. A reasonable assumption is therefore that the sub-operators produced by a split should be roughly equal in scale, so split states that are unbalanced can be removed from the state set S of the tensor data. Second, the number of cores in a multi-core architecture is usually an integer power of 2, such as 1, 2, 4, 8, or 16, and a task whose degree of parallelism is not an integer power of 2 tends to cause "fragmentation" in core scheduling. The number of sub-operators after splitting should therefore be an integer power of 2. These two assumptions greatly reduce the search space of operator splitting strategies.
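The two pruning assumptions can be sketched as a small enumeration routine. In this illustration a split state is encoded as a tuple giving the number of parts along each of the N, C, H, and W dimensions; that encoding, the function names, and the use of exact divisibility as a strict stand-in for "roughly balanced" are assumptions made for the example, not details taken from the disclosure.

```python
from itertools import product


def powers_of_two_up_to(limit):
    """Return [1, 2, 4, ...] up to and including `limit`."""
    values, v = [], 1
    while v <= limit:
        values.append(v)
        v *= 2
    return values


def enumerate_split_states(shape, num_cores):
    """Enumerate pruned split states for a tensor of the given (N, C, H, W) shape.

    A state is a tuple of part counts per dimension. Only states whose part
    counts are powers of two, whose total number of sub-tensors does not
    exceed `num_cores`, and which divide every dimension evenly are kept.
    """
    choices = powers_of_two_up_to(num_cores)
    states = []
    for parts in product(choices, repeat=len(shape)):
        total = 1
        for p in parts:
            total *= p
        if total > num_cores:
            continue  # more sub-operators than cores
        if any(dim % p != 0 for dim, p in zip(shape, parts)):
            continue  # unbalanced split: sub-tensors would differ in size
        states.append(parts)
    return states
```

For a tensor of shape (1, 64, 32, 32) on 4 cores this keeps 10 states, including the unsplit state (1, 1, 1, 1), instead of every possible partition of the tensor.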

It should be noted that not every choice of split states for the tensor data associated with an operator represents a valid way of splitting that operator. The dimension along which tensor data is split must be supported by the operator; for example, the input data of the softmax operator must not be split along the dimension being normalized. In addition, the splits of the operator's input and output tensors must satisfy the operator's computation logic. For example, the start and end points of each sub-block into which the output data of a convolution operator is split along the H/W dimensions must be exactly those computed from the corresponding sub-block of the input data along the H/W dimensions, using the convolution kernel and stride of the operator. The split of the convolution operator's input data along the C dimension must exactly match the split of the weight data along the C dimension, and the split of the output data along the C dimension must exactly match the split of the weight data along the N dimension. In the framework, the input state of each operator is inferred backward from its output state according to the operator's specific logic, or the output state is inferred forward from the input state. This ensures that the states of the related data always represent a valid operator split.
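The backward inference for a convolution can be sketched along one spatial dimension as follows. The function name and the explicit `pad` parameter are assumptions for illustration; the text itself mentions only the kernel and the stride.

```python
def conv_input_range(out_start, out_stop, kernel, stride, pad, in_size):
    """Return the input range [start, stop) needed for output rows [out_start, out_stop).

    Output row o reads input rows o*stride - pad through o*stride - pad + kernel - 1.
    The result is clamped to the real input, since padded rows hold no data.
    """
    start = max(0, out_start * stride - pad)
    stop = min(in_size, (out_stop - 1) * stride - pad + kernel)
    return start, stop
```

With an input height of 8, a kernel of 3, a stride of 1, and a padding of 1, splitting the output rows into [0, 4) and [4, 8) requires the input ranges [0, 5) and [3, 8). The two input sub-blocks overlap in rows 3 and 4, which is one way the sub-tensors of a single split state can share elements.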

Step 202): Traverse the split state sets according to the directed acyclic graph of the neural network model, and determine the state paths between adjacent split state sets and the weights of those state paths.

As shown in FIG. 3, the splitting scheme P of the entire neural network model can be viewed as a sequence of transitions, at each operator, from one split state in the split state set of the operator's input tensor data to one split state in the split state set of its output tensor data. The split state of the output tensor of one operator is the split state of the input tensor of the next operator. Each possible transition across an operator corresponds to one valid way of splitting that operator. A state path therefore represents how an operator is split.

In this technical solution, decomposing the tensor data in one particular way yields a set of sub-tensors, and that set corresponds to one split state. Different decompositions yield different split states, and the split states obtained from all decompositions make up the split state set. Each split state therefore corresponds to one set of sub-tensors, and that set contains all elements of the tensor data. Within a set of sub-tensors, the elements of different sub-tensors may or may not overlap.

As described above, a state path represents how an operator is split: splitting the operator's computation logic in the way that corresponds to the state path yields the corresponding set of sub-operators. A state of the input tensor data and the corresponding state of the output tensor data are connected by a state path, which means that the set of sub-tensor data for that split state of the input tensor data, once processed by the sub-operators in the sub-operator set, yields the set of sub-tensor data for the corresponding split state of the output tensor data.

In this technical solution, the weight of a state path is characterized by the time the operator takes to execute in parallel on the multi-core accelerator under the corresponding split. The time the multi-core accelerator needs to complete an operator is determined by the core whose subtask takes longest. The weight of a state path is estimated from the following parameters:

1) The computational loads c_1, c_2, …, c_n of the n sub-operators after splitting, where c_i is computed from the type and scale of the i-th sub-operator;

2) The memory access data volumes d_1, d_2, …, d_n of the n sub-operators, where d_i is computed from the type and scale of the i-th sub-operator;

3) The computational throughput α of each accelerator core, which is determined by the performance parameters of the accelerator itself;

4) The memory access bandwidth β of each core. In general, the cores share a limited memory access bandwidth, so β = B/n, where B is the total bandwidth of the multi-core accelerator.

The weight of a state path is calculated as:

$$t = \max_{i=1,\dots,n}\left(\max\left(\frac{c_i}{\alpha},\ \frac{d_i}{\beta}\right)\right) \tag{1}$$

The inner maximum reflects the fact that the computation part and the memory access part of an operator implementation can hide each other, that is, computation and memory access can run concurrently as far as possible. On some accelerators, a sub-operator that is too small lowers the computational throughput of each core, and α can be corrected further to make the estimate more accurate. The outer maximum reflects the fact that the time the multi-core accelerator needs to complete an operator is determined by the core whose subtask takes longest.
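Equation (1) can be written out directly. The sketch below is a minimal illustration with hypothetical names; it takes β = B/n as stated above and omits the small-operator correction of α.

```python
def state_path_weight(compute_loads, memory_loads, alpha, total_bandwidth):
    """Estimate the weight t of a state path according to Equation (1).

    compute_loads:   c_1..c_n, computational load of each sub-operator
    memory_loads:    d_1..d_n, memory access volume of each sub-operator
    alpha:           computational throughput of one core
    total_bandwidth: B, shared by the n cores, so each core gets beta = B / n
    """
    n = len(compute_loads)
    beta = total_bandwidth / n
    # Inner max: compute and memory access overlap on a core.
    # Outer max: the slowest core determines the operator's latency.
    return max(max(c / alpha, d / beta)
               for c, d in zip(compute_loads, memory_loads))
```

For four sub-operators with c_i = 100 and d_i = 10, α = 50, and B = 40, the estimate is 2.0 and the split is compute-bound. Raising each d_i to 40 makes it memory-bound, and the estimate becomes 4.0.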

It should be noted that the above way of obtaining the weight of a state path is only one example and is not exhaustive. Those skilled in the art, having understood the essence of the technical solution of the present application, may produce other variations or transformations on its basis. For example, the weight of a state path may be measured not only by the time spent executing the subtasks but also by the throughput of executing them. Alternatively, the weight of a state path may be determined by actually measuring the time taken on the multi-core processor to execute all subtasks under the operator split corresponding to that state path. As long as the functions realized and the technical effects achieved are similar to those of the present application, such variations fall within the protection scope of the present application.

Step 203): Determine the target split path of the target layer according to the weights of the state paths.

In step 203, the weights of the state paths are used to determine the split path of the target layer in one of two ways. The first is forward traversal, whose steps include:

determining, according to the state path and the split path, the split path from the current state to the starting state of the input tensor data of the target layer;

The second is reverse traversal, whose steps include:

The following example describes in detail how the target split path between the split state set of the input tensor data of the target layer and the split state set of its output tensor data is obtained after all split state sets of the target layer have been traversed.

The neural network model shown in FIG. 3 is serial, and the input tensor data and output tensor data of the entire model are both in the unsplit state. To split the operators of the entire model in FIG. 3, a serial neural network model containing n operators is described as an operator sequence (op_1, op_2, ..., op_n). Assume that each operator has exactly one input and one output, and that the output of one operator is the input of the next. Then all tensor data, including the input and output tensor data of the entire model and the intermediate result tensors between the operators, forms the set (tensor_0, tensor_1, ..., tensor_n), where the input of op_i is tensor_{i-1} and its output is tensor_i. Each data tensor tensor_i has a corresponding state set S^i. The goal of the search strategy is to find a mapping tensor_i → s^i between each tensor and one state in its state set. Fixing a specific split state for every piece of tensor data in the model determines how every operator is split, so a mapping from all tensor data in a neural network model to their split states is called a splitting scheme P of that model. In the computation stage, the i-th operator op_i takes input data in split state s and computes output tensor data in split state r. The specific parallel computation is determined by the states of the input and output tensor data. The computation time of the operator is denoted t_{s→r}, and its value depends on the corresponding split and on the hardware characteristics of the underlying accelerator. The latency T of the entire network is then calculated as:

Each such split also has a corresponding time t_i, the time taken to execute the operator in parallel on the multi-core accelerator using that split. t_i can therefore be regarded as the weight of the directed edge from the state of the operator's input tensor data to the state of its output tensor data. The input tensor data and output tensor data of the entire neural network model each have only one state in their split state spaces: the unsplit state, which keeps the whole data block contiguous and complete. As a result, a splitting scheme P of the model starts from complete input data and ends with complete output data, and an external user always sees a complete input and a complete output. Searching for a good splitting scheme P for a given neural network model then amounts to finding a shortest path from the unsplit state of the input tensor data to the unsplit state of the output tensor data, where the path must pass through one state chosen from the valid state space of each intermediate result tensor. Equations 3 and 4 express this abstraction as formulas.

Note also that in FIG. 3 a single split state of the input tensor data can point to several split states of the output tensor data. This case does occur, and it is precisely what makes the splitting space of a neural network model so large.

In this technical solution, let the unsplit state of the input tensor data of the entire neural network model be the starting state s_root. Initially, the weight of the split path corresponding to s_root is 0, and the weights of the split paths corresponding to all states of all other tensor data are ∞. Every state s of every piece of tensor data in the model has a corresponding weight l_s, the weight of the split path from s_root to s. The split state sets are visited from front to back, and within each split state set every state s is traversed in turn. Each state s has directed edges e_1, …, e_{k_s} pointing to several split states in the next split state set. Taking a split state v in the next split state set as an example, Equation (1) is used to obtain the weight t_sv between state s and state v, and Equation (5) is used to update l_v, the weight of the split path from s_root to the state v that this state path points to.

$$l_v = \min(l_v,\ l_s + t_{sv}) \tag{5}$$

After all split state sets have been visited by forward traversal following the topology of the neural network model, the target split path from the unsplit state s_root of the input tensor data of the entire model to the unsplit state s_end of its output tensor data is obtained.

The above describes a path from the unsplit state s_root to the unsplit state s_end that passes through one state in each split state set; such a path is a split path of the neural network model. Among the split paths of the neural network model, the one with the smallest weight is selected as the target split path.
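For a serial model, the forward traversal with the update of Equation (5) can be sketched as follows. The function name, the list-of-lists representation of the split state sets, and the `edge_weight` callback (which returns the weight of a state path, or None where no valid state path exists) are assumptions made for this illustration.

```python
import math


def find_target_split_path(state_sets, edge_weight):
    """Find the minimum-weight split path through a serial model.

    state_sets[i] lists the split states of tensor_i; state_sets[0] holds the
    starting state(s) and state_sets[-1] the final state(s).
    edge_weight(i, s, v) gives the weight of the state path from state s of
    tensor_i to state v of tensor_{i+1}, or None if that path is invalid.
    """
    # l_s is 0 for the starting states and infinity for every other state.
    cost = [{s: math.inf for s in states} for states in state_sets]
    prev = [{s: None for s in states} for states in state_sets]
    for s in state_sets[0]:
        cost[0][s] = 0.0

    # Visit the split state sets from front to back.
    for i in range(len(state_sets) - 1):
        for s in state_sets[i]:
            for v in state_sets[i + 1]:
                t_sv = edge_weight(i, s, v)
                if t_sv is None:
                    continue
                if cost[i][s] + t_sv < cost[i + 1][v]:  # Equation (5)
                    cost[i + 1][v] = cost[i][s] + t_sv
                    prev[i + 1][v] = s

    # Pick the cheapest final state and backtrack to recover the path.
    end = min(state_sets[-1], key=lambda s: cost[-1][s])
    path = [end]
    for i in range(len(state_sets) - 1, 0, -1):
        path.append(prev[i][path[-1]])
    path.reverse()
    return cost[-1][end], path
```

Because every split state set is visited once and each state path is examined once, the cost grows with the number of state paths rather than with the number of complete split paths.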

It should be noted that the neural network model shown in FIG. 3 is a serial neural network model and that, for ease of explanation, the split state sets of both its input tensor data and its output tensor data contain only the unsplit state. When the split state set of the output tensor data of the neural network model is not just the unsplit state s_end but a set of several split states, the minimum among the split path weights of the split states in that set is selected, and the corresponding path is the target split path between the split state set of the input tensor data of the entire model and the split state set of its output tensor data.

Note that the whole scheme can equivalently be formulated as a search for a split path from the unsplit state s_end to the unsplit state s_root. Likewise, when the split state set of the input tensor data of the neural network model is not just the unsplit state s_root but a set of several split states, the minimum among the split path weights of the split states in that set is selected, and the corresponding path is the target split path between the split state set of the input tensor data of the entire model and the split state set of its output tensor data.

The neural network model shown in FIG. 3 is a serial neural network model, and the split state sets of its input tensor data and output tensor data contain only the unsplit state. In practical applications, the technical solutions shown in FIG. 2, FIG. 4, FIG. 7 and FIG. 9 must all solve the problem of keeping the splits of different branches consistent in a multi-branch neural network model. Operators located where branches meet have more than one piece of input tensor data, for example the element-wise addition operator (Add), the element-wise multiplication operator (Mult), and the concatenation operator (Concat). Consider an operator A with two inputs. After the operator has been visited, that is, after the split state sets of its input tensor data have been enumerated, the two pieces of input tensor data tensor_left and tensor_right have the corresponding split state sets S_left and S_right. The traversal then continues along the two branches starting from tensor_left and tensor_right. In one case, the two branches extend all the way to the end of the traversal, which means that the entire network has more than one piece of input data; this is uncommon in inference tasks. In the other case, the two branches merge at some operator. In either case, when the splitting scheme P is determined, mutually incompatible split states may be selected for the two pieces of input tensor data tensor_left and tensor_right of operator A. Specifically, suppose operator A is a binary element-wise addition operator. The backtracking process might select from the split state set of tensor_left a state that is split only along the C dimension, and from the split state set of tensor_right a state that is split only along the H dimension. The ways of splitting the addition operator itself that these two split states represent are inconsistent, which would make the entire splitting scheme P invalid.

To solve this problem, it is ensured that, before the traversal of operator A ends, the split state sets corresponding to tensor_left and tensor_right each contain only one split state, which makes the states selected from the two state sets during backtracking deterministic. Accordingly, in the forward traversal stage, when the output tensor data of an operator is used as input tensor data by at least two operators, or when the operator has at least two pieces of output tensor data, one split state is retained in the split state set of the operator's output tensor data, and that split state is determined via the same state path of the operator. In the reverse traversal stage, when an operator has at least two pieces of input tensor data, one split state is retained in the split state set of the operator's input tensor data, and that split state is determined via the same state path of the operator. In this way, before the traversal of a branch operator ends, the state with the smallest cumulative weight is selected from the split state sets of the multiple pieces of input data and retained, and the other split states are removed from those sets.
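The pruning at a branch point can be sketched as follows. This minimal illustration assumes that the cumulative split path weights of one split state set are held in a dictionary keyed by state; the function name is hypothetical, and the additional requirement that the retained states of all branches stem from the same state path of the operator is not modeled here.

```python
def keep_single_state(path_weights):
    """Reduce a split state set to the one state with the smallest cumulative weight.

    path_weights maps each split state to l_s, the weight of the best split
    path found so far that reaches that state.
    """
    best = min(path_weights, key=path_weights.get)
    return {best: path_weights[best]}
```

After this step, backtracking through the branch point has only one state to choose from, so the two branches cannot select inconsistent splits.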

It should be noted that the above way of obtaining the target split path is similar to the Viterbi algorithm. It is only one example and is not exhaustive. Those skilled in the art, having understood the essence of the technical solution of the present application, may produce other variations or transformations on its basis. For example, the weight of each split path between the split state set of the input tensor data of the neural network model and the split state set of its output tensor data is determined by the sum of the weights of the corresponding state paths. A threshold is set empirically, and any split path whose weight is below the threshold can be used as the target split path for splitting the neural network model. As long as the functions realized and the technical effects achieved are similar to those of the present application, such variations fall within the protection scope of the present application.

Step 204): Split the operator of the target layer of the neural network model using the target split path.

As the above description shows, the hardware resources of the multi-core processor chip are fully used by splitting the computation logic of the operators in the neural network into smaller subtasks and assigning them to multiple cores for parallel execution.

For the technical solution shown in FIG. 2, the ideal case is that the sub-operators produced by a split write their output tensor data to the corresponding positions in a storage space that holds the complete output tensor data, so that once all sub-operators of an operator have finished, the result is always one complete, contiguous data block. For some accelerators, however, this is not easy to achieve. First, the output tensor data of a split operator may occupy non-contiguous storage locations within the overall output, so the output code of the operator would have to be rewritten to write results back to the discrete locations that a sub-tensor occupies in storage. At the same time, accelerators usually rearrange the order of data in storage further to improve memory access efficiency during computation, which makes modifying the operator's output logic even more difficult and tedious. On the other hand, if the computation logic or splitting logic of the subsequent operator does not require its input data to be stored contiguously along a certain dimension, the data output by the previous layer, stored discretely along that dimension, can be used directly in the computation of the next layer without ensuring the contiguity of the output data.

Therefore, the framework separates the task of adjusting the split form of tensor data from the computing task of the operator and abstracts it into a new operator, called the glue operator. This separation avoids modifying the output logic of every operator and improves the portability of the framework across different underlying accelerators. The glue operator adjusts the sub-data blocks obtained by splitting a tensor in one way into the sub-data blocks formed by another splitting method. As shown in Table 1, the splitting methods allowed by different kinds of operators appear differently on their input tensor data and output tensor data. When the splitting method of the output tensor data of the previous-layer operator is not allowed by the next-layer operator, a glue operator is needed to adjust the splitting method of that tensor data, thereby "gluing" the preceding and following operators together. In addition, even if the splitting method of the previous layer's output is supported by the next layer, the glue operator can still adjust the split of the tensor data into a form more favorable to the computation of the next layer.

Table 1

Based on the above description, and building on FIG. 2, this embodiment further proposes another method for splitting a neural network model, as shown in FIG. 4. In addition to the steps of FIG. 2, the method further includes:

Step 201'): inserting a glue operator between the operator of the target layer and the associated split state set, and adjusting the states in the split state set of the tensor data of the operator; wherein the glue operator is used to adjust a state of the tensor data obtained by one splitting method into a state obtained by any splitting method.

In this step, the glue operator represents the act of adjusting the split state of tensor data. The computation scale of each layer of the neural network model changes continuously as the network extends, and as the splitting trend of the model changes, the splitting method of the operators must be adjusted accordingly, that is, the state of the intermediate results must be adjusted. As shown in FIG. 5, a glue operator is added between Op_2 of FIG. 3 and its input Tensor1, which can convert any split state of the tensor data into another split state. For a glue operator, the input tensor data and the output tensor data have the same shape and the same state space, and from any split state of the input tensor data there are directed edges to all split states of the output tensor data, so a fully connected mesh is formed between the split state set of the input tensor data and the split state set of the output tensor data. Any split state of the input tensor data can therefore be converted into another split state before operator Op_2. This introduces into the search space of the splitting scheme P the possibility of adjusting the split state of each operator's input tensor data before its computation begins, that is, of adjusting the splitting method of the operator itself.

It should be noted that FIG. 5 shows a glue operator inserted between an operator and its input tensor data. A glue operator may also be inserted between an operator and its output tensor data, or both between the operator and its input tensor data and between the operator and its output tensor data. These are only examples and not an exhaustive list. Those skilled in the art who understand the essence of the technical solution of the present application may derive other variations or transformations from it, but as long as the functions realized and the technical effects achieved are similar to those of the present application, they shall fall within the protection scope of the present application.

Inserting a glue operator between the operator of the target layer of the neural network model and the associated split state set allows the splitting method of the operator to be adjusted, but the adjustment itself introduces additional overhead. The question is therefore how to insert glue operators appropriately across the whole neural network model so as to improve its performance. To address this, glue operators are inserted between the operators of the target layer of the neural network model and the associated split state sets, and a directed acyclic graph of the neural network model including the glue operators is obtained; the split state sets corresponding to all tensor data of the target layer are traversed according to the directed acyclic graph, and the state paths between adjacent split state sets and the weights of the state paths are determined; the split path of the target layer of the neural network model including the glue operators is determined according to the weights of the state paths; and this split path is used to select among the inserted glue operators, deleting those that are not needed and retaining those that are.

Internally, a glue operator is implemented in one of four ways: split-concatenate, concatenate-split, concatenate only, or split only. In the concatenation stage, sub-data blocks that are adjacent along any dimension can be concatenated into a new data block; in the splitting stage, any sub-data block can be split into two smaller sub-data blocks. Any split form can be converted into any other split form through such a two-stage process. To illustrate, suppose the data is one-dimensional. The split form before adjustment is {(0, p_1), (p_1, p_2), …, (p_{n-1}, end)}, where each pair denotes one sub-segment of the split one-dimensional data, and the split form after adjustment is {(0, q_1), (q_1, q_2), …, (q_{m-1}, end)}. If two adjacent segments before adjustment, (p_{i-1}, p_i) and (p_i, p_{i+1}), together form one segment (q_j, q_{j+1}) after adjustment, that is, p_{i-1} = q_j and p_{i+1} = q_{j+1}, then adjusting this part only requires concatenating (p_{i-1}, p_i) and (p_i, p_{i+1}) in the concatenation stage, and the splitting stage is skipped. Conversely, if one sub-segment before adjustment is the union of several sub-segments after adjustment, the concatenation stage is skipped and the corresponding split is performed in the splitting stage. In the worst case, the concatenation stage combines all the data into one complete one-dimensional block, and the corresponding split is then performed in the splitting stage.
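The one-dimensional case above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patent's implementation: a split form is represented by its interior cut points, and the helper names `plan_glue`, `segments`, and `apply_glue` are hypothetical.

```python
def plan_glue(before, after):
    """before/after: interior cut points of a 1-D range (hypothetical encoding).
    Returns (cuts erased in the concatenation stage, cuts created in the splitting stage)."""
    before, after = set(before), set(after)
    concat_cuts = sorted(before - after)  # boundaries that disappear: neighbours are joined
    split_cuts = sorted(after - before)   # boundaries that appear: a block is cut
    return concat_cuts, split_cuts


def segments(cuts, end):
    """Turn a collection of cut points into a list of (start, stop) segments over [0, end)."""
    points = [0] + sorted(cuts) + [end]
    return list(zip(points[:-1], points[1:]))


def apply_glue(before, after, end):
    """Return the segments after the concatenation stage and after the splitting stage."""
    concat_cuts, split_cuts = plan_glue(before, after)
    kept = set(before) - set(concat_cuts)
    merged = segments(kept, end)                   # result of the concatenation stage
    final = segments(kept | set(split_cuts), end)  # result of the splitting stage
    return merged, final


# Splitting stage skipped: (0,4) and (4,8) are simply joined.
print(plan_glue([4, 8], [8]))
# Concatenation stage skipped: each existing segment is only cut further.
print(plan_glue([6], [3, 6, 9]))
# Worst case: everything is joined first, then cut again.
print(apply_glue([4, 8], [6], 12))
```

An empty first list means the concatenation stage can be skipped, and an empty second list means the splitting stage can be skipped, which corresponds to the concatenate-only and split-only variants.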

Taking a glue operator that uses split-concatenate or concatenate-split as an example, assume that the total size of the tensor data to be adjusted is M, that neither stage can be skipped, and that each stage must concatenate or split along 4 dimensions. For portability, concatenation and splitting are usually implemented with the concatenation operator (Concat) and the split operator (Slice) already provided by the neural network algorithm. Because each of these operators can process only one dimension at a time, the whole glue operator incurs, in the worst case, 8M of memory read/write overhead. A balance must therefore be found between adjusting the split state and the additional overhead this introduces, so that the splitting method of operators is adjusted at reasonable places, consistent with the regularities of the network structure, while introducing as few glue operators as possible.
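The 8M figure can be read as a simple product. The following is a hedged reconstruction of the arithmetic, assuming that each single-dimension Concat or Slice pass moves the whole tensor once and therefore costs on the order of M of memory traffic; the text itself does not spell out this accounting.

```latex
\underbrace{2}_{\text{stages}} \times \underbrace{4}_{\text{dimensions per stage}} \times M = 8M
```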

In more detail, glue operators are treated the same as ordinary neural network operators: each adjustment that a glue operator makes to the split state of tensor data has a corresponding time t, which serves as the weight of the corresponding state path. Formula (5) is still used to obtain the target split path from the unsplit state s_root of the input tensor data of the target layer of the neural network model, glue operators included, to the unsplit state s_end of the output tensor data of the neural network model. When selecting glue operators, the split state of the input tensor data and the split state of the output tensor data of each glue operator on the split path are checked. If the two split states are the same, for example when split state status_1 in split state set Tensor_1 in FIG. 5 is connected by a state path to split state status_1 in split state set Tensor_1', then the split path P of the target layer does not need to adjust the split state of the input tensor data of operator Op_2, a result that reflects joint consideration of the preceding and following operators and of overall performance. The glue operator inserted between operator Op_2 and its input tensor data is then removed from the network. Otherwise, the inserted glue operator is retained.
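The selection rule can be sketched as follows. This is a minimal illustration rather than the patent's code: the function name `prune_glue`, the dictionary layout, and the glue operator names are assumptions introduced for the example.

```python
def prune_glue(path_states):
    """path_states maps a glue operator name to the (input_state, output_state)
    pair that the chosen split path assigns to it (hypothetical layout).
    Returns (glue operators to keep, glue operators to remove)."""
    keep, remove = [], []
    for name, (s_in, s_out) in path_states.items():
        # Identical states on both sides mean the glue operator changes nothing.
        (remove if s_in == s_out else keep).append(name)
    return keep, remove


chosen = {
    "glue_before_Op_2": ("status_1", "status_1"),  # no adjustment needed
    "glue_before_Op_3": ("status_1", "status_2"),  # adjusts the split state
}
keep, remove = prune_glue(chosen)
print(keep, remove)
```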

It should be noted that glue operators are implemented using operators already present in the neural network model. The concatenation stage corresponds to the Concat operator and the splitting stage corresponds to the Slice operator, so any accelerator that already supports these two operators can implement the glue operator quickly. Furthermore, in this embodiment the above way of obtaining the target split path is similar to the Viterbi algorithm. This is only one example and not an exhaustive list; those skilled in the art who understand the essence of the technical solution of the present application may derive other variations or transformations from it. For example, the weight of each split path between the split state set of the input tensor data of the neural network model and the split state set of its output tensor data is determined by the sum of the weights of the corresponding state paths, a threshold is set empirically, and any split path whose weight is less than the threshold can be used as the target split path for splitting the neural network model. As long as the functions realized and the technical effects achieved are similar to those of the present application, such variations shall fall within the protection scope of the present application.

It should be emphasized that the content of the technical solution on operator splitting shown in FIG. 2 also applies to the technical solution shown in FIG. 4 and is not repeated here.

In a neural network model, the convolution operator is a rather special operator that in some cases needs an additional auxiliary operator to complete the splitting task. When the computation is partitioned along the H/W dimensions of the input tensor data, if the size of the convolution kernel window exceeds its stride, that is, kernel > stride, then during computation the window of a split convolution operator, on reaching the edge, extends beyond the boundary of its sub-tensor data, and the missing data lies in the adjacent sub-tensor data. To handle this overlap of input tensor data between subtasks while preserving portability, the behavior of accessing the boundary data of adjacent sub-tensor data is likewise separated out into a new auxiliary operator, called the compensation operator.

As shown in FIG. 6, the compensation operator obtains target data from the sub-tensor data adjacent to a given sub-tensor data and merges the target data with that sub-tensor data into a larger block, so that the window no longer moves beyond the boundary of the compensated block during the computation stage. Besides the convolution operator, the pooling operator and the now less common Local Response Normalization (LRN) operator also have split subtasks that depend on data in adjacent data blocks. The pooling operator is similar to the convolution operator: the dependence arises mainly because the pooling window is larger than its stride. The LRN operator is different: to compute the result at one point of the output tensor data along the C dimension, it needs the value of the corresponding point of the input tensor data along the C dimension together with the values of the k/2 adjacent points on either side. Therefore, if the computation of an LRN operator is split along the C dimension into multiple LRN operators, each new operator likewise needs element data from adjacent sub-tensor data to compute the values located on its boundary in the C dimension.
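The effect of compensation can be illustrated in one dimension. The following is a minimal Python sketch under stated assumptions (stride 1, odd kernel size, zero padding at the outer edge of the tensor); the helper names `conv1d_valid`, `pad_same`, `compensate`, and `split_conv` are hypothetical and do not come from the patent.

```python
def conv1d_valid(x, w):
    """Stride-1 sliding-window sum of products; output length is len(x) - len(w) + 1."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


def pad_same(x, k):
    """Zero-pad so a stride-1 window of odd size k yields len(x) outputs."""
    h = k // 2
    return [0] * h + list(x) + [0] * h


def compensate(padded, start, stop, k):
    """Return the data for output positions [start, stop): the sub-block itself
    plus k // 2 elements on each side taken from the neighbouring data."""
    return padded[start : stop + k - 1]


def split_conv(x, w, parts):
    """Split the convolution into `parts` subtasks, compensating each input block."""
    k, n = len(w), len(x)
    padded = pad_same(x, k)
    bounds = [n * p // parts for p in range(parts + 1)]
    out = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        # Each subtask is an unmodified convolution on its compensated block.
        out += conv1d_valid(compensate(padded, a, b, k), w)
    return out


x = list(range(1, 11))
w = [1, 2, 1]
unsplit = conv1d_valid(pad_same(x, len(w)), w)
print(split_conv(x, w, 4) == unsplit)
```

The split subtasks call the same convolution routine as the unsplit operator; only the input block handed to each subtask differs, which is the point of moving the boundary access into a separate operator.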

Operators can be grouped into three categories according to the range of input tensor data along a given dimension that is needed to compute one element of the output tensor data along that dimension. The first category is point-to-point operators: computing one data point of the output tensor data requires only the value of the corresponding data point of the input tensor. This category includes activation operators (Relu, pRelu), the batch normalization operator (BatchNorm), and the basic element-wise addition, subtraction, multiplication, and division operators (Add, Sub, Mult, Div). Such operators can be split along any dimension, and each resulting sub-operator needs only its corresponding sub-tensor data as input in the computation stage. The second category is fully dependent operators: computing one data point of the output tensor data requires all the data of the input tensor data along that dimension. For example, the convolution operator and the fully connected operator need all points of the input tensor data along the C dimension to compute one point of the output tensor data along the C dimension. Although a convolution operator can still be split along the input C dimension by accumulating partial sums afterwards, splitting is difficult when the operator's computing logic along that dimension is more complex, as with the normalized exponential operator (Softmax); formula (6) gives its computation along the normalized dimension.

Here, I is the vector of the input tensor data along the normalized dimension, and O is the vector of the output tensor data along the normalized dimension. Unlike the partial-sum accumulation of convolution, the computing logic here is more complex and splitting is difficult to achieve. From this point of view, the compensation operator handles a third case that lies between point-to-point operators and fully dependent operators: computing one point of the output tensor data requires the data of the input tensor data within a region near the corresponding position. The region near the corresponding position is determined by the compensation parameters. In this case the operators can in fact still be split in terms of computing logic, although they depend on data outside their own sub-tensor data, and the compensation operator solves this problem in a uniform way.
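Formula (6) is not reproduced in this text. For reference only, the textbook definition of softmax over the normalized dimension is given below, with I and O as defined above; this is an assumption about the intended formula, not a quotation of it, and the patent's own form may differ (for example by subtracting the maximum of I for numerical stability).

```latex
O_i = \frac{e^{I_i}}{\sum_{j} e^{I_j}}
```

Every output element depends on the sum over all input elements along the dimension, which is why this operator falls into the fully dependent category.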

Based on this, FIG. 7 shows a third flowchart of a method for splitting a neural network model provided by an embodiment of the present disclosure.

In addition to the steps of FIG. 2, the method further includes:

Step 201"): inserting a compensation operator between the operator of the target layer and the associated split state set, and adjusting the states in the split state set of the input tensor data of the operator; wherein the compensation operator is used to obtain target data from the sub-tensor data adjacent to any sub-tensor data of the state, and to merge the target data with that sub-tensor data.

In this technical solution, the window of a convolution operator or pooling operator can exceed the boundary of the input sub-tensor data when the task is split along the H/W dimensions, because the window is larger than the stride. To solve this problem, the framework introduces a compensation operator which, before the computation starts, supplements each sub-tensor data in a set of sub-tensor data with the surrounding elements located in the adjacent sub-tensor data. This avoids modifying the computing logic of the split convolution or pooling operator itself and makes the dependence on adjacent sub-tensor data invisible to those operators, which favors rapid implementation of the system and keeps behavior consistent across accelerators of different architectures. However, the compensation operator itself introduces additional overhead. Assuming the size of the original data block is M, and ignoring the overlap between sub-tensor data after compensation, the compensation operator introduces 2M of memory-access overhead.

Convolution operators and pooling operators are the main operators that make up neural networks, image classification networks in particular. To reduce the overhead of compensation, the inserted compensation operators are merged using a pyramid-shaped structure. As shown in FIG. 8, the neural network model is a serial sequence of two convolution operators, each split along the H/W dimensions into 4 smaller convolution operators; the N and C dimensions of the data are omitted from the figure. Assume the kernel sizes of the two convolution operators are k_1 and k_2 respectively and, to simplify the calculation, that both strides are 1. Normally, the width of data that convolution operator Conv1 compensates around each sub-tensor data before computation is k_1/2, which ensures that the kernel of the split convolution task does not exceed the boundary of the input sub-tensor data. Here, however, Conv1 compensates a width of k_1/2 + k_2/2 around each sub-tensor data before computation, so that the sub-tensor data of its output Tensor1 overlap one another by a width of k_2, and convolution operator Conv2 therefore does not need to compensate its input sub-tensor data before its computation starts.
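The merged compensation for two stacked convolutions can be checked numerically in one dimension. This is a hedged sketch, not the patent's implementation: it assumes stride 1 and, to avoid edge-padding details, unpadded ("valid") convolutions; `conv1d_valid` and `fused_split` are hypothetical names.

```python
def conv1d_valid(x, w):
    """Stride-1 sliding-window sum of products; output length is len(x) - len(w) + 1."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


def fused_split(x, w1, w2, parts):
    """Run Conv1 then Conv2 as `parts` independent subtasks, compensating only once."""
    k1, k2 = len(w1), len(w2)
    n_out = len(x) - (k1 - 1) - (k2 - 1)          # length of the final output
    bounds = [n_out * p // parts for p in range(parts + 1)]
    out = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        # Single, enlarged compensation: extra (k1 - 1) + (k2 - 1) input elements.
        block = x[a : b + (k1 - 1) + (k2 - 1)]
        t1 = conv1d_valid(block, w1)               # neighbouring t1 blocks overlap by k2 - 1
        out += conv1d_valid(t1, w2)                # Conv2 needs no further compensation
    return out


x = list(range(20))
w1, w2 = [1, 0, -1], [1, 2, 3, 2, 1]
reference = conv1d_valid(conv1d_valid(x, w1), w2)
print(fused_split(x, w1, w2, 4) == reference)
```

In this sketch each subtask computes b - a + k2 - 1 outputs of the first convolution, so every additional subtask recomputes k2 - 1 intermediate values that its neighbour also computes; this is the redundant computation that is traded against the saved compensation pass.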

In this way, the multiple compensation operators used in a serial operator sequence can be merged into a single one at the top. Although this increases the memory-access overhead of the first compensation, when the compensation width is much smaller than the size of the sub-data block itself it effectively reduces the memory-access overhead of compensation operators after the model is split. On the other hand, this method introduces repeated computation: in FIG. 8, the results in the overlapping parts of the sub-tensor data of Tensor1, the output tensor data of convolution operator Conv1, are computed in several of the split convolution operators. In addition, for convolutional networks whose input feature maps are small, the condition that the compensation width is much smaller than the sub-tensor data itself no longer holds, and the change in total memory-access overhead before and after merging multiple compensation operators must be evaluated more carefully.

To solve this problem, the merging of compensation operators is likewise added to the search space of the splitting scheme. The whole traversal changes from a forward traversal to a reverse traversal. The two approaches are equivalent in principle, but the latter is better suited to the search strategy once merging of compensation operators has been introduced. Let the unsplit state of the output tensor data of the entire neural network model be the end state s_end. Any state s of any tensor data in the neural network model has a corresponding weight l_s, the weight of the split path from s to s_end. Before the traversal starts, the weight corresponding to s_end is 0 and the weights of all states of all other tensor data are ∞. Each operator is traversed in reverse according to the topology of the neural network model. When enumerating the possible split states of the input tensor data from the split states of the output tensor data, the traversal enumerates not only the normal split states, in which the sub-tensor data do not overlap and a compensation process must be introduced, but also the input split states in which the sub-tensor data overlap and no compensation is needed. The weight of a state path corresponding to the latter accounts for the time added by the redundant computation. Formula (5) is still used to obtain the target split path from the unsplit state s_root of the input tensor data of the target layer of the neural network model, compensation operators included, to the unsplit state s_end of the output tensor data of the neural network model.
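The reverse traversal can be sketched as a backward relaxation over state paths. The following is a simplified illustration under stated assumptions: the model is a chain of single-input, single-output operators, each operator is given as a list of hypothetical `(input_state, output_state, time)` state paths, and the state names and times are invented for the example. It is not the patent's formula (5).

```python
import math


def backward_search(ops, end_state):
    """ops: operators in topological order, each a list of
    (input_state, output_state, time) state paths.
    Returns (weight, next_hop): weight[s] is the smallest total time from s to
    end_state, and next_hop[s] is the following state on that path."""
    weight = {end_state: 0.0}   # every state not present has weight infinity
    next_hop = {}
    for edges in reversed(ops):  # visit operators in reverse topological order
        for s_in, s_out, t in edges:
            candidate = t + weight.get(s_out, math.inf)
            if candidate < weight.get(s_in, math.inf):
                weight[s_in] = candidate
                next_hop[s_in] = s_out
    return weight, next_hop


def extract_path(start, end_state, next_hop):
    """Follow next_hop from start until end_state is reached."""
    path = [start]
    while path[-1] != end_state:
        path.append(next_hop[path[-1]])
    return path


ops = [
    [("s_root", "t1_split_H", 4), ("s_root", "t1_split_C", 1)],
    [("t1_split_H", "s_end", 1), ("t1_split_C", "s_end", 5)],
]
weight, next_hop = backward_search(ops, "s_end")
print(weight["s_root"], extract_path("s_root", "s_end", next_hop))
```

A split state with overlapping sub-tensor data would simply appear as one more state whose outgoing state path carries a larger time to account for the redundant computation.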

For compensation operators, merging the multiple compensation operators inserted in the neural network model using the pyramid structure may yield one merged compensation operator or several merged compensation operators. In either case, the number of compensation operators after merging is smaller than the number before merging.

It should be emphasized that the content of the technical solution on operator splitting shown in FIG. 2 also applies to the technical solution shown in FIG. 7 and is not repeated here.

FIG. 9 shows a fourth flowchart of a method for splitting a neural network model provided by an embodiment of the present disclosure. Both the glue operator and the compensation operator are introduced into the operator splitting scheme, and the splitting method then includes:

Step a): determining, according to the operator of the target layer in the neural network model, the split state sets of the tensor data associated with the operator of the target layer; wherein the target layer is at least one layer of the neural network model;

Step b): inserting a glue operator between the operator of the target layer and the associated split state set, and adjusting the states in the split state set of the tensor data of the operator; wherein the glue operator is used to adjust a state in the split state set of the tensor data into any split state of the tensor data;

Step c): inserting a compensation operator between the operator of the target layer and the associated split state set, and adjusting the states in the split state set of the input tensor data of the operator; wherein the compensation operator is used to obtain target data from the sub-tensor data adjacent to any sub-tensor data of the state, and to merge the target data with that sub-tensor data;

Step d): traversing the split state sets according to the directed acyclic graph of the neural network model, and determining the state paths between adjacent split state sets and the weights of the state paths; wherein a state path represents a splitting method of the operator, each state in a split state set represents a set of sub-tensor data, and the union of all sub-tensor data of a state is the tensor data;

Step e): determining the target split path of the target layer according to the weights of the state paths;

Step f): splitting the operator of the target layer of the neural network model using the target split path.

A glue operator is inserted between each operator of the neural network model and its input tensor data, and a glue operator is inserted between the output tensor data of the neural network model and the operator that produces it. For each tensor data tensor_i in the neural network model, its state set S_i is initialized. The state set stores each state together with the shortest time needed to execute from that split state of the data to the output state s_root of the network's final output data; the pair is denoted (s, t). The state set S_root corresponding to the output tensor data of the entire neural network model contains the unsplit state of that data and its shortest time, (s_root, 0); all other sets are empty. For a given neural network model, a topological order λ is assigned to all operators according to their mutual dependencies. The topological order must satisfy the following: for any operator A, all operators that depend on A come after A in the order, and all operators on which A depends come before A.
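A topological order with this property can be obtained with the Python standard library. The sketch below uses `graphlib.TopologicalSorter` (Python 3.9 or later); the operator names and the dependency table are hypothetical, and the wrapper `topo_order` is introduced only for the example.

```python
from graphlib import TopologicalSorter


def topo_order(deps):
    """deps maps each operator to the set of operators it depends on.
    Returns one valid topological order (there may be several)."""
    return list(TopologicalSorter(deps).static_order())


# Hypothetical four-operator model: two branches from conv1 joined by add.
deps = {
    "conv1": set(),
    "conv2": {"conv1"},
    "relu": {"conv1"},
    "add": {"conv2", "relu"},
}
order = topo_order(deps)
position = {op: i for i, op in enumerate(order)}

# Every operator appears after all operators it depends on.
print(all(position[d] < position[op] for op in deps for d in deps[op]))

# The reverse traversal visits operators in the inverse of this order.
print(list(reversed(order)))
```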

To account for the insertion of compensation operators, the split state sets of each operator of the neural network model are traversed in reverse. In the reverse traversal stage, the operators of the neural network model are visited one by one in the inverse order of λ. For an operator A with m inputs and n outputs, there are input tensor data u_1, …, u_m and output tensor data v_1, …, v_n. The content of the technical solution on operator splitting in a neural network model shown in FIG. 2 applies to the technical solution shown in FIG. 9; the content on glue operators in the neural network model with inserted glue operators shown in FIG. 4 applies to the technical solution shown in FIG. 9; and the content on compensation operators in the neural network model with inserted compensation operators shown in FIG. 7 applies to the technical solution shown in FIG. 9. None of these is repeated here. The time complexity of the reverse traversal is O(NM²), where N is the number of operators in the neural network model and M is the number of states in the largest split state set among the split state sets of all tensor data.

It should be emphasized that the content on operator splitting shown in FIG. 2 applies to the technical solution shown in FIG. 9; the content on glue operators in the glue-operator-based operator splitting solution shown in FIG. 4 applies to the technical solution shown in FIG. 9; and the content on compensation operators in the compensation-operator-based operator splitting solution shown in FIG. 7 applies to the technical solution shown in FIG. 9. None of this is repeated here.

The technical solutions shown in FIG. 2, FIG. 4, FIG. 7 and FIG. 9 make full use of the hardware resources of a multi-core system by splitting each operator of the target layer in the neural network model into smaller subtasks and distributing them across multiple cores for parallel execution. In the technical solutions shown in FIG. 4, FIG. 7 and FIG. 9, introducing glue operators or compensation operators ensures that the split computation graph of the neural network model can still be implemented with the single-core operator kernel functions. This spares the framework from having to modify or re-implement the software stack for a large number of operators when it is ported to an underlying accelerator, which makes the approach friendlier to accelerators that lack good programmability. The framework can automatically generate an efficient splitting scheme for a given neural network and multi-core accelerator. While generating the scheme, it can adjust how each operator is split according to the operator's type and scale together with the computing throughput and memory access bandwidth of the underlying hardware, striking a good balance between the computing efficiency of the hardware cores and the degree of splitting of the operator itself. It also takes into account how neighboring operators cooperate in their splitting, and plans the splitting choices of multiple operators as a whole.
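The balance between per-core efficiency and splitting degree can be illustrated with a rough cost estimate. This is only a sketch under stated assumptions and not the cost model of the disclosure: it assumes an operator split along a data dimension, so that each core performs an equal share of the arithmetic and reads an equal share of the activations but needs a full copy of the weights, and it assumes that all cores share one memory bandwidth. The function names and all numeric values are hypothetical.

```python
def estimated_time(flops, weight_bytes, act_bytes, parts, peak_flops, bandwidth):
    """Estimate the time of an operator split into `parts` sub-operators.

    Assumptions (illustrative only):
      - compute is divided evenly, each core running at `peak_flops`;
      - every sub-operator reads the full weights, activations are divided;
      - memory bandwidth is shared, so memory traffic of all cores adds up.
    The slower of compute and memory access bounds the execution time.
    """
    compute_time = (flops / parts) / peak_flops
    memory_time = (parts * weight_bytes + act_bytes) / bandwidth
    return max(compute_time, memory_time)


def best_split(flops, weight_bytes, act_bytes, peak_flops, bandwidth,
               candidates=(1, 2, 4, 8)):
    """Pick the number of sub-operators with the smallest estimated time."""
    return min(
        candidates,
        key=lambda parts: estimated_time(
            flops, weight_bytes, act_bytes, parts, peak_flops, bandwidth
        ),
    )


# Hypothetical operator and hardware parameters.
best_parts = best_split(
    flops=8e9,          # arithmetic operations in the whole operator
    weight_bytes=1e8,   # bytes of weights read by each sub-operator
    act_bytes=1e8,      # bytes of activations, divided among sub-operators
    peak_flops=1e12,    # per-core compute throughput
    bandwidth=1e11,     # shared memory access bandwidth
)
```

With these illustrative numbers, splitting into two parts halves the compute time, while splitting further makes the repeated weight reads dominate, so the estimate favors a moderate splitting degree over the maximum one.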

The present disclosure provides a neural network model splitting apparatus, comprising:

Preferably, the apparatus further comprises:

FIG. 10 is a schematic diagram of a neural network model splitting hardware device according to an embodiment of the present application. The device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the neural network model splitting method described above is implemented.

For the neural network model splitting hardware device provided by the embodiments of this specification, the specific functions implemented by its memory and processor can be understood by reference to the foregoing embodiments in this specification and achieve the technical effects of those embodiments, so they are not described again here.

In this embodiment, the memory may include a physical device for storing information, which usually digitizes the information and then stores it in a medium using electrical, magnetic, or optical means. The memory described in this embodiment may further include: devices that store information using electrical energy, such as RAM and ROM; devices that store information using magnetic energy, such as hard disks, floppy disks, magnetic tapes, magnetic core memory, magnetic bubble memory, and USB flash drives; and devices that store information optically, such as CDs or DVDs. There are also other kinds of memory, such as quantum memory and graphene memory.

In this embodiment, the processor may be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so on.

An embodiment of the present application further provides a readable storage medium on which a computer program is stored. When the computer program is executed, the neural network model splitting method described above is implemented.

As can be seen from the above, the technical solution of the present disclosure can extend a deep learning accelerator from a single-core to a multi-core architecture with relatively low overhead, and can provide an efficient splitting scheme tailored to the characteristics of a given network and underlying accelerator. Experimental results show that the solution effectively reduces the end-to-end latency of various networks on multi-core accelerators.

Those skilled in the art also know that, besides implementing the client and the server purely as computer-readable program code, the method steps can be logically programmed so that the client and the server achieve the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a client and server can therefore be regarded as hardware components, and the means included in them for implementing various functions can be regarded as structures within the hardware components. The means for implementing various functions can even be regarded both as software modules implementing the method and as structures within the hardware components.

From the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application or in certain parts of the embodiments.

The embodiments in this specification are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the embodiments of the client and the server can be understood by reference to the description of the foregoing method embodiments.

The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

Although the present application has been described by way of embodiments, those of ordinary skill in the art will recognize that it admits many modifications and variations without departing from its spirit, and it is intended that the appended claims cover such modifications and variations without departing from the spirit of the present application.

Claims

1. A neural network model splitting method, characterized in that the method comprises:

determining, according to an operator of a target layer in the neural network model, a split state set of the tensor data associated with the operator of the target layer; wherein the target layer is at least one layer in the neural network model;

traversing the split state sets according to the directed acyclic graph of the neural network model, and determining the state paths between adjacent split state sets and the weights of the state paths; wherein a state path represents a splitting manner of the operator, each state in a split state set represents a set of sub-tensor data, and the union of all sub-tensor data of the state is the tensor data;

determining a target splitting path of the target layer according to the weights of the state paths;

splitting the operator of the target layer of the neural network model using the target splitting path.

2. The method of claim 1, wherein the step of determining the target splitting path of the target layer comprises:

traversing all split state sets of the target layer and, for the current split state set, traversing each state to obtain all state paths pointing to the current state and the splitting path from the starting state of each such state path to the starting state of the input tensor data of the target layer;

determining the splitting path from the current state to the starting state of the input tensor data of the target layer according to the weight of the state path and the weight of the splitting path; wherein the weight of the splitting path is determined according to the weights of all state paths corresponding to the splitting path;

after all split state sets of the target layer have been traversed, obtaining the target splitting path between the split state set of the input tensor data of the target layer and the split state set of the output tensor data of the target layer.

3. The method of claim 1, wherein the step of determining the target splitting path of the target layer comprises:

traversing all split state sets of the target layer and, for the current split state set, traversing each state to obtain all state paths starting from the current state and the splitting path from the end state of each such state path to the terminal state of the output tensor data of the target layer;

determining the splitting path from the current state to the terminal state of the output tensor data of the target layer according to the weight of the state path and the weight of the splitting path; wherein the weight of the splitting path is determined according to the weights of all state paths corresponding to the splitting path;

4. The method according to claim 1, wherein the number of sub-operators obtained after the operator of the target layer of the neural network model is split is an integer power of 2.

5. The method according to claim 1, wherein the states in the split state set of the input tensor data of the operator of the target layer of the neural network model are determined according to the computation logic of the operator and the states in the split state set of the corresponding output tensor data.

6. The method according to claim 1, wherein the states in the split state set of the output tensor data of the operator of the target layer of the neural network model are determined according to the computation logic of the operator and the states in the split state set of the corresponding input tensor data.

7. The method of claim 1, further comprising:

in the forward traversal stage, when the output tensor data of the operator is used as input tensor data by at least two operators, or the operator has at least two output tensor data, retaining one split state in the split state set of the output tensor data of the operator, the split state being determined via the same state path of the operator.

8. The method of claim 1, further comprising:

in the reverse traversal stage, when the operator has at least two input tensor data, retaining one split state in the split state set of the input tensor data of the operator, the split state being determined via the same state path of the operator.

9. The method of claim 1, wherein the weight of the state path is determined according to the type and scale of the operator and the hardware parameters of the multi-core processor.

10. A neural network model splitting apparatus, characterized in that it comprises:

a split state set module, configured to determine, according to an operator of a target layer in the neural network model, a split state set of the tensor data associated with the operator of the target layer; wherein the target layer is at least one layer in the neural network model;

a state path module, configured to traverse the split state sets according to the directed acyclic graph of the neural network model, and to determine the state paths between adjacent split state sets and the weights of the state paths; wherein a state path represents a splitting manner of the operator, each state in a split state set represents a set of sub-tensor data, and the union of all sub-tensor data of the state is the tensor data;

a target splitting path module, configured to determine the target splitting path of the target layer according to the weights of the state paths;

a splitting module, configured to split the operator of the target layer of the neural network model using the target splitting path.

11. The apparatus of claim 10, wherein the target splitting path module comprises:

a first traversal unit, configured to traverse all split state sets of the target layer and, for the current split state set, traverse each state to obtain all state paths pointing to the current state and the splitting path from the starting state of each such state path to the starting state of the input tensor data of the target layer;

a first splitting path determination unit, configured to determine the splitting path from the current state to the starting state of the input tensor data of the target layer according to the weight of the state path and the weight of the splitting path; wherein the weight of the splitting path is determined according to the weights of all state paths corresponding to the splitting path;

a first target splitting path selection unit, configured to obtain, after all split state sets of the target layer have been traversed, the target splitting path between the split state set of the input tensor data of the target layer and the split state set of the output tensor data of the target layer.

12. The apparatus of claim 10, wherein the target splitting path module comprises:

a second traversal unit, configured to traverse all split state sets of the target layer and, for the current split state set, traverse each state to obtain all state paths starting from the current state and the splitting path from the end state of each such state path to the terminal state of the output tensor data of the target layer;

a second splitting path determination unit, configured to determine the splitting path from the current state to the terminal state of the output tensor data of the target layer according to the weight of the state path and the weight of the splitting path; wherein the weight of the splitting path is determined according to the weights of all state paths corresponding to the splitting path;

a second target splitting path selection unit, configured to obtain, after all split state sets of the target layer have been traversed, the target splitting path between the split state set of the input tensor data of the target layer and the split state set of the output tensor data of the target layer.

13. The apparatus of claim 10, further comprising:

a first split state set optimization module, configured to, in the forward traversal stage, when the output tensor data of the operator is used as input tensor data by at least two operators, or the operator has at least two output tensor data, retain one split state in the split state set of the output tensor data of the operator, the split state being determined via the same state path of the operator.

14. The apparatus of claim 10, further comprising:

a second split state set optimization module, configured to, in the reverse traversal stage, when the operator has at least two input tensor data, retain one split state in the split state set of the input tensor data of the operator, the split state being determined via the same state path of the operator.

15. A neural network model splitting hardware device, comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1-9.

16. A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 9 are implemented.