CN121399572A

CN121399572A - Multi-precision tensor multiplication in neural networks

Info

Publication number: CN121399572A
Application number: CN202380099746.XA
Authority: CN
Inventors: 孟晨; 何普江; 汪彬; 周姗
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2023-11-01
Filing date: 2023-11-01
Publication date: 2026-01-23
Also published as: WO2025091335A1

Abstract

Tensor multiplication operations in deep neural networks (DNNs) can be performed at multiple precision levels. Different precision levels can be selected for different DNN layers to achieve the desired accuracy of the DNN while minimizing computational resource consumption. A precision level can be selected for a layer from multiple predetermined precision levels. The layer may include a tensor multiplication operation on an activation tensor and a weight tensor. Activations in the activation tensor can be converted to lower-precision activations. Weights in the weight tensor can be converted to lower-precision weights. An algorithm determined based on the selected precision level can be used to compute an approximate product of an activation and a weight using the lower-precision activations and weights. Different precision levels can be selected for different layers, and different algorithms can be used to compute products in different layers.

Description

Multi-precision tensor multiplication in neural networks

Technical Field

The present disclosure relates generally to neural networks (also referred to as "deep neural networks" or "DNNs"), and more particularly to multi-precision tensor multiplication operations in DNNs.

Background

DNNs are widely used in a variety of artificial intelligence applications, ranging from computer vision to speech recognition and natural language processing, due to their ability to achieve high accuracy. However, high accuracy comes at significant computational cost. DNNs have extremely high computational demands because each inference can require hundreds of millions of tensor operations and a large amount of data to be read and written. Many tensor operations (e.g., multiply-accumulate (MAC) operations in convolutional layers, linear transformations in fully-connected layers, etc.) include tensor multiplication.

Drawings

Embodiments will be readily understood by the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements. In the figures of the accompanying drawings, embodiments are shown by way of example and not by way of limitation.

FIG. 1 illustrates an example DNN according to various embodiments.

FIG. 2 illustrates an example convolution in accordance with various embodiments.

FIG. 3 is a block diagram of a DNN system according to various embodiments.

FIG. 4 is a block diagram of a DNN module according to various embodiments.

FIG. 5 is a block diagram of a multi-precision tuning module in accordance with various embodiments.

FIG. 6 illustrates an example graph in accordance with various embodiments.

FIG. 7 illustrates data elements of different precision in accordance with various embodiments.

FIG. 8 illustrates an example multi-precision tensor multiplication operator in accordance with various embodiments.

FIG. 9 illustrates locality and reuse of data in a multi-precision tensor multiplication operation, according to various embodiments.

FIG. 10 is a flow diagram illustrating a method of performing a tensor multiplication operation in a DNN, according to various embodiments.

FIG. 11 is a block diagram of an example computing device, according to various embodiments.

Detailed Description

Overview

The past decade has witnessed a rapid increase in artificial intelligence (AI)-based data processing, in particular DNN-based data processing. DNNs are widely used in the fields of computer vision, speech recognition, and image and video processing, mainly because of their ability to achieve accuracy exceeding the human level. Significant improvements in DNN model size and accuracy, coupled with rapid increases in the computing power of execution platforms, have led to the adoption of DNN applications even within resource-constrained mobile and edge devices with limited energy availability.

Many DNN layers include tensor multiplication operations. A DNN layer may include one or more deep learning operations (also referred to as "neural network operations"), such as convolution, linear transformation, pooling, elementwise operations, linear operations, nonlinear operations, and the like. The fundamental part of a convolution or linear transformation is tensor multiplication. The deep learning operations in the DNN may be performed on one or more internal parameters (e.g., weights) of the DNN, which are determined during the training phase, and on one or more activations. An activation may be a data point (also referred to as a "data element" or "element"). An activation or weight of a DNN layer may be an element of a tensor of the DNN layer. A tensor is a data structure having multiple elements across one or more dimensions. Example tensors include vectors, which are one-dimensional tensors, and matrices, which are two-dimensional tensors. Three-dimensional tensors and even higher-dimensional tensors may also exist. A DNN layer may have an input tensor (also referred to as an "input feature map (IFM)") including one or more input activations (also referred to as "input elements") and a weight tensor including one or more weights. Weights are elements in the weight tensor. The weight tensor may be a kernel, a filter, or a set of filters. The output data of the DNN layer may be an output tensor (also referred to as an "output feature map (OFM)") including one or more output activations (also referred to as "output elements").

The data used in a tensor multiplication operation may have different precisions. Tensor multiplication with higher precision may provide better accuracy, but generally requires more computing resources such as power, time, memory, and the like. Tensor multiplication with lower precision may consume fewer computing resources, but may result in lower accuracy. With the increasing demand for reduced inference latency in AI applications, it may be important to enhance support for low-precision data formats in tensor multiplication operations.

Currently available methods for improving performance using low-precision formats include mixed-precision schemes based on experience and heuristics. Some methods default to converting all tensor multiplication operators to the BF16 data format (where BF stands for brain floating point). When the accuracy requirement is not met, precision-sensitive layers are rolled back to the FP32 data format (where FP stands for floating point) using a heuristic method. Some other methods make the FP32 and BF16 data types available to each tensor multiplication operator and use a tuning algorithm to search automatically. For example, a tuning procedure is used to determine whether an operator should use a single-precision format or a low-precision format. This approach helps avoid the unacceptable accuracy loss that mixed-precision optimization can cause. However, it has the disadvantage that many tensor multiplication operators may not be selected to use the low-precision format in the final solution, which can significantly limit utilization of the low-precision format. Furthermore, both heuristic and auto-tuning algorithms are limited to operating at the granularity of whole DNN layers. For precision-sensitive models (e.g., recommendation systems), some layers need to be rolled back to the FP32 format, which results in suboptimal use of acceleration capability and reduced utilization of computing capability.

Embodiments of the present disclosure may address at least some of the challenges and problems described above by facilitating multi-precision tensor multiplication in DNNs. Optimization of the multi-precision tensor multiplication operations can be automated to achieve the desired accuracy of the DNN while minimizing consumption of computing resources (e.g., time, power, memory, etc.). Tuning can be done not only between different tensor multiplication operators, but also within a single tensor multiplication operator, so that acceleration capability can be maximized.

In various embodiments of the present disclosure, various precision levels may be selected for various layers in the DNN. For example, the precision level of a DNN layer may be selected from a plurality of predetermined precision levels supported by a tensor multiplication operator configured to perform tensor multiplication operations in the DNN. A tensor multiplication operation in the layer may be performed on an activation tensor and a weight tensor. The activations in the activation tensor may be converted to lower-precision activations. The weights in the weight tensor may also be converted to lower-precision weights. An algorithm determined based on the precision level selected for the layer may be used to calculate an approximate product of an activation and a weight using the lower-precision activations and the lower-precision weights. A different precision level may be selected for another layer in the DNN. The tensor multiplication operator may be implemented in different modes of operation and use different algorithms, determined based on the different precision levels, to calculate the approximate products of activations and weights.
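As an illustration only, the following Python sketch shows one way a level-dependent algorithm could approximate an FP32 product from lower-precision operands. It is an assumption made for explanation, not the algorithm of the disclosure: each FP32 value is split into BF16 pieces by truncation, and a hypothetical precision level (1 to 3) controls how many partial products are summed. The names to_bf16, split_bf16, and approx_product are hypothetical.

```python
import struct


def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Convert to BF16 by truncation: keep the upper 16 bits of the FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def split_bf16(x, pieces):
    """Split an FP32 value into `pieces` BF16 values whose sum approximates x."""
    parts, residual = [], to_fp32(x)
    for _ in range(pieces):
        part = to_bf16(residual)
        parts.append(part)
        residual = to_fp32(residual - part)
    return parts


def approx_product(activation, weight, level):
    """Approximate activation * weight; `level` (1..3) is a hypothetical precision level."""
    a = split_bf16(activation, level)
    w = split_bf16(weight, level)
    # Keep only the most significant partial products (those with i + j < level).
    return sum(a[i] * w[j] for i in range(level) for j in range(level) if i + j < level)


activation, weight = 1.2345678, 3.1415927
exact = to_fp32(activation) * to_fp32(weight)
errors = [abs(exact - approx_product(activation, weight, level)) for level in (1, 2, 3)]
print(errors)  # the error shrinks as the assumed precision level increases
```

In this sketch, a higher level costs more low-precision multiplications (1, 3, and 6 partial products, respectively) in exchange for a smaller approximation error, which mirrors the accuracy versus resource trade-off described above.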

A graph may be used to determine the best precision settings for the DNN. The graph may include nodes representing layers in the DNN and one or more edges between the nodes. An edge may indicate a relationship between layers, e.g., a data flow between two layers. A node may encode the precision level of a layer. The initial state of a node may be set to a random precision level. The state of the node may be iteratively updated to find the optimal precision level. The optimal precision level may be a precision level of the layer that meets both the accuracy requirement of the DNN and the requirement on consumption of computing resources. For example, the requirement on consumption of computing resources may be maximized inference speed. The tensor multiplication operator may be configured based on the precision level selected for the layer. In executing a layer, the tensor multiplication operator may operate in an operating mode that matches the selected precision level of the layer.
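The iterative node update described above can be pictured with a toy search. The sketch below is a minimal illustration under stated assumptions and is not the tuning procedure of the disclosure: the level names, the COST and ERROR tables, the per-layer sensitivities, and the additive accuracy-loss model are all hypothetical. Each node starts at a random level and is repeatedly moved to the cheapest level that keeps the modeled accuracy loss within a budget.

```python
import random

# Hypothetical precision levels, ordered from cheapest to most precise.
LEVELS = ["low", "medium", "high"]
COST = {"low": 1.0, "medium": 3.0, "high": 6.0}      # assumed relative latency
ERROR = {"low": 4e-3, "medium": 2e-5, "high": 1e-7}  # assumed relative error


def accuracy_loss(state, sensitivity):
    """Toy model: loss is the sensitivity-weighted sum of per-layer errors."""
    return sum(sensitivity[node] * ERROR[level] for node, level in state.items())


def total_cost(state):
    return sum(COST[level] for level in state.values())


def tune(sensitivity, max_loss, seed=0, max_sweeps=20):
    rng = random.Random(seed)
    state = {node: rng.choice(LEVELS) for node in sensitivity}  # random initial state
    for _ in range(max_sweeps):
        changed = False
        for node in state:
            choice = LEVELS[-1]   # fall back to the highest precision
            for level in LEVELS:  # try the cheapest level first
                trial = {**state, node: level}
                if accuracy_loss(trial, sensitivity) <= max_loss:
                    choice = level
                    break
            if choice != state[node]:
                state[node] = choice
                changed = True
        if not changed:
            break
    return state


sensitivity = {"conv1": 1.0, "conv2": 0.1, "fc": 10.0}  # hypothetical layer sensitivities
edges = [("conv1", "conv2"), ("conv2", "fc")]            # data flow; unused by this toy search
result = tune(sensitivity, max_loss=1e-3)
print(result, accuracy_loss(result, sensitivity), total_cost(result))
```

A real tuner would measure accuracy and latency on the actual DNN rather than rely on a closed-form model, and could use the edges to account for format conversions between adjacent layers.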

The present disclosure provides a method of accelerating tensor multiplication operations in DNNs. Various precision levels may be used for various tensor multiplication operations in the DNN to maximize inference speed while still achieving the desired accuracy of the DNN. Many data formats may be supported, such as FP32, FP16, BF16, FP8, INT8, and the like.

For purposes of explanation, specific numbers, materials, and configurations are set forth in order to provide a thorough understanding of the illustrative implementations. However, it will be apparent to one skilled in the art that the present disclosure may be practiced without the specific details and/or with only some of the described aspects. In other instances, well-known features are omitted or simplified in order not to obscure the illustrative implementations.

Further, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration embodiments which may be practiced. It is to be understood that other embodiments may be utilized and structural or logical changes may be made without departing from the scope of the present disclosure. The following detailed description is, therefore, not to be taken in a limiting sense.

Various operations may be described as multiple discrete acts or operations in turn, in a manner that is most helpful in understanding the claimed subject matter. However, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations may not be performed in the order of presentation. The described operations may be performed in a different order than the described embodiments. Various additional operations may be performed or, in additional embodiments, described operations may be omitted.

For the purposes of this disclosure, the phrase "A or B" or the phrase "A and/or B" means (A), (B), or (A and B). For the purposes of this disclosure, the phrase "A, B, or C" or the phrase "A, B, and/or C" means (A), (B), (C), (A and B), (A and C), (B and C), or (A, B, and C). The term "between," when used with reference to a measurement range, includes the endpoints of the measurement range.

The description uses the phrases "in an embodiment" or "in various embodiments," which may each refer to one or more of the same or different embodiments. The terms "comprising," "including," "having," and the like, as used with respect to embodiments of the present disclosure, are synonymous. The present disclosure may use perspective-based descriptions such as "above," "below," "top," "bottom," and "side" to explain various features of the drawings, but these terms are merely for ease of discussion and do not imply a desired or required orientation. The figures are not necessarily drawn to scale. Unless otherwise indicated, the use of ordinal adjectives "first," "second," and "third," etc., to describe a common object merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.

In the following detailed description, various aspects of the illustrative implementations will be described using terms commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art.

The terms "substantially," "close," "approximately," "near," and "about" generally refer to being within +/-20% of a target value as described herein or as known in the art. Similarly, terms indicating the orientation of various elements, such as "coplanar," "perpendicular," "orthogonal," "parallel," or any other angle between elements, generally refer to being within +/-5-20% of a target value as described herein or as known in the art.

In addition, the terms "comprising," "including," "containing," "incorporating," "having," "with," or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a method, process, apparatus, or DNN accelerator that comprises a list of elements is not necessarily limited to only those elements, but may include other elements not expressly listed or inherent to such method, process, apparatus, or DNN accelerator. Furthermore, the term "or" refers to an inclusive "or" rather than an exclusive "or."

Each of the systems, methods, and devices of the present disclosure has several innovative aspects, no single one of which is solely responsible for all of the desirable attributes disclosed herein. The details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below.

Example DNN

FIG. 1 illustrates an example DNN 100 according to various embodiments. For purposes of illustration, the DNN 100 in FIG. 1 is a convolutional neural network (CNN). In other embodiments, the DNN 100 may be another type of DNN. The DNN 100 is trained to receive images and output classifications of objects in the images. In the embodiment of FIG. 1, the DNN 100 receives an input image 105 comprising objects 115, 125, and 135. The DNN 100 includes a sequence of layers including a plurality of convolutional layers 110 (individually referred to as "convolutional layer 110"), a plurality of pooling layers 120 (individually referred to as "pooling layer 120"), and a plurality of fully-connected layers 130 (individually referred to as "fully-connected layer 130"). In other embodiments, the DNN 100 may include fewer, more, or different layers. During inference, the layers of the DNN 100 perform tensor computations that include many tensor operations, such as convolutions (e.g., multiply-accumulate (MAC) operations, etc.), pooling operations, elementwise operations (e.g., elementwise addition, elementwise multiplication, etc.), other types of tensor operations, or some combination thereof.

The convolutional layers 110 summarize the presence of features in the input image 105. The convolutional layers 110 act as feature extractors. The first layer of the DNN 100 is a convolutional layer 110. In an example, a convolutional layer 110 performs a convolution on an input tensor 140 (also referred to as IFM 140) and a filter 150. As shown in FIG. 1, the IFM 140 is represented by a 7×7×3 three-dimensional (3D) matrix. The IFM 140 includes 3 input channels, each represented by a 7×7 two-dimensional (2D) matrix. The 7×7 2D matrix includes 7 input elements (also referred to as input points) in each row and 7 input elements in each column. The filter 150 is represented by a 3×3×3 3D matrix. The filter 150 includes 3 kernels, each of which may correspond to a different input channel of the IFM 140. A kernel is a 2D matrix of weights, where the weights are arranged in columns and rows. A kernel may be smaller than the IFM. In the embodiment of FIG. 1, each kernel is represented by a 3×3 2D matrix. The 3×3 kernel includes 3 weights in each row and 3 weights in each column. The weights may be initialized and then updated by backpropagation using gradient descent. The magnitudes of the weights may indicate the importance of the filter 150 in extracting features from the IFM 140.

The convolution includes MAC operations on the input elements in the IFM 140 and the weights in the filter 150. The convolution may be a standard convolution 163 or a depthwise convolution 183. In the standard convolution 163, the entire filter 150 slides across the IFM 140. All input channels are combined to produce an output tensor 160 (also referred to as output feature map (OFM) 160). The OFM 160 is represented by a 5×5 2D matrix. The 5×5 2D matrix includes 5 output elements (also referred to as output points) in each row and 5 output elements in each column. For illustration purposes, the standard convolution in the embodiment of FIG. 1 includes one filter. In embodiments where there are multiple filters, the standard convolution may produce multiple output channels in the OFM 160.

The multiplication applied between a kernel-sized patch of the IFM 140 and a kernel may be a dot product. A dot product is the elementwise multiplication between the kernel-sized patch of the IFM 140 and the corresponding kernel, which is then summed, always resulting in a single value. Because it results in a single value, the operation is often referred to as a "scalar product." Using a kernel smaller than the IFM 140 is intentional, as it allows the same kernel (set of weights) to be multiplied by the IFM 140 multiple times at different points on the IFM 140. Specifically, the kernel is applied systematically to each overlapping part or kernel-sized patch of the IFM 140, left to right and top to bottom. The result of multiplying the kernel with the IFM 140 one time is a single value. As the kernel is applied multiple times to the IFM 140, the multiplication result is a 2D matrix of output elements. As such, the 2D output matrix from the standard convolution 163 (i.e., the OFM 160) is referred to as an OFM.
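The sliding dot product can be sketched in a few lines of Python. This is a minimal illustration that assumes a stride of one and no padding, with tensors stored as nested lists indexed [channel][row][column]; the function name conv2d_standard and the all-ones test data are hypothetical.

```python
def conv2d_standard(ifm, filt, stride=1):
    """Standard convolution of ifm[c][y][x] with one filter filt[c][ky][kx].

    All input channels are combined, so the result is a single 2D matrix.
    """
    channels = len(ifm)
    height, width = len(ifm[0]), len(ifm[0][0])
    k_h, k_w = len(filt[0]), len(filt[0][0])
    out_h = (height - k_h) // stride + 1
    out_w = (width - k_w) // stride + 1
    ofm = [[0] * out_w for _ in range(out_h)]
    for oy in range(out_h):          # top to bottom
        for ox in range(out_w):      # left to right
            acc = 0                  # accumulator for the MAC operations
            for c in range(channels):
                for ky in range(k_h):
                    for kx in range(k_w):
                        acc += ifm[c][oy * stride + ky][ox * stride + kx] * filt[c][ky][kx]
            ofm[oy][ox] = acc        # one dot product yields one output element
    return ofm


ifm = [[[1] * 7 for _ in range(7)] for _ in range(3)]   # 7x7x3 input of ones
filt = [[[1] * 3 for _ in range(3)] for _ in range(3)]  # 3x3x3 filter of ones
ofm = conv2d_standard(ifm, filt)
print(len(ofm), len(ofm[0]), ofm[0][0])  # 5 5 27
```

With all-ones data, every output element equals the number of multiplied pairs in one patch (3 × 3 × 3 = 27), which makes the 5×5 output shape easy to check.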

In the depthwise convolution 183, the input channels are not combined. Instead, MAC operations are performed on an individual input channel and an individual kernel to produce an output channel. As shown in FIG. 1, the depthwise convolution 183 produces a depthwise output tensor 180. The depthwise output tensor 180 is represented by a 5×5×3 3D matrix. The depthwise output tensor 180 includes 3 output channels, each represented by a 5×5 2D matrix. The 5×5 2D matrix includes 5 output elements in each row and 5 output elements in each column. Each output channel is the result of MAC operations on an input channel of the IFM 140 and a kernel of the filter 150. For example, the first output channel (patterned with dots) is the result of MAC operations on the first input channel (patterned with dots) and the first kernel (patterned with dots), the second output channel (patterned with horizontal stripes) is the result of MAC operations on the second input channel (patterned with horizontal stripes) and the second kernel (patterned with horizontal stripes), and the third output channel (patterned with diagonal stripes) is the result of MAC operations on the third input channel (patterned with diagonal stripes) and the third kernel (patterned with diagonal stripes). In such a depthwise convolution, the number of input channels equals the number of output channels, and each output channel corresponds to a different input channel. The input channels and the output channels are collectively referred to as depthwise channels. After the depthwise convolution, a pointwise convolution 193 is performed on the depthwise output tensor 180 and a 1×1×3 tensor 190 to produce the OFM 160.
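The two-step structure (one kernel per channel, then a 1×1 mix across channels) can be sketched as follows. This is an illustrative example assuming a stride of one and no padding, with tensors stored as nested lists indexed [channel][row][column]; the function names and the constant-valued test data are hypothetical.

```python
def depthwise_conv(ifm, filt):
    """Convolve each input channel with its own kernel; channels are not combined."""
    outputs = []
    for channel, kernel in zip(ifm, filt):
        k_h, k_w = len(kernel), len(kernel[0])
        out_h = len(channel) - k_h + 1
        out_w = len(channel[0]) - k_w + 1
        outputs.append([
            [
                sum(channel[oy + ky][ox + kx] * kernel[ky][kx]
                    for ky in range(k_h) for kx in range(k_w))
                for ox in range(out_w)
            ]
            for oy in range(out_h)
        ])
    return outputs


def pointwise_conv(tensor, weights):
    """Combine channels at each (x, y) position with a 1x1xC weight vector."""
    out_h, out_w = len(tensor[0]), len(tensor[0][0])
    return [
        [sum(w * ch[y][x] for w, ch in zip(weights, tensor)) for x in range(out_w)]
        for y in range(out_h)
    ]


# 7x7x3 input whose channels are filled with 1, 2, and 3; 3x3 kernels of ones.
ifm = [[[value] * 7 for _ in range(7)] for value in (1, 2, 3)]
filt = [[[1] * 3 for _ in range(3)] for _ in range(3)]
depth_out = depthwise_conv(ifm, filt)           # 5x5x3: values 9, 18, 27 per channel
ofm = pointwise_conv(depth_out, [1, 1, 1])      # 5x5: 9 + 18 + 27 = 54 everywhere
print(len(depth_out), depth_out[1][0][0], ofm[0][0])  # 3 18 54
```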

The OFM 160 is then passed to the next layer in the sequence. In some embodiments, the OFM 160 is passed through an activation function. An example activation function is the rectified linear unit (ReLU). ReLU is a calculation that returns the value provided as input directly, or the value zero if the input is zero or less. A convolutional layer 110 may receive several images as input and calculate the convolution of each of them with each of the kernels. This process may be repeated several times. For example, the OFM 160 is passed to a subsequent convolutional layer 110 (i.e., a convolutional layer 110 that follows, in the sequence, the convolutional layer 110 that generated the OFM 160). The subsequent convolutional layer 110 performs a convolution on the OFM 160 with new kernels and generates a new feature map. The new feature map may also be normalized and resized. The new feature map may be convolved again by a further subsequent convolutional layer 110, and so on.

In some embodiments, a convolutional layer 110 has four hyperparameters: the number of kernels, the size F of the kernels (e.g., a kernel has dimensions of F×F×D pixels), the stride S with which the window corresponding to the kernel is dragged over the image (e.g., a stride of one means moving the window one pixel at a time), and the zero padding P (e.g., adding a black contour with a thickness of P pixels to the input image of the convolutional layer 110). The convolutional layers 110 may perform various types of convolutions, such as 2-dimensional convolutions, dilated or atrous convolutions, spatially separable convolutions, depthwise separable convolutions, transposed convolutions, and the like. The DNN 100 includes 16 convolutional layers 110. In other embodiments, the DNN 100 may include a different number of convolutional layers.
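The effect of F, S, and P on the output size can be illustrated with the commonly used relationship below. The formula is standard convolution arithmetic rather than something stated in this description, and it assumes an undilated convolution; the function name conv_output_size is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial axis: floor((W + 2P - F) / S) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(7, 3))                       # 5, as in the 7x7 input / 3x3 kernel example
print(conv_output_size(7, 3, stride=1, padding=1))  # 7, padding of one preserves the size
print(conv_output_size(7, 3, stride=2, padding=1))  # 4
```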

The pooling layers 120 downsample the feature maps generated by the convolutional layers, e.g., by summarizing the presence of features in patches of the feature maps. A pooling layer 120 is placed between two convolutional layers 110: a preceding convolutional layer 110 (the convolutional layer 110 preceding the pooling layer 120 in the sequence of layers) and a subsequent convolutional layer 110 (the convolutional layer 110 following the pooling layer 120 in the sequence of layers). In some embodiments, a pooling layer 120 is added after a convolutional layer 110, e.g., after an activation function (e.g., ReLU, etc.) has been applied to the OFM 160.

A pooling layer 120 receives the feature maps generated by the preceding convolutional layer 110 and applies a pooling operation to the feature maps. The pooling operation reduces the size of the feature maps while preserving their important characteristics. Accordingly, the pooling operation improves the efficiency of the DNN and helps avoid overfitting. The pooling layers 120 may perform the pooling operation through average pooling (calculating the average value of each patch of the feature map), max pooling (calculating the maximum value of each patch of the feature map), or a combination of both. The size of the pooling operation is smaller than the size of the feature maps. In various embodiments, the pooling operation is 2×2 pixels applied with a stride of two pixels, so that the pooling operation reduces each dimension of a feature map by a factor of 2, i.e., the number of pixels or values in the feature map is reduced to one quarter. In an example, a pooling layer 120 applied to a 6×6 feature map results in a 3×3 output pooled feature map. The output of the pooling layer 120 is input into the subsequent convolutional layer 110 for further feature extraction. In some embodiments, the pooling layer 120 operates on each feature map separately to create a new set of the same number of pooled feature maps.
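The 6×6 to 3×3 example can be reproduced with a short sketch. It is a minimal illustration assuming a single 2D feature map stored as a nested list; the function name pool2d and the test values are hypothetical.

```python
def pool2d(fmap, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D feature map."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    pooled = []
    for oy in range(out_h):
        row = []
        for ox in range(out_w):
            patch = [fmap[oy * stride + dy][ox * stride + dx]
                     for dy in range(size) for dx in range(size)]
            row.append(max(patch) if mode == "max" else sum(patch) / len(patch))
        pooled.append(row)
    return pooled


fmap = [[r * 6 + c for c in range(6)] for r in range(6)]  # 6x6 map holding 0..35
max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="avg")
print(len(max_pooled), len(max_pooled[0]))  # 3 3
print(max_pooled[0][0], avg_pooled[0][0])   # 7 3.5
```

The 36 input values become 9 output values, i.e., one quarter of the original count.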

The fully-connected layers 130 are the last layers of the DNN. A fully-connected layer 130 may or may not be convolutional. A fully-connected layer 130 may also be referred to as a linear layer. In some embodiments, a fully-connected layer 130 (e.g., the first fully-connected layer in the DNN 100) may receive an input operand. The input operand may define the output of the convolutional layers 110 and the pooling layers 120 and include the values of the last feature map generated by the last pooling layer 120 in the sequence. The fully-connected layer 130 may apply a linear transformation to the input operand through a weight matrix. The weight matrix may be a kernel of the fully-connected layer 130. The linear transformation may include a tensor multiplication between the input operand and the weight matrix. The result of the linear transformation may be an output operand. In some embodiments, the fully-connected layer may further apply a nonlinear transformation (e.g., by using a nonlinear activation function) to the result of the linear transformation to generate the output operand. The output operand may contain as many elements as there are classes, with element i representing the probability that the image belongs to class i. Each element is therefore between 0 and 1, and the sum of all elements is one. These probabilities are calculated by the last fully-connected layer 130 by using a logistic function (binary classification) or a SoftMax function (multi-class classification) as the activation function.

In some embodiments, the fully-connected layers 130 classify the input image 105 and return an operand of size N, where N is the number of classes in the image classification problem. In the embodiment of FIG. 1, N equals 3, as there are 3 objects 115, 125, and 135 in the input image. Each element of the operand indicates the probability that the input image 105 belongs to a class. To calculate the probabilities, the fully-connected layers 130 multiply each input element by a weight, compute the sum, and then apply an activation function (e.g., logistic if N=2, SoftMax if N>2). This is equivalent to multiplying the input operand by a matrix containing the weights. In an example, the vector includes 3 probabilities: a first probability indicating that the object 115 is a tree, a second probability indicating that the object 125 is a car, and a third probability indicating that the object 135 is a person. In other embodiments where the input image 105 includes different objects or a different number of objects, the individual values may be different.
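The multiply, sum, and activate sequence can be sketched for the three-class case. This is an illustrative example only: the input values and the weight matrix are made up, bias terms are omitted because the description does not mention them, and the function names are hypothetical.

```python
import math


def fully_connected(inputs, weights):
    """Linear transformation: one weighted sum of the inputs per class."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def softmax(logits):
    """Map logits to probabilities that lie between 0 and 1 and sum to one."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


inputs = [1.0, 2.0]                  # hypothetical flattened feature values
weights = [[1.0, 0.0],               # hypothetical weights, one row per class
           [0.0, 1.0],
           [1.0, 1.0]]
logits = fully_connected(inputs, weights)   # [1.0, 2.0, 3.0]
probabilities = softmax(logits)
print(probabilities, sum(probabilities))
```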

Example convolution

FIG. 2 illustrates an example convolution in accordance with various embodiments. The convolution may be a deep learning operation in a convolutional layer of a DNN, such as a convolutional layer 110 in FIG. 1. The convolution may be performed on an input tensor 210 and filters 220 (individually referred to as "filter 220"). The result of the convolution is an output tensor 230. In some embodiments, the convolution is performed by a DNN accelerator. An example of the DNN accelerator may be the DNN accelerator 302 in FIG. 3.

In the embodiment of FIG. 2, the input tensor 210 includes activations (also referred to as "input activations," "elements," or "input elements") arranged in a 3D matrix. The input elements are data points in the input tensor 210. The input tensor 210 has a spatial size $H_{in} \times W_{in} \times C_{in}$, where $H_{in}$ is the height of the 3D matrix (i.e., the length along the Y-axis, which indicates the number of activations in a column in the 2D matrix of each input channel), $W_{in}$ is the width of the 3D matrix (i.e., the length along the X-axis, which indicates the number of activations in a row in the 2D matrix of each input channel), and $C_{in}$ is the depth of the 3D matrix (i.e., the length along the Z-axis, which indicates the number of input channels). For simplicity and illustration purposes, the input tensor 210 has a spatial size of 7×7×3, i.e., the input tensor 210 includes three input channels, and each input channel has a 7×7 2D matrix. Each input element in the input tensor 210 may be represented by an (X, Y, Z) coordinate. In other embodiments, the height, width, or depth of the input tensor 210 may be different.

Each filter 220 includes weights arranged in a 3D matrix. The values of the weights may be determined by training the DNN. A filter 220 has a spatial size $H_f \times W_f \times C_f$, where $H_f$ is the height of the filter (i.e., the length along the Y-axis, which indicates the number of weights in a column in each kernel), $W_f$ is the width of the filter (i.e., the length along the X-axis, which indicates the number of weights in a row in each kernel), and $C_f$ is the depth of the filter (i.e., the length along the Z-axis, which indicates the number of channels). In some embodiments, $C_f$ equals $C_{in}$. For simplicity and illustration purposes, each filter 220 in FIG. 2 has a spatial size of 3×3×3, i.e., the filter 220 includes 3 convolution kernels, each having a spatial size of 3×3. In other embodiments, the height, width, or depth of the filter 220 may be different. The spatial size of the convolution kernels is smaller than the spatial size of the 2D matrix of each input channel in the input tensor 210.

An activation or weight may take one or more bytes in memory. The number of bytes for an activation or weight may depend on the data format. For example, when the activation or weight has the INT8 format, it occupies one byte. When the activation or weight has the FP16 format, it occupies two bytes. Other data formats may be used for activations or weights.

In the convolution, each filter 220 slides across the input tensor 210 and generates a 2D matrix for an output channel in the output tensor 230. In the embodiment of FIG. 2, the 2D matrix has a spatial size of 5×5. The output tensor 230 includes activations (also referred to as "output activations," "elements," or "output elements") arranged in a 3D matrix. The output activations are data points in the output tensor 230. The output tensor 230 has a spatial size $H_{out} \times W_{out} \times C_{out}$, where $H_{out}$ is the height of the 3D matrix (i.e., the length along the Y-axis, which indicates the number of output activations in a column in the 2D matrix of each output channel), $W_{out}$ is the width of the 3D matrix (i.e., the length along the X-axis, which indicates the number of output activations in a row in the 2D matrix of each output channel), and $C_{out}$ is the depth of the 3D matrix (i.e., the length along the Z-axis, which indicates the number of output channels). $C_{out}$ may equal the number of filters 220 in the convolution. $H_{out}$ and $W_{out}$ may depend on the heights and widths of the input tensor 210 and each filter 220.

As part of the convolution, MAC operations may be performed on a 3×3×3 sub-tensor 215 (highlighted with a dotted pattern in FIG. 2) in the input tensor 210 and each filter 220. The result of the MAC operations on the sub-tensor 215 and one filter 220 is an output activation. In some embodiments (e.g., embodiments where the convolution is an integer convolution), an output activation may include 8 bits, e.g., one byte. In other embodiments (e.g., embodiments where the convolution is a floating-point convolution), an output activation may include more than one byte. For example, an output element may include two bytes.

After the MAC operations on the sub-tensors 215 and all filters 220 are completed, a vector 235 is generated. Vector 235 is highlighted with diagonal lines in fig. 2. Vector 235 includes a sequence of output activations arranged along the Z-axis. The output activations in vector 235 have the same (X, Y) coordinates, but the output activations correspond to different output channels and have different Z coordinates. The dimension of vector 235 along the Z-axis may be equal to the total number of output channels in output tensor 230. After generating vector 235, further MAC operations are performed to generate additional vectors until output tensor 230 is generated.

In some embodiments, MAC operations on 2 x 3 sub-tensors (e.g., sub-tensor 215) and filter 220 may be performed by multiple PEs. One or more PEs may receive an input operand (e.g., input operand 217 shown in FIG. 2) and a weight operand (e.g., weight operand 227 shown in FIG. 2). The input operand 217 includes a sequence of activations having the same (X, Y) coordinates but different Z coordinates. The input operand 217 includes an activation from each input channel in the input tensor 210. The weight operand 227 includes a sequence of weights having the same (X, Y) coordinates but different Z coordinates. The weight operand 227 includes a weight from each channel in the filter 220. The activations in the input operand 217 and the weights in the weight operand 227 may be sequentially fed into the PE. The PE may receive one activation and one weight at a time (an "activation-weight pair") and multiply the activation by the weight. The location of the activation in input operand 217 may match the location of the weight in weight operand 227. The activation and the weight may correspond to the same channel.
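The sequential feeding of activation-weight pairs described above can be illustrated with a minimal Python sketch. The function name `pe_mac` and the sample values are hypothetical and chosen only for illustration; a real PE would operate on hardware registers rather than Python lists.

```python
def pe_mac(input_operand, weight_operand):
    """Multiply-accumulate over activation-weight pairs that share a channel.

    input_operand:  activations with the same (X, Y) but different Z coordinates
    weight_operand: weights with the same (X, Y) but different Z coordinates
    """
    if len(input_operand) != len(weight_operand):
        raise ValueError("operands must cover the same number of channels")
    accumulator = 0.0
    # Pairs are consumed one at a time; position i in both operands
    # corresponds to the same channel.
    for activation, weight in zip(input_operand, weight_operand):
        accumulator += activation * weight
    return accumulator


# Three channels: 1.0*0.5 + 2.0*(-1.0) + 3.0*2.0
partial_sum = pe_mac([1.0, 2.0, 3.0], [0.5, -1.0, 2.0])
```

Summing such partial results over all (X, Y) positions of the sub-tensor and one filter would give a single output activation.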

An activation or weight may be a floating point number. Floating point numbers may have various data formats, such as FP32, FP16, BF16, and the like. A floating point number may be a positive or negative number with a decimal point. A floating point number may be represented by a sequence of bits including one or more bits representing the sign (e.g., positive or negative) of the floating point number, bits representing the exponent of the floating point number, and bits representing the mantissa of the floating point number. The mantissa is the part of a floating point number that represents the significant digits of the number. The mantissa is multiplied by the base raised to the power of the exponent to give the actual value of the floating point number.
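As an illustration of the sign, exponent, and mantissa fields, the following Python sketch unpacks an FP32 value according to the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits with a bias of 127, 23 mantissa bits). The helper names are hypothetical, and the reconstruction covers normal numbers only (not zeros, subnormals, infinities, or NaNs).

```python
import struct


def fp32_fields(x):
    """Return the (sign, exponent, mantissa) bit fields of x stored as FP32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31               # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF      # 23 bits, implicit leading 1 not stored
    return sign, exponent, mantissa


def fp32_value(sign, exponent, mantissa):
    """Rebuild the value of a normal FP32 number from its fields."""
    significand = 1.0 + mantissa / 2**23
    return (-1.0) ** sign * significand * 2.0 ** (exponent - 127)


# -6.5 = -1.625 * 2**2
fields = fp32_fields(-6.5)
```

Here the base is 2, so the stored exponent 129 corresponds to a scale factor of 2 raised to the power (129 - 127).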

In some embodiments, the output activations in the output tensor 230 may be further processed based on one or more activation functions before they are stored or input into the next layer of the DNN. The processing based on the one or more activation functions may be at least part of the post-processing of the convolution. In some embodiments, post-processing may include one or more other calculations, such as offset calculations, bias calculations, and the like. The result of the post-processing may be stored in a local memory of the computation block and used as input for the next DNN layer. In some embodiments, the input activations in the input tensor 210 may be the result of post-processing of the previous DNN layer.

Example DNN System

Fig. 3 is a block diagram of a DNN system 300 according to various embodiments. The entire DNN system 300 or a portion of DNN system 300 may be implemented in one or more computing devices, such as computing device 1100 in fig. 11. DNN system 300 may generate and execute a DNN, such as DNN 100 in fig. 1. As shown in fig. 3, DNN system 300 includes DNN module 301 and DNN accelerator 302. In other embodiments, alternative configurations, different, or additional components may be included in DNN system 300. For example, DNN system 300 may include multiple DNN modules or multiple DNN accelerators. Furthermore, the functionality attributed to components of DNN system 300 may be implemented by different components or different systems included in DNN system 300. In some embodiments, DNN module 301 and DNN accelerator 302 may include different types of processing units. DNN module 301 and DNN accelerator 302 may be implemented in the same chip or in separate chips.

DNN module 301 facilitates the generation and deployment of DNNs. In some embodiments, DNN module 301 may generate and train DNNs. For example, DNN module 301 may define a hierarchical architecture of DNNs. DNN module 301 may also determine internal parameters of DNN through a DNN training process. DNN module 301 may also determine one or more hyper-parameters defining how to train the DNN. An example hyper-parameter is a sparsity ratio defining a sparsity level of one or more deep learning tensors of the DNN.

After training the DNN, DNN module 301 may compress the DNN to reduce the amount of computing resources (e.g., power, processing elements, time, memory, etc.) required to execute the DNN. For example, DNN module 301 may determine a precision level for the tensor multiplication operations in a DNN layer. One or more layers may be assigned a lower precision level so that the time required to execute those layers is reduced. Other layers may have a higher precision level to ensure that the accuracy of the DNN meets the desired accuracy. DNN module 301 may provide configuration parameters to configure computation block 330 (e.g., multi-precision tensor multiplication operator 350) to operate in a mode matching the selected precision level.

DNN module 301 may deploy trained, compressed, or validated DNNs for use in deep learning applications. DNN module 301 may control the inference process of the trained, compressed, or validated DNN. In some embodiments, DNN module 301 may distribute the trained, compressed, or validated DNNs to devices or systems that may use the DNNs to perform the tasks for which the DNNs are trained (e.g., image classification, motion planning, etc.). In other embodiments, DNN module 301 may use DNN accelerator 302 to facilitate the deployment of DNNs. For example, DNN module 301 may receive data from a device or system coupled with DNN system 300 and input the received data (or data generated by DNN module 301, e.g., based on the received data) into the DNN. DNN module 301 may generate instructions (e.g., configuration files) that control the operation of DNN accelerator 302 during DNN inference. DNN module 301 may receive the output of the DNN from DNN accelerator 302. DNN module 301 may transmit the output of the DNN (or the result of processing the output of the DNN by DNN module 301) to a device or system. Certain aspects of DNN module 301 are provided below in connection with fig. 4.

DNN accelerator 302 executes the DNN provided by DNN module 301. For example, DNN accelerator 302 may execute the DNN, e.g., by running deep learning operations in the DNN, for training the DNN or for performing tasks using the trained, compressed, or validated DNN. As shown in fig. 3, DNN accelerator 302 includes memory 310, DMA (direct memory access) engine 320, and a plurality of computation blocks 330 (individually referred to as "computation block 330"). In other embodiments, alternative configurations, different, or additional components may be included in DNN accelerator 302. For example, DNN accelerator 302 may include more than one memory 310 or DMA engine 320. As another example, DNN accelerator 302 may include a single computation block 330. Furthermore, the functionality attributed to the components of DNN accelerator 302 may be implemented by different components included in DNN accelerator 302 or by different systems. The components of DNN accelerator 302 may be implemented in hardware, software, firmware, or some combination thereof.

Memory 310 stores data associated with deep learning operations performed by the DNN accelerator. In some embodiments, memory 310 may store data to be used by computation block 330 for DNN inference. For example, memory 310 may store weights determined by training DNNs, such as weights of convolutional layers. As another example, memory 310 may store an input of the DNN or an output of the DNN. Memory 310 may also store data generated by computation block 330 from performing deep learning operations in the DNN. Example deep learning operations include convolution (also referred to as "convolution operations"), pooling operations, element-by-element operations, activation functions, other types of deep learning operations, or some combination thereof. Memory 310 may be the main memory of DNN accelerator 302. In some embodiments, memory 310 includes one or more Dynamic Random Access Memories (DRAMs).

DMA engine 320 facilitates data transfer between memory 310 and the local memory of computation block 330. For example, DMA engine 320 may read data from memory 310 and write data to the local memory of computation block 330. As another example, DMA engine 320 may read data from the local memory of computation block 330 and write data to memory 310. DMA engine 320 provides DMA features that allow computation block 330 to initiate a data transfer between memory 310 and the local memory of computation block 330 and to perform other operations while the data transfer is in progress. In some embodiments, DMA engine 320 may read a tensor from memory 310 and modify the tensor in a manner optimized for computation block 330 before DMA engine 320 writes the tensor into the local memory of computation block 330.

The computation block 330 may perform deep learning operations in the DNN. For example, the computation block 330 may run one deep learning operation in a DNN layer, or a portion of the deep learning operation, at a time. The computation block 330 may be capable of running various types of deep learning operations, such as convolution, pooling, element-by-element operations, linear operations, nonlinear operations, and the like. In an example, the computation block 330 may perform a convolution, such as a standard convolution or a depthwise convolution. In some embodiments, the computation block 330 receives the input tensor and one or more convolution kernels and performs a convolution with the input tensor and the convolution kernels. The result of the convolution may be an output tensor, which may be further processed, for example, by the computation block 330 or another computation block 330. In some embodiments, the operations of a DNN layer may be run in parallel by multiple computation blocks 330. For example, each of the plurality of computation blocks 330 may perform a portion of the workload of a convolution. Data may be shared between the computation blocks 330. The computation block 330 may also be referred to as a computation tile. In some embodiments, each computation block 330 may be a processing unit.

In the embodiment of fig. 3, each computation block 330 includes a local memory 340 and a multi-precision tensor multiplication operator 350. Some or all of the components of the computation block 330 may be implemented on the same chip. In other embodiments, alternative configurations, different or additional components may be included in the computation block 330. Furthermore, the functionality attributed to the components of the computation block 330 may be implemented by different components included in the computation block 330, different computation blocks 330, another component of the DNN accelerator 302, or different systems. The components of the computation block 330 may be implemented in hardware, software, firmware, or some combination thereof.

The local memory 340 is local to the corresponding computing block 330. In the embodiment of fig. 3, local memory 340 is internal to computing block 330. In other embodiments, local memory 340 may be external to computing block 330. Data in local memory 340 may be transferred to memory 310 or from memory 310, for example, by DMA engine 320. In some embodiments, the data in the local memory 340 may be transferred to the local memory of another computing block 330 or transferred from the local memory of another computing block 330. The local memory 340 may store data received, used, or generated by the multi-precision tensor multiplication operator 350. Examples of data may include input activations, weights, output activations, sparsity bitmaps, and the like.

Local memory 340 may include different levels of memory. For example, local memory 340 may include registers, L1 cache, and the like. Different memories in the local memory 340 may be arranged at different locations in the computation block 330. In some embodiments, local memory 340 includes one or more Static Random Access Memories (SRAMs). Local memory 340 may be byte-addressable, with each memory address identifying a single byte (eight bits) of storage. In some embodiments, local memory 340 may include memory banks. The number of memory banks in the local memory 340 may be 16, 64, 128, 256, 512, 1024, 2048 or another number. A memory bank may include a plurality of storage units. In examples, a memory bank may include 8, 16, 64, or a different number of storage units. A memory bank or a storage unit in a memory bank may have a memory address. In an example, a storage unit may store a single byte, and data larger than a single byte may be stored in storage units having consecutive memory addresses, i.e., in adjacent storage units.

Data elements of different precision may require different amounts of memory. For example, one 8-bit storage unit may store a data element in FP8 format. Two storage units may be required to store a data element in FP16 or BF16 format, which has 16 bits. Four storage units may be required to store a data element in FP32 format, which has 32 bits. Data elements of different precision may also require different memory bandwidths. In some embodiments, a single read cycle may read a predetermined number of bits from local memory 340. In an example where 16 bits may be transferred from the local memory 340 in a single read cycle, one read cycle may transfer two data elements in FP8 format or a single data element in FP16 or BF16 format, while transferring a single data element in FP32 format would require two read cycles. Thus, reducing data precision may reduce both memory storage space and memory bandwidth.
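The storage and bandwidth arithmetic in this example can be checked with a short Python sketch. The table `FORMAT_BITS` and the function names are hypothetical; the 8-bit storage unit and 16-bit read cycle are the example values from the text, not fixed properties of local memory 340.

```python
import math

# Bit widths of the formats mentioned in the text.
FORMAT_BITS = {"FP8": 8, "FP16": 16, "BF16": 16, "FP32": 32}


def storage_units(fmt, unit_bits=8):
    """Number of storage units (one byte each by default) for one data element."""
    return math.ceil(FORMAT_BITS[fmt] / unit_bits)


def read_cycles(fmt, num_elements, bits_per_cycle=16):
    """Read cycles needed to transfer num_elements elements of format fmt."""
    return math.ceil(num_elements * FORMAT_BITS[fmt] / bits_per_cycle)


for fmt in FORMAT_BITS:
    print(fmt, storage_units(fmt), read_cycles(fmt, num_elements=1))
```

With a 16-bit read cycle, two FP8 elements fit in one cycle, whereas one FP32 element needs two cycles.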

The multi-precision tensor multiplication operator 350 may receive data from the local memory 340 and perform multi-precision tensor multiplication operations on the data. As shown in fig. 3, the multi-precision tensor multiplication operator 350 includes a precision reduction operator 360 and a multiplication operator 370. In other embodiments, the multi-precision tensor multiplication operator 350 may include more than one precision reduction operator 360 or multiplication operator 370.

The precision reduction operator 360 may reduce the precision of data elements, such as data elements stored in the memory 310 or the local memory 340. The data elements may include activations and weights. In some embodiments, the precision reduction operator 360 may convert a higher precision data element into one or more lower precision data elements. For example, the precision reduction operator 360 may remove one or more bits from the data element to reduce its precision. In some embodiments (e.g., embodiments where the precision reduction operator 360 converts one data element into a plurality of lower precision data elements), the precision reduction operator 360 may generate the first lower precision data element by removing a number of bits from the original data element. The number of removed bits may be determined based on the difference between the two precisions. The precision reduction operator 360 may generate a second lower precision data element based on the first lower precision data element.

In an example where the precision reduction operator 360 converts data elements in FP32 format to data elements in BF16 format, the precision reduction operator 360 may reduce the precision of the weights using a conversion function expressed as:

$$W0_{BF16} = (BF16)\,W_{FP32}$$

$$W1_{BF16} = (BF16)\,\big(W_{FP32} - (FP32)\,W0_{BF16}\big)$$

where $W_{FP32}$ is a weight in FP32 format, which may be in the weight tensor determined from training the DNN, $W0_{BF16}$ is a lower precision weight in BF16 format, $W1_{BF16}$ is another lower precision weight in BF16 format, $(BF16)\,W_{FP32}$ represents the operation of removing 16 bits from $W_{FP32}$ to convert $W_{FP32}$ to $W0_{BF16}$, and $(FP32)\,W0_{BF16}$ represents the operation of adding 16 bits (e.g., 16 zero bits) to $W0_{BF16}$. $W0_{BF16}$ and $W1_{BF16}$ may be used in place of $W_{FP32}$ in the tensor multiplication operation. In some embodiments, the reduction in weight precision may be performed offline, e.g., prior to DNN inference. The offline reduction of weight precision may be performed by DNN module 301. For example, after multi-precision tuning module 450 selects a precision level for the DNN layer, multi-precision tuning module 450 may reduce the precision of the weights in the DNN layer based on the selected precision level.

Similarly, the precision reduction operator 360 may reduce the precision of the activations using a conversion function expressed as:

$$A0_{BF16} = (BF16)\,A_{FP32}$$

$$A1_{BF16} = (BF16)\,\big(A_{FP32} - (FP32)\,A0_{BF16}\big)$$

where $A_{FP32}$ is an activation in FP32 format, which may be in an activation tensor that is input into the DNN or generated by a previous layer in the DNN, $A0_{BF16}$ is a lower precision activation in BF16 format, $A1_{BF16}$ is another lower precision activation in BF16 format, $(BF16)\,A_{FP32}$ represents the operation of removing 16 bits from $A_{FP32}$ to convert $A_{FP32}$ to $A0_{BF16}$, and $(FP32)\,A0_{BF16}$ represents the operation of adding 16 bits (e.g., 16 zero bits) to $A0_{BF16}$. $A0_{BF16}$ and $A1_{BF16}$ may be used in place of $A_{FP32}$ in the tensor multiplication operation. Although FP32 and BF16 are described above, the precision reduction operator 360 may convert or generate data elements in other formats.
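The two conversion operations described above, removing the lower 16 bits of an FP32 value and padding a BF16 value with 16 zero bits, can be sketched in Python with NumPy. This is a sketch under stated assumptions rather than the disclosed implementation: the function names are hypothetical, BF16 values are held as raw `uint16` bit patterns because NumPy has no native BF16 type, and deriving the second element from the residual left after the first truncation is an assumption consistent with the definitions in the text.

```python
import numpy as np


def truncate_to_bf16(x):
    """(BF16) x: drop the lower 16 bits of each FP32 value."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)


def bf16_to_fp32(b):
    """(FP32) b: append 16 zero bits to each BF16 bit pattern."""
    return np.left_shift(b.astype(np.uint32), 16).view(np.float32)


def split_fp32(x):
    """Split FP32 values into two BF16 elements (assumed residual scheme)."""
    x = np.asarray(x, dtype=np.float32)
    x0 = truncate_to_bf16(x)
    residual = x - bf16_to_fp32(x0)   # part lost by the first truncation
    x1 = truncate_to_bf16(residual)
    return x0, x1


w = np.array([0.1, -3.14159, 1234.567], dtype=np.float32)
w0, w1 = split_fp32(w)
err_one = np.abs(w - bf16_to_fp32(w0))                        # W0 only
err_two = np.abs(w - (bf16_to_fp32(w0) + bf16_to_fp32(w1)))   # W0 + W1
```

Using both elements recovers more of the original mantissa than the first element alone, which is why a pair of BF16 values can stand in for one FP32 value when higher precision is needed.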

Multiplication operator 370 may be configured to perform tensor multiplication operations at different precision levels. Multiplication operator 370 may have multiple modes of operation corresponding to different levels of precision. For example, multiplication operator 370 may receive configuration parameters from DNN module 301. The configuration parameters may configure the multiplication operator 370 to operate in a corresponding mode.

Multiplication operator 370 may receive input from precision reduction operator 360 and calculate output. The output of multiplication operator 370 may be used as the product of the activation and the weights in subsequent deep learning operations in the DNN. In some embodiments, multiplication operator 370 includes one or more multipliers and one or more accumulators (also referred to as "adders") for performing multi-precision multiplications. The multipliers and accumulators may allow multiplication operator 370 to calculate the output using different algorithms or functions and thus facilitate different levels of precision. More details regarding the multi-precision tensor multiplication operator 350 are provided below in connection with fig. 8.

Example DNN Module

Fig. 4 is a block diagram of DNN module 400 according to various embodiments. DNN module 400 may be an embodiment of DNN module 301 of fig. 3. As shown in fig. 4, DNN module 400 includes an interface module 410, a training module 420, a compression module 430, a verification module 440, a multi-precision tuning module 450, and a data store (datastore) 460. In other embodiments, alternative configurations, different, or additional components may be included in DNN module 400. Furthermore, the functionality attributed to components of DNN module 400 can be implemented by different components or different modules or systems included in DNN module 400.

Interface module 410 facilitates communication of DNN module 400 with other modules or systems. For example, interface module 410 establishes communication between DNN module 400 and an external database to receive data that may be used to train the DNN or entered into the DNN to perform tasks. As another example, interface module 410 supports DNN module 400 to distribute DNNs to other systems, e.g., computing devices configured to apply DNNs to perform tasks.

The training module 420 trains the DNN by using a training data set. The training module 420 forms the training data set. In embodiments where training module 420 trains DNNs to identify objects in images, the training data set includes training images and training labels. The training labels describe ground-truth classifications of objects in the training images. In some embodiments, each label in the training data set corresponds to an object in a training image. In some embodiments, a portion of the training data set may be used to initially train the DNN, and the remainder of the training data set may be retained as a validation subset used by verification module 440 to verify the performance of the trained DNN. The portion of the training data set that does not include the tuning subset and the validation subset may be used to train the DNN.

The training module 420 also determines hyper-parameters for training the DNN. The hyper-parameters are variables that specify the DNN training process. The hyper-parameters are different from parameters inside the DNN (e.g., weights of filters). In some embodiments, the hyper-parameters include variables that determine the architecture of the DNN, such as the number of hidden layers. The hyper-parameters also include variables that determine how the DNN is trained, such as the batch size and the number of rounds (epochs). The batch size defines the number of training samples to work through before updating the parameters of the DNN. The batch size is the same as or smaller than the number of samples in the training data set. The training data set may be partitioned into one or more batches. The number of rounds defines how many times the entire training data set is passed forward and backward through the entire network, i.e., the number of times the deep learning algorithm works through the entire training data set. One round means that each training sample in the training data set has had an opportunity to update the parameters inside the DNN. A round may include one or more batches. The number of rounds may be 1, 5, 10, 50, 100, 500, 1000 or even more.
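The relationship between batch size, rounds, and parameter updates can be made concrete with a small Python sketch. The function name is hypothetical, and the sketch assumes that a final partial batch is kept rather than dropped.

```python
import math


def count_updates(num_samples, batch_size, num_rounds):
    """Return (batches per round, total parameter updates)."""
    if not 1 <= batch_size <= num_samples:
        raise ValueError("batch size must be between 1 and the data set size")
    # Assumption: the last, possibly smaller, batch still triggers an update.
    batches_per_round = math.ceil(num_samples / batch_size)
    return batches_per_round, batches_per_round * num_rounds


batches, updates = count_updates(num_samples=1000, batch_size=32, num_rounds=10)
```

For 1000 samples and a batch size of 32, each round contains 32 batches, so 10 rounds give 320 parameter updates.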

The training module 420 defines the architecture of the DNN, for example, based on some of the hyper-parameters. The architecture of the DNN includes an input layer, an output layer, and a plurality of hidden layers. The input layer of the DNN may include a tensor (e.g., a multi-dimensional array) that specifies attributes of the input image, such as the height of the input image, the width of the input image, and the depth of the input image (e.g., the number of bits that specify the color of the pixels in the input image). The output layer includes labels for objects in the input layer. The hidden layers are the layers between the input layer and the output layer. The hidden layers include one or more convolutional layers and one or more other types of layers, such as pooling layers, fully-connected layers, normalization layers, SoftMax or logistic layers, and the like. A convolutional layer of the DNN abstracts the input image into a feature map represented by a tensor that specifies the feature map height, feature map width, and feature map channels (e.g., a red, green, blue image includes 3 channels). A pooling layer reduces the spatial volume of the input image after convolution and is used between two convolutional layers. A fully-connected layer involves weights, biases, and neurons. It connects neurons in one layer to neurons in another layer and is used to classify images into different categories through training.

In defining the architecture of the DNN, training module 420 also adds an activation function to the hidden layer or output layer. The activation function of a layer transforms a weighted sum of the inputs of the layer into an output of the layer. The activation function may be, for example, a ReLU activation function, a tangent activation function, or other types of activation functions.
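A minimal Python sketch of how an activation function transforms a weighted sum of inputs into an output is shown here. The function names and the inclusion of a bias term are assumptions made for illustration.

```python
def relu(z):
    """ReLU activation: pass positive values through, clamp the rest to zero."""
    return max(0.0, z)


def neuron_output(inputs, weights, bias, activation):
    """Apply an activation function to the weighted sum of the inputs."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)


positive = neuron_output([1.0, 2.0], [0.5, 1.0], 0.0, relu)    # weighted sum 2.5
clamped = neuron_output([1.0, 2.0], [0.5, -1.0], 0.25, relu)   # weighted sum -1.25
```

Any other activation function taking a single number could be passed in place of `relu`.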

After training module 420 defines the architecture of the DNN, training module 420 inputs the training data set into the DNN. The training data set includes a plurality of training samples. An example training sample includes an object in an image and a ground-truth label for the object. Training module 420 modifies parameters internal to the DNN ("internal parameters of the DNN") to minimize the error between the label of the training object generated by the DNN and the ground-truth label of the object. The internal parameters include the weights of the filters in the convolutional layers of the DNN. In some embodiments, the training module 420 uses a cost function to minimize the error.

The training module 420 may train the DNN for a predetermined number of rounds. The number of rounds is a hyper-parameter defining the number of times the deep learning algorithm works through the entire training data set. One round means that each sample in the training data set has had an opportunity to update the internal parameters of the DNN. After training module 420 completes the predetermined number of rounds, training module 420 may stop updating parameters in the DNN. The DNN with the updated parameters is referred to as a trained DNN.

The compression module 430 compresses the DNN. For example, compression module 430 may add a pruning operation to a DNN layer to reduce computational complexity or memory usage. The pruning operation may prune the weight tensor of the DNN layer by changing one or more non-zero weights of the layer to zero. The modification may be done before, during or after training. The weights may be pruned during training, inference, or a combination of both. The compression module 430 may determine a sparsity ratio of the DNN layer. The sparsity ratio may be the ratio of the number of zero-valued weights in the layer to the total number of weights in the layer. The compression module 430 may perform pruning operations until the sparsity ratio of the DNN layer meets a target sparsity ratio, such as 10%, 20%, 30%, 40%, 50%, and the like.

In some embodiments, compression module 430 may select one or more layers in the DNN and modify each selected layer with a pruning operation. For example, the compression module 430 may select a computationally complex layer, such as a layer with a large filter. For a pruning operation on a layer or a type of layer, compression module 430 may determine a weight threshold that will not cause the accuracy loss of the DNN to exceed an accuracy loss constraint. The pruning operation may set weights having absolute values below the weight threshold to zero and leave the other weights unchanged. Weight pruning may reduce memory storage because zero-valued weights need not be stored. Furthermore, the number of operations in the layer may be reduced, since computations on zero-valued weights may be skipped without affecting the output of the layer. In some embodiments, compression module 430 may also measure the power savings, final DNN accuracy, or layer-by-layer sparsity resulting from the pruning operation.
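Threshold-based pruning and the sparsity ratio can be sketched in Python with NumPy as follows. The function names and the sample weights are hypothetical; the sketch follows the usual magnitude-pruning convention in which weights whose absolute value falls below the threshold are set to zero.

```python
import numpy as np


def prune_by_threshold(weights, threshold):
    """Zero out weights whose magnitude is below the threshold.

    Returns the pruned weights and a boolean mask that is True for kept weights.
    """
    keep_mask = np.abs(weights) >= threshold
    return weights * keep_mask, keep_mask


def sparsity_ratio(weights):
    """Ratio of zero-valued weights to the total number of weights."""
    return np.count_nonzero(weights == 0) / weights.size


layer_weights = np.array([0.05, -0.8, 0.3, -0.02, 0.6, 0.1, -0.4, 0.01])
pruned, keep_mask = prune_by_threshold(layer_weights, threshold=0.2)
ratio = sparsity_ratio(pruned)
```

In this example four of the eight weights fall below the threshold, so the layer reaches a sparsity ratio of 50%.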

After compressing the DNN, compression module 430 may fine-tune the DNN, for example, through a retraining process. The compression module 430 may fine-tune the DNN after the weights are pruned. In some embodiments, the fine-tuning process is a retraining or further training process. For example, after pruning the weights in the DNN, compression module 430 may further train the DNN by inputting the training data set into the DNN. The values of the unpruned weights in the DNN may be modified based on the output of the DNN and the ground-truth labels of the training samples in the training data set. In some embodiments, the values of the pruned weights (i.e., zero) do not change during the fine-tuning process. For example, compression module 430 may place a mask over the pruned weight blocks, and the mask may prevent values in the pruned weight blocks from being changed during the fine-tuning process. In other embodiments, the values of all weights, including the pruned weights, may change during the fine-tuning process. After one or more cycles of retraining and weight changes by the compression module 430, the compression module 430 may perform a new pruning process, for example, by selecting weight blocks and pruning the selected weight blocks. In some embodiments, the weight pruning process may be repeated multiple times before the fine-tuning process is completed.
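The effect of a mask that keeps pruned weights at zero during retraining can be illustrated with a short NumPy sketch. The plain gradient-descent update, the learning rate, and the function name are assumptions made for illustration; the text does not specify the optimizer.

```python
import numpy as np


def masked_update(weights, gradients, keep_mask, learning_rate=0.1):
    """One gradient-descent step that leaves pruned (masked-out) weights at zero."""
    # keep_mask is 1 for unpruned weights and 0 for pruned weights, so the
    # update term vanishes wherever a weight was pruned.
    return weights - learning_rate * gradients * keep_mask


weights = np.array([0.0, 0.5, -0.3])     # first weight was pruned
keep_mask = np.array([0.0, 1.0, 1.0])
gradients = np.array([1.0, 1.0, 1.0])
updated = masked_update(weights, gradients, keep_mask)
```

Dropping the mask (or passing a mask of all ones) corresponds to the other embodiments in which pruned weights are also allowed to change.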

In some embodiments, the number of rounds in the fine-tuning process may be different from the number of rounds in the training process that determined the pre-pruning values of the weights. For example, the fine-tuning process may have fewer rounds than the training process. In an example, the number of rounds in the fine-tuning process may be relatively small, such as 2, 3, 4, 5, and the like.

The verification module 440 verifies the accuracy of the trained or compressed DNN. In some embodiments, the verification module 440 inputs samples in a validation data set into the trained DNN and uses the output of the DNN to determine model accuracy. In some embodiments, the validation data set may be formed from some or all of the samples in the training data set. Additionally or alternatively, the validation data set includes samples other than those in the training data set. In some embodiments, verification module 440 may determine an accuracy score that measures the precision, recall, or a combination of precision and recall of the DNN. The verification module 440 may use the measures precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP denotes true positives, FP denotes false positives, and FN denotes false negatives. Precision is the fraction of the model's positive predictions (TP + FP) that are correct (TP), and recall is the fraction of the objects that do have the property in question (TP + FN) that the model correctly predicts (TP). The F-score (F-score = 2 x P x R / (P + R), where P is precision and R is recall) unifies precision and recall into a single measure.
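These measures can be computed directly from the counts of true positives, false positives, and false negatives, as in the following Python sketch. The function name and the sample counts are hypothetical.

```python
def classification_scores(tp, fp, fn):
    """Return (precision, recall, F-score) from prediction counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# 8 correct detections, 2 false alarms, 8 missed objects
precision, recall, f_score = classification_scores(tp=8, fp=2, fn=8)
```

Here precision is 0.8 and recall is 0.5, and the F-score falls between the two, closer to the lower value.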

The verification module 440 may compare the accuracy score to a threshold score. In an example where verification module 440 determines that the accuracy score of the DNN is less than the threshold score, verification module 440 instructs training module 420 to retrain the DNN. In one embodiment, training module 420 may iteratively retrain the DNN until a stop condition occurs, such as an accuracy measurement indicating that the DNN is sufficiently accurate, or a certain number of training rounds having taken place.

The multi-precision tuning module 450 may determine the precision level of the deep learning operations in a DNN layer. In some embodiments, the multi-precision tuning module 450 may select a precision level for the DNN layer from a plurality of predetermined precision levels. The multi-precision tuning module 450 may also generate configuration parameters that indicate the selected precision level. The multi-precision tuning module 450 may also send the configuration parameters to a computation block (e.g., computation block 330) so that the computation block executes the DNN layer at the selected precision level. For example, the computation block may execute the DNN layer in an operational mode configured by the configuration parameters. In executing the DNN layer at the selected precision level, one or more tensor multiplication operations in the DNN layer are performed at the selected precision level. For example, the product of two data elements (e.g., the product of an activation and a weight) may be approximated at the selected precision level during a tensor multiplication operation.

In some embodiments, the multi-precision tuning module 450 may determine the precision level of a DNN layer based on one or more criteria. The one or more criteria may include a target accuracy of an output of the DNN layer, a target accuracy of an output of the DNN, an estimated consumption of one or more computing resources (e.g., power, time, memory storage, memory bandwidth, computation, etc.) for executing the DNN layer, an estimated consumption of one or more computing resources for executing the DNN, other criteria, or some combination thereof. In some embodiments, the multi-precision tuning module 450 may perform a search using a graph representing the DNN to determine the precision levels of the layers in the DNN. The graph may include nodes representing layers, and links (also referred to as "edges") between the nodes may represent relationships between the layers. For example, a layer is connected to another layer that receives the output of that layer. A node may encode the precision level of the corresponding layer. For example, the state of a node may indicate the precision level of that layer. The node may also encode other attributes of the layer, such as the target accuracy of the layer, the amount of computation in the layer, the estimated consumption of computing resources for executing the layer, and the like. The multi-precision tuning module 450 may update the states of the nodes until the corresponding search criteria are met. The node states that satisfy the search criteria may encode the best precision levels for the layers. By using graph searching, the multi-precision tuning module 450 may enable automatic optimization of multi-precision tensor multiplication operations. Certain aspects of the multi-precision tuning module 450 are described below in connection with fig. 5.

Data store 460 stores data that is received, generated, used, or otherwise associated with DNN module 400. For example, data store 460 stores data sets used by training module 420 and verification module 440. Data store 460 may also store data generated by training module 420 and verification module 440, such as hyper-parameters for training DNNs, internal parameters of trained DNNs (e.g., weights, etc.), data for sparsity acceleration (e.g., sparsity bitmaps, etc.), and the like. Data store 460 may store configuration parameters and graphs generated by multi-precision tuning module 450. In the embodiment of fig. 4, data store 460 is a component of DNN module 400. In other embodiments, data store 460 may be external to DNN module 400 and in communication with DNN module 400 via a network.

Fig. 5 is a block diagram of a multi-precision tuning module 450 according to various embodiments. The multi-precision tuning module 450 includes a graph generator 510, a graph compiler 520, a search criteria module 530, a search module 540, and an output module 550. In other embodiments, alternative configurations, different or additional components may be included in the multi-precision tuning module 450. Furthermore, the functionality of the components attributed to multi-precision tuning module 450 can be implemented by different modules, devices, or systems.

Graph generator 510 generates a graph representing a DNN. A graph is a data structure that includes a set of nodes and one or more edges. An edge is a connection between two nodes. Each node may encode a layer of the DNN represented by the graph. Each edge linking two nodes may encode a connection between the layers encoded by those nodes. For example, an edge between two nodes may encode a data flow between the two layers encoded by the two nodes.

In some embodiments, graph generator 510 may generate a graph of the DNN that includes nodes representing all layers in the DNN. In other embodiments, graph generator 510 may select a subset of the layers in the DNN and generate a graph that includes nodes representing the selected layers. The graph may not have nodes representing unselected layers in the DNN. For example, graph generator 510 may select the layers that include a tensor multiplication operation. Layers that do not include tensor multiplication operations may not be selected. In other embodiments, graph generator 510 may generate a graph that also includes nodes representing layers without tensor multiplication operations. However, a node representing a layer without a tensor multiplication operation may encode different information from a node representing a layer with a tensor multiplication operation. For example, a node representing a layer without a tensor multiplication operation may not encode a precision level for the corresponding layer.
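A graph of this kind can be sketched with two small Python data classes. The class names `LayerNode` and `LayerGraph`, the field names, and the use of integers for precision levels are hypothetical choices made for illustration; the source does not prescribe a concrete data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class LayerNode:
    name: str
    has_tensor_mul: bool
    # Node state: the precision level of the layer. It stays None for a
    # layer without a tensor multiplication operation.
    precision_level: Optional[int] = None


@dataclass
class LayerGraph:
    nodes: Dict[str, LayerNode] = field(default_factory=dict)
    edges: List[Tuple[str, str]] = field(default_factory=list)

    def add_layer(self, name: str, has_tensor_mul: bool,
                  precision_level: Optional[int] = None) -> None:
        level = precision_level if has_tensor_mul else None
        self.nodes[name] = LayerNode(name, has_tensor_mul, level)

    def connect(self, src: str, dst: str) -> None:
        """Add an edge: the output of layer src feeds layer dst."""
        self.edges.append((src, dst))

    def tunable_layers(self) -> List[str]:
        """Names of the layers whose precision level can be searched."""
        return [n.name for n in self.nodes.values() if n.has_tensor_mul]


graph = LayerGraph()
graph.add_layer("conv1", has_tensor_mul=True, precision_level=3)
graph.add_layer("relu1", has_tensor_mul=False)
graph.add_layer("fc1", has_tensor_mul=True, precision_level=3)
graph.connect("conv1", "relu1")
graph.connect("relu1", "fc1")
print(graph.tunable_layers())
```

Here the activation layer is kept in the graph to preserve the data flow, but only the two layers with tensor multiplication carry a precision level.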

The graph compiler 520 may replace the tensor multiplication operations in the graph with multi-precision tensor multiplication operators. The graph may be generated by graph generator 510. An example multi-precision tensor multiplication operator is multi-precision tensor multiplication operator 800 in fig. 8. Through this replacement, graph compiler 520 may transform the graph representing the DNN into a temporary graph based on multi-precision tensor multiplication. The graph compiler 520 may assign an initial precision level to each tensor multiplication operation in the temporary graph. The temporary graph defines a search space for searching for a desired precision level for each tensor multiplication operation in the temporary graph.

The search criteria module 530 determines criteria for performing a search to find a desired level of precision for the tensor multiplication operation. Such criteria may be referred to as search criteria or search thresholds. In some embodiments, the search criteria module 530 determines the search threshold based on the accuracy of one or more tensor multiplication operations or the accuracy of the DNN. The accuracy may be a target accuracy, such as a desired accuracy. The search criteria module 530 may further determine the search threshold based on other factors, such as estimated consumption of computing resources for performing tensor multiplication operations, and the like.

The search module 540 performs a search based on the search criteria determined by the search criteria module 530 to find a desired precision level for each tensor multiplication operation. In some embodiments, search module 540 performs the search in a search space defined by the graph compiled by graph compiler 520. In some embodiments, the search module 540 may determine a search algorithm. Examples of search algorithms include particle swarm optimization (PSO) algorithms, genetic algorithms, differential evolution algorithms, and the like. The search module 540 may iteratively explore different combinations of precision levels (e.g., by iteratively updating the states of nodes in the graph) with the goal of finding an optimal precision level setting that meets the search criteria while minimizing consumption of computing resources, e.g., minimizing inference time (i.e., maximizing inference speed).
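The iterative exploration can be illustrated with a short Python sketch. For brevity it swaps in a plain greedy search rather than any of the algorithms named above, and the `evaluate` callback, the integer levels, and the toy accuracy and cost figures are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Setting = Dict[str, int]


def greedy_precision_search(
    layers: List[str],
    levels: List[int],  # ordered from highest precision to lowest
    evaluate: Callable[[Setting], Tuple[float, float]],  # -> (accuracy, cost)
    target_accuracy: float,
) -> Setting:
    """Lower per-layer precision while the accuracy target is still met."""
    setting = {layer: levels[0] for layer in layers}
    _, cost = evaluate(setting)
    improved = True
    while improved:
        improved = False
        for layer in layers:
            idx = levels.index(setting[layer])
            if idx + 1 >= len(levels):
                continue  # already at the lowest level
            trial = dict(setting)
            trial[layer] = levels[idx + 1]
            trial_accuracy, trial_cost = evaluate(trial)
            if trial_accuracy >= target_accuracy and trial_cost < cost:
                setting, cost = trial, trial_cost
                improved = True
    return setting


# Toy model (made-up numbers): accuracy lost and cost paid per layer and level.
PENALTY = {"conv1": {3: 0.0, 2: 0.001, 1: 0.02},
           "fc1": {3: 0.0, 2: 0.002, 1: 0.004}}
COST = {3: 3.0, 2: 2.0, 1: 1.0}


def toy_evaluate(setting: Setting) -> Tuple[float, float]:
    accuracy = 0.95 - sum(PENALTY[name][lv] for name, lv in setting.items())
    cost = sum(COST[lv] for lv in setting.values())
    return accuracy, cost


best = greedy_precision_search(["conv1", "fc1"], [3, 2, 1], toy_evaluate, 0.94)
print(best)
```

In the toy model the search keeps `conv1` at the middle level because dropping it further would violate the accuracy target, while `fc1` tolerates the lowest level. A population-based method such as PSO would explore the same search space with many candidate settings at once.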

The output module 550 may output the results of the search performed by the search module 540. In some embodiments, the output module 550 may generate the configuration parameters based on the optimal precision levels for the layers. Output module 550 may provide the configuration parameters to a compiler that compiles the DNN, or to a DNN accelerator that performs DNN inference (e.g., DNN accelerator 302). In some embodiments, the result of the search may be a computational graph with the best precision level setting. The graph can be used for DNN inference. With the best precision level setting, DNN inference can be more efficient, which can enhance the performance of the DNN accelerator (e.g., DNN accelerator 302) that executes the DNN.

Example graph representing a DNN

Fig. 6 illustrates an example graph 600 in accordance with various embodiments. Graph 600 is a data structure that includes a set of nodes 610A-610P (collectively, "multiple nodes 610" or "nodes 610"). The lines linking nodes 610 indicate connections between nodes 610. The connections in graph 600 are referred to as edges. The nodes 610 and edges in FIG. 6 are shown for purposes of illustration. In other embodiments, graph 600 may include a different number of nodes or different edges.

The graph 600 may be used to represent a DNN. In an example, a node 610 may represent a layer or a deep learning operation in the DNN. The deep learning operation may be or include a tensor multiplication operation. Edges may indicate relationships between layers or deep learning operations in the DNN. A node 610 may be associated with an embedding that encodes information about the layer or the deep learning operation, such as a precision level. The graph may be generated by graph generator 510. The graph may also be compiled by graph compiler 520. In some embodiments, the graph may represent an optimal precision level setting for the DNN, e.g., an optimal combination of precision levels for the various tensor multiplication operations in the DNN. The graph can be used to run inference with the DNN.

Example data elements of different precision

Fig. 7 illustrates data elements 710 and 720 of different precision in accordance with various embodiments. The data element 710 or 720 may be an activation in an input tensor of a DNN layer or a weight in a weight tensor of the DNN layer. The data elements 710 and 720 have different precisions and thus require different numbers of bits to store.

In the embodiment of fig. 7, the data element 710 has a lower precision, e.g., BF16 precision, and has 16 bits. The first bit (represented by "S" in FIG. 7) indicates the sign of the data element 710. The next eight bits (represented by "E" in FIG. 7) indicate the exponent of the data element 710. The next seven bits (represented by "M" in fig. 7) indicate the mantissa of the data element 710. The data element 720 has a higher precision, such as FP32 precision, and has 32 bits. The first bit (represented by "S" in FIG. 7) indicates the sign of the data element 720. The next eight bits (represented by "E" in FIG. 7) indicate the exponent of the data element 720. The next 23 bits (represented by "M" in fig. 7) indicate the mantissa of the data element 720. The data element 720 has 16 more mantissa bits than the data element 710.
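The two layouts can be inspected with Python's standard `struct` module. This is a small illustrative sketch; the helper names `fp32_fields` and `bf16_fields` are made up here, and the BF16 word is assumed to be the upper 16 bits of the FP32 word.

```python
import struct
from typing import Tuple


def fp32_fields(x: float) -> Tuple[int, int, int]:
    """Return (sign, exponent, mantissa) of x encoded as FP32: 1 + 8 + 23 bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def bf16_fields(x: float) -> Tuple[int, int, int]:
    """Return (sign, exponent, mantissa) of x encoded as BF16: 1 + 8 + 7 bits.

    Assumption: the BF16 word is the upper 16 bits of the FP32 word.
    """
    bits16 = struct.unpack(">I", struct.pack(">f", x))[0] >> 16
    return bits16 >> 15, (bits16 >> 7) & 0xFF, bits16 & 0x7F


print(fp32_fields(-1.5))  # sign 1, biased exponent 127, mantissa 0x400000
print(bf16_fields(-1.5))  # sign 1, biased exponent 127, mantissa 0x40
```

Both formats share the sign bit and the eight exponent bits, so they cover the same dynamic range; only the mantissa width differs.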

The data element 720 may be converted to one or more data elements with lower precision to increase the speed of DNN inference and to reduce memory space and memory bandwidth usage. In an example, the data element 720 may be converted into one or more data elements having the lower precision of the data element 710. For example, the 16 additional mantissa bits enclosed in the dashed oval in fig. 7 may be removed. Conversely, data element 710 may be converted to the precision of data element 720, for example, by appending 16 additional bits to the mantissa. The 16 additional bits may be zeros. In some embodiments, this conversion from the lower precision back to the higher precision may be used, when reducing the precision of data element 720, to compute a further lower-precision data element.
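The two conversions, and one way they might combine to yield a second lower-precision element, can be sketched as follows. The residual-based split in `split_fp32` is an assumption about how the further element is derived, and all function names are hypothetical.

```python
import struct
from typing import Tuple


def to_bf16(x: float) -> int:
    """FP32 -> BF16 by dropping the 16 low mantissa bits (truncation)."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16


def from_bf16(h: int) -> float:
    """BF16 -> FP32 by appending 16 zero bits to the mantissa."""
    return struct.unpack(">f", struct.pack(">I", (h & 0xFFFF) << 16))[0]


def split_fp32(x: float) -> Tuple[int, int]:
    """Split an FP32 value into two BF16 words (hi, lo).

    Assumption: lo is the truncated residual left after widening hi back
    to FP32 and subtracting it from x.
    """
    hi = to_bf16(x)
    lo = to_bf16(x - from_bf16(hi))
    return hi, lo


x = 1.0 + 2 ** -10        # the 2**-10 term does not fit in 7 mantissa bits
hi, lo = split_fp32(x)
print(from_bf16(hi), from_bf16(lo), from_bf16(hi) + from_bf16(lo) == x)
```

In this example the first element keeps 1.0 and the second recovers the dropped 2**-10, so the pair reproduces the original value exactly. In general the pair retains roughly 16 of the 24 significant bits of an FP32 value.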

Example Multi-precision tensor multiplication operator

FIG. 8 illustrates an example multi-precision tensor multiplication operator 800, according to various embodiments. The multi-precision tensor multiplication operator 800 may be an operator in a computation block, such as computation block 330 in fig. 3. The multi-precision tensor multiplication operator 800 may perform tensor multiplication at various precision levels. In the embodiment of fig. 8, the multi-precision tensor multiplication operator 800 includes a precision reduction operator 810A, another precision reduction operator 810W, and three multiplication operators 820A-820C (collectively "multiple multiplication operators 820" or "multiplication operators 820").

In other embodiments, alternative configurations, different or additional components may be included in the multi-precision tensor multiplication operator 800. For example, the multi-precision tensor multiplication operator 800 may include a different number of precision reduction operators or a different number of multiplication operators. Furthermore, the functionality attributed to a component of the multi-precision tensor multiplication operator 800 may be implemented by a different component or a different module, device, or system included in the multi-precision tensor multiplication operator 800.

In a round of computation, the precision reduction operator 810W receives the data element 801, and the precision reduction operator 810A receives the data element 802. The data element 801 may be a weight. The data element 802 may be an activation. The data elements 801 and 802 may have the same precision. For illustration purposes, both data elements 801 and 802 are in the FP32 format in fig. 8. The precision reduction operator 810W converts the data element 801 into two new data elements 803 and 804. The two data elements 803 and 804 have a lower precision than the data element 801. In fig. 8, data elements 803 and 804 are in BF16 format. Similarly, the precision reduction operator 810A converts the data element 802 into two new data elements 805 and 806. The two data elements 805 and 806 have a lower precision than the data element 802. In fig. 8, data elements 805 and 806 are in BF16 format.

The precision reduction operators 810W and 810A may be examples of the precision reduction operator 360 in fig. 3. Although fig. 8 shows two precision reduction operators, in other embodiments, the multi-precision tensor multiplication operator 800 may include a single precision reduction operator. For example, the multi-precision tensor multiplication operator 800 may include one precision reduction operator that reduces the precision of activations during DNN inference. The precision reduction of the weights may be performed by a different module or system, for example by DNN module 301 prior to DNN inference.

The data elements 803, 804, 805, and 806 are input into one of the multiplication operators 820 to calculate the product 807 of the data element 801 and the data element 802. The product 807 may not be the true product of the data element 801 and the data element 802. Instead, the product 807 may be an approximation of the true product. The three multiplication operators 820 may correspond to different precision levels. For example, multiplication operator 820A may calculate an approximate product with higher accuracy than the approximate products calculated by the other two multiplication operators 820B and 820C. Multiplication operator 820B may calculate an approximate product with higher accuracy than the approximate product calculated by multiplication operator 820C.

Three multiplication operators 820 may use different algorithms to calculate the product 807. In an example, multiplication operator 820A may use an algorithm or function expressed as:

multiplication operator 820B may use an algorithm or function expressed as:

multiplication operator 820C may use an algorithm or function expressed as:

In this way, the product 807 calculated by the three multiplication operators 820 will have different accuracies. In other embodiments, multiplication operator 820 may use different algorithms or functions to calculate the product. Each multiplication operator 820 may include one or more multipliers and one or more adders for calculating the product 807. In some embodiments, the multipliers or adders may be shared by some or all of the multiplication operators 820.
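The expressions for the three operators are not reproduced in this text. As a purely illustrative sketch, the Python code below assumes a term-dropping scheme: each FP32 operand is split into a high part and a low part (W0, W1 and A0, A1), and the three levels keep four, three, or one of the partial products. These formulas, the level names, and the helper names are assumptions, not the functions of multiplication operators 820A-820C.

```python
import struct
from typing import Callable, Dict, Tuple

Pair = Tuple[float, float]


def truncate_to_bf16(x: float) -> float:
    """Round x to FP32, clear the 16 low bits, and return the value."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def split(x: float) -> Pair:
    """Split x into a BF16 high part and a BF16 residual part."""
    hi = truncate_to_bf16(x)
    return hi, truncate_to_bf16(x - hi)


def mul_four_terms(w: Pair, a: Pair) -> float:
    return w[0] * a[0] + w[0] * a[1] + w[1] * a[0] + w[1] * a[1]


def mul_three_terms(w: Pair, a: Pair) -> float:
    return w[0] * a[0] + w[0] * a[1] + w[1] * a[0]  # drops the low*low term


def mul_one_term(w: Pair, a: Pair) -> float:
    return w[0] * a[0]  # keeps only the high*high term


# Hypothetical mapping from a precision level to its function.
MULTIPLIERS: Dict[str, Callable[[Pair, Pair], float]] = {
    "high": mul_four_terms,
    "medium": mul_three_terms,
    "low": mul_one_term,
}


def approx_product(weight: float, activation: float, level: str) -> float:
    return MULTIPLIERS[level](split(weight), split(activation))


w, a = 1.0 + 2 ** -10, 1.0 + 2 ** -9
for level in ("high", "medium", "low"):
    print(level, abs(w * a - approx_product(w, a, level)))
```

Python evaluates the partial products in double precision, whereas hardware would typically accumulate in FP32, so the error figures are indicative only. The ordering of the errors is the point: fewer partial products mean fewer multiplications and a coarser approximation.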

Example data locality and reuse

FIG. 9 illustrates locality and reuse of data in a multi-precision tensor multiplication operation, according to various embodiments. The multi-precision tensor multiplication operation in fig. 9 may be part of a convolution or another type of deep learning operation. The multi-precision tensor multiplication operation has as inputs an activation tensor 910 and a weight tensor 920. For simplicity and illustration, each of the activation tensor 910 and the weight tensor 920 is a 4 × 4 matrix. The activation tensor 910 includes 16 activations, each of which is converted to two lower-precision activations (A0 and A1). The weight tensor 920 includes 16 weights, each of which is converted to two lower-precision weights (W0 and W1). The precision of the lower-precision activations may be the same as the precision of the lower-precision weights.

In some embodiments, the lower-precision activations converted from some or all of the activations in the activation tensor 910 may be stored in a memory, for example, in a register file. The register file may be in local memory 340. The lower-precision weights converted from some or all of the weights in the weight tensor 920 may be stored in another memory, such as another register file in the local memory 340. Storing lower-precision activations converted from multiple activations in the same register file at the same time, or storing lower-precision weights converted from multiple weights in the same register file at the same time, may increase data locality and data reuse. As represented by the arrows in fig. 9, each activation in the activation tensor 910 may be multiplied by multiple (or even all) weights in the weight tensor 920. Similarly, each weight in the weight tensor 920 may be multiplied by multiple (or even all) activations in the activation tensor 910. When multiple activations (or weights) are stored in the same register file at the same time, the weights (or activations) to be multiplied by them may be used more than once. Data locality and data reuse may improve the efficiency of DNN inference, as fewer data transfer operations are required.
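The reuse pattern can be sketched in Python by converting every element once and then reusing the stored pairs across all products of a matrix multiplication. The plain lists standing in for register files, the four-term product, and the function names are assumptions for illustration only.

```python
import struct
from typing import List, Tuple

Matrix = List[List[float]]


def truncate_to_bf16(x: float) -> float:
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def split(x: float) -> Tuple[float, float]:
    hi = truncate_to_bf16(x)
    return hi, truncate_to_bf16(x - hi)


def matmul_with_reuse(acts: Matrix, weights: Matrix) -> Tuple[Matrix, int, int]:
    """Approximate acts @ weights; returns (output, conversions, products)."""
    # Convert each element once and keep the (hi, lo) pairs resident,
    # standing in for the two register files.
    a_split = [[split(v) for v in row] for row in acts]
    w_split = [[split(v) for v in row] for row in weights]
    conversions = sum(len(r) for r in acts) + sum(len(r) for r in weights)

    n, k, m = len(acts), len(weights), len(weights[0])
    out = [[0.0] * m for _ in range(n)]
    products = 0
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for t in range(k):
                a0, a1 = a_split[i][t]   # reused for every column j
                w0, w1 = w_split[t][j]   # reused for every row i
                acc += a0 * w0 + a0 * w1 + a1 * w0 + a1 * w1
                products += 1
            out[i][j] = acc
    return out, conversions, products


acts = [[1.0 + 2 ** -10] * 4 for _ in range(4)]
identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
out, conversions, products = matmul_with_reuse(acts, identity)
print(conversions, products)
```

For two 4 × 4 tensors, 32 conversions serve 64 approximate products, so each converted pair is used four times. Converting operands on the fly inside the inner loop would instead require 128 conversions.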

Example method of performing tensor multiplication operations

Fig. 10 is a flow diagram illustrating a method 1000 of performing a tensor multiplication operation in a DNN, according to various embodiments. Method 1000 may be performed by DNN system 300 in fig. 3. Although method 1000 is described with reference to the flowchart shown in FIG. 10, many other methods for performing tensor multiplication operations in DNNs may alternatively be used. For example, the order of execution of the steps in fig. 10 may be changed. As another example, some steps may be changed, eliminated, or combined.

DNN system 300 selects 1010 a precision level for a layer in DNN based on the target accuracy of the DNN. The layer includes a multiplication operation on a first tensor and a second tensor of the neural network. The first tensor includes a first data element. The second tensor includes a second data element. In some embodiments, the layer is a convolution layer, the first tensor is an input tensor of the convolution layer, the second tensor comprises a kernel of the convolution layer, and the tensor multiplication operation is part of the convolution. In other embodiments, the layer is a linear layer.

In some embodiments, DNN system 300 selects the precision level for the layer by generating a graph representing at least a portion of the DNN. The graph includes nodes and one or more links between the nodes. The nodes represent layers in the DNN. The one or more links represent one or more connections between the layers. DNN system 300 uses the graph to perform a search to select the precision level for the layer. In some embodiments (e.g., embodiments where DNN system 300 selects the precision level from a plurality of predetermined precision levels), nodes in the graph have states that encode the plurality of predetermined precision levels.

In some embodiments, DNN system 300 determines one or more criteria for the search based on the target accuracy of the DNN. DNN system 300 updates the states of the nodes in the graph until the one or more criteria are met. In some embodiments, the one or more criteria for the search are further determined based on an estimated consumption of one or more computing resources for executing the DNN.

DNN system 300 converts 1020 the first data element into a first pair of data elements. The first data element has a first precision. The first pair of data elements has a second precision. In some embodiments, the first precision is higher than the second precision. In some embodiments, DNN system 300 converts the first data element into the first pair of data elements by removing one or more bits from the first data element. In some embodiments, one data element of the first pair is computed by removing one or more bits from the first data element, and the other data element of the first pair is computed based on that data element.

DNN system 300 converts 1030 the second data element into a second pair of data elements. The second data element has the first precision. The second pair of data elements has the second precision. In some embodiments, DNN system 300 converts the second data element into the second pair of data elements by removing one or more bits from the second data element. In some embodiments, one data element of the second pair is computed by removing one or more bits from the second data element, and the other data element of the second pair is computed based on that data element.

DNN system 300 identifies 1040 a function from a plurality of functions based on the selected level of precision of the layer. In some embodiments, DNN system 300 selects a precision level from a plurality of predetermined precision levels. The plurality of predetermined precision levels correspond to different functions for calculating an approximate product of the first data element and the second data element. The approximate products have different accuracies. In some embodiments, DNN system 300 selects a different one of a plurality of predetermined levels of precision as the level of precision for another layer in the DNN.

DNN system 300 calculates 1050 an output data element for the layer by applying the identified function to the first pair of data elements and the second pair of data elements. The output data element has the first precision. In some embodiments, DNN system 300 uses different functions to calculate output data elements when the selected precision levels are different.

DNN system 300 executes 1060 another layer in the DNN using the output data element. In some embodiments, the output data element is an approximation of the product of the first data element and the second data element.

Example computing device

FIG. 11 is a block diagram of an example computing device 1100, according to various embodiments. In some embodiments, computing device 1100 may be used as at least a portion of DNN system 300. Various components are shown in fig. 11 as being included in computing device 1100, but any one or more of these components may be omitted or duplicated as appropriate for the application. In some embodiments, some or all of the components included in computing device 1100 may be attached to one or more motherboards. In some embodiments, some or all of these components are fabricated onto a single system-on-a-chip (SoC) die. Additionally, in various embodiments, computing device 1100 may not include one or more of the components shown in fig. 11, but computing device 1100 may include interface circuitry for coupling to one or more of the components. For example, computing device 1100 may not include display device 1106, but may include display device interface circuitry (e.g., connectors and driver circuitry) to which display device 1106 may be coupled. In another set of examples, computing device 1100 may not include audio input device 1118 or audio output device 1108, but may include audio input or output device interface circuitry (e.g., connectors and support circuitry) to which audio input device 1118 or audio output device 1108 may be coupled.

Computing device 1100 can include a processing device 1102 (e.g., one or more processing devices). The processing device 1102 processes electronic data from registers and/or memory to transform the electronic data into other electronic data that may be stored in registers and/or memory. Computing device 1100 can include memory 1104, which can itself include one or more memory devices, such as volatile memory (e.g., DRAM), non-volatile memory (e.g., Read-Only Memory (ROM)), High Bandwidth Memory (HBM), flash memory, solid state memory, and/or a hard disk drive. In some embodiments, memory 1104 may include memory that shares a die with processing device 1102. In some embodiments, memory 1104 includes one or more non-transitory computer-readable media storing instructions executable to perform operations for accelerating deep learning operations, such as method 1000 described above in connection with fig. 10 or some of the operations performed by DNN system 300 described above in connection with figs. 3-5. Instructions stored in the one or more non-transitory computer-readable media may be executed by the processing device 1102.

In some embodiments, computing device 1100 may include a communication chip 1112 (e.g., one or more communication chips). For example, the communication chip 1112 may be configured to manage wireless communications for transferring data to and from the computing device 1100. The term "wireless" and its derivatives may be used to describe circuits, devices, systems, methods, techniques, communications channels, etc., that may communicate data through the use of modulated electromagnetic radiation through a non-solid medium. The term does not mean that the associated devices do not contain any wires, although in some embodiments they may not.

The communication chip 1112 may implement any of a number of wireless standards or protocols, including but not limited to Institute of Electrical and Electronics Engineers (IEEE) standards including Wi-Fi (IEEE 802.11 family), IEEE 802.16 standards (e.g., IEEE 802.16-2005 Amendment), and the Long-Term Evolution (LTE) project, along with any amendments, updates, and/or revisions (e.g., the LTE-Advanced project, the Ultra Mobile Broadband (UMB) project (also referred to as "3GPP2"), etc.). IEEE 802.16 compatible Broadband Wireless Access (BWA) networks are generally referred to as WiMAX networks, an acronym that stands for Worldwide Interoperability for Microwave Access, which is a certification mark for products that pass conformity and interoperability tests for the IEEE 802.16 standards. The communication chip 1112 may operate in accordance with a Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Universal Mobile Telecommunications System (UMTS), High Speed Packet Access (HSPA), Evolved HSPA (E-HSPA), or LTE network. The communication chip 1112 may operate in accordance with Enhanced Data for GSM Evolution (EDGE), GSM EDGE Radio Access Network (GERAN), Universal Terrestrial Radio Access Network (UTRAN), or Evolved UTRAN (E-UTRAN). The communication chip 1112 may operate in accordance with Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Digital Enhanced Cordless Telecommunications (DECT), Evolution-Data Optimized (EV-DO), and derivatives thereof, as well as any other wireless protocols that are designated as 3G, 4G, 5G, and beyond. In other embodiments, communication chip 1112 may operate in accordance with other wireless protocols. The computing device 1100 can include an antenna 1122 to facilitate wireless communications and/or to receive other wireless communications (such as AM or FM radio transmissions).

In some embodiments, communication chip 1112 may manage wired communications, such as electrical, optical, or any other suitable communication protocols (e.g., Ethernet). As described above, the communication chip 1112 may include a plurality of communication chips. For example, a first communication chip 1112 may be dedicated to shorter-range wireless communications such as Wi-Fi or Bluetooth, and a second communication chip 1112 may be dedicated to longer-range wireless communications such as Global Positioning System (GPS), EDGE, GPRS, CDMA, WiMAX, LTE, EV-DO, or others. In some embodiments, a first communication chip 1112 may be dedicated to wireless communications, and a second communication chip 1112 may be dedicated to wired communications.

Computing device 1100 can include battery/power circuit 1114. The battery/power circuit 1114 may include one or more energy storage devices (e.g., batteries or capacitors) and/or circuitry for coupling components of the computing device 1100 to an energy source (e.g., AC line power) separate from the computing device 1100.

Computing device 1100 can include a display device 1106 (or corresponding interface circuitry, as described above). Display device 1106 may include any visual indicator such as, for example, a heads-up display, a computer monitor, a projector, a touch screen display, a Liquid Crystal Display (LCD), a light emitting diode display, or a flat panel display.

Computing device 1100 can include an audio output device 1108 (or corresponding interface circuitry, as described above). The audio output device 1108 may include any device that generates audible indicators, such as speakers, headphones, or earphones, for example.

The computing device 1100 may include an audio input device 1118 (or corresponding interface circuitry, as described above). Audio input device 1118 may include any device that generates a signal representing sound, such as a microphone, microphone array, or digital instrument (e.g., an instrument with a Musical Instrument Digital Interface (MIDI) output).

The computing device 1100 may include a GPS device 1116 (or corresponding interface circuitry, as described above). The GPS device 1116 may communicate with a satellite-based system and may receive the location of the computing device 1100, as is known in the art.

Computing device 1100 can include another output device 1110 (or corresponding interface circuitry, as described above). Examples of other output devices 1110 may include an audio codec, a video codec, a printer, a wired or wireless transmitter for providing information to other devices, or additional storage devices.

Computing device 1100 can include another input device 1120 (or corresponding interface circuitry, as described above). Examples of other input devices 1120 may include an accelerometer, a gyroscope, a compass, an image capture device, a keyboard, a cursor control device such as a mouse, a stylus, a touchpad, a bar code reader, a Quick Response (QR) code reader, any sensor, or a Radio Frequency Identification (RFID) reader.

The computing device 1100 may have any desired form factor, such as a handheld or mobile computer system (e.g., a cellular telephone, a smartphone, a mobile internet device, a music player, a tablet computer, a laptop computer, a netbook computer, an ultrabook computer, a Personal Digital Assistant (PDA), an ultra mobile personal computer, etc.), a desktop computer system, a server or other networked computing component, a printer, a scanner, a monitor, a set-top box, an entertainment control unit, a vehicle control unit, a digital camera, a digital video recorder, or a wearable computer system. In some embodiments, computing device 1100 may be any other electronic device that processes data.

Select examples

The following paragraphs provide various examples of the embodiments disclosed herein.

Example 1 provides a method for performing a tensor multiplication operation in a neural network, comprising: selecting a precision level for a layer in the neural network based on a target accuracy of the neural network, the layer comprising a multiplication operation on a first tensor and a second tensor of the neural network, the first tensor comprising a first data element, the second tensor comprising a second data element, the first data element and the second data element having a first precision; converting the first data element into a first pair of data elements; converting the second data element into a second pair of data elements, the first pair of data elements and the second pair of data elements having a second precision; identifying a function from a plurality of functions based on the selected precision level of the layer; calculating an output data element for the layer by applying the identified function to the first pair of data elements and the second pair of data elements, the output data element having the first precision; and executing another layer in the neural network using the output data element.

Example 2 provides the method of example 1, wherein selecting the precision level comprises selecting the precision level from a plurality of predetermined precision levels, the plurality of predetermined precision levels corresponding to different functions of the plurality of functions.

Example 3 provides the method of example 2, further comprising selecting a different one of the plurality of predetermined precision levels as a precision level of another layer in the neural network.
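The mapping from predetermined precision levels to functions in Examples 1-3 can be illustrated with a minimal Python sketch. The three level indices, the function names, and the particular cross terms each function keeps are assumptions made for illustration only; the examples above do not specify them. Each operand is represented as a (high, low) pair whose sum approximates the original data element.

```python
from typing import Callable, Dict, Tuple

Pair = Tuple[float, float]  # (high part, low part) of one data element

def product_level0(a: Pair, b: Pair) -> float:
    # Hypothetical lowest level: keep only the high x high term.
    return a[0] * b[0]

def product_level1(a: Pair, b: Pair) -> float:
    # Hypothetical middle level: add the two high x low cross terms.
    return a[0] * b[0] + a[0] * b[1] + a[1] * b[0]

def product_level2(a: Pair, b: Pair) -> float:
    # Hypothetical highest level: keep all four partial products.
    return product_level1(a, b) + a[1] * b[1]

# Each predetermined precision level corresponds to a different function.
FUNCTIONS: Dict[int, Callable[[Pair, Pair], float]] = {
    0: product_level0,
    1: product_level1,
    2: product_level2,
}

def multiply(a: Pair, b: Pair, level: int) -> float:
    """Identify the function for the selected level and apply it."""
    return FUNCTIONS[level](a, b)

# Different layers may use different levels (layer names are made up).
layer_levels = {"conv1": 2, "fc": 0}
a, b = (1.5, 0.25), (2.0, 0.125)
results = {name: multiply(a, b, lvl) for name, lvl in layer_levels.items()}
```

In this sketch a higher level performs more low-precision multiplications and gets closer to the exact product (1.75 x 2.125 in the usage lines), which is the trade-off between accuracy and resource consumption that the per-layer selection is meant to balance.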

Example 4 provides the method of any of examples 1-3, wherein the first precision is higher than the second precision, and converting the first data element into the first pair of data elements includes removing one or more bits in the first data element.

Example 5 provides the method of example 4, wherein a data element in the first pair of data elements is calculated by removing one or more bits in the first data element, and another data element in the first pair of data elements is calculated based on the data element in the first pair of data elements.
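The conversion in Examples 4 and 5 can be sketched as follows, under the assumption (not stated in these examples) that the first precision is 32-bit floating point and the second precision is a 16-bit format obtained by keeping only the upper 16 bits of the 32-bit encoding. The helper names are hypothetical. The first element of the pair is produced by removing bits; the second is derived from the first by truncating the residual.

```python
import struct
from typing import Tuple

def f32_to_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 encoding of x (rounded to float32)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_f32(bits: int) -> float:
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def drop_low_bits(x: float) -> float:
    """Remove the 16 least significant bits of the float32 encoding."""
    return bits_to_f32(f32_to_bits(x) & 0xFFFF0000)

def split_pair(x: float) -> Tuple[float, float]:
    """Convert one float32 element into a (high, low) lower-precision pair."""
    x32 = bits_to_f32(f32_to_bits(x))  # the element at the first precision
    high = drop_low_bits(x32)          # computed by removing bits
    low = drop_low_bits(x32 - high)    # computed based on the first element
    return high, low

high, low = split_pair(3.14159)
```

Here `high + low` is closer to the original value than `high` alone, so a function that uses both elements of each pair can recover accuracy that a single truncated value would lose.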

Example 6 provides the method of any of examples 1-5, wherein selecting the precision level for the layer comprises generating a graph representing at least a portion of the neural network, the graph comprising nodes and one or more links between the nodes, the nodes representing layers in the neural network, the one or more links representing one or more connections between the layers, and performing a search using the graph to select the precision level for the layer.

Example 7 provides the method of example 6, wherein the precision level is selected from a plurality of predetermined precision levels, and the nodes in the graph have states encoding the plurality of predetermined precision levels.

Example 8 provides the method of example 7, wherein performing the search using the graph includes determining one or more criteria for the search based on a target accuracy of the neural network, and updating a state of nodes in the graph until the one or more criteria are met.

Example 9 provides the method of example 8, wherein the one or more criteria for the search are further determined based on an estimated consumption of one or more computing resources for performing the neural network.
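One way to picture the graph search of Examples 6-9 is the greedy sketch below. It is only an illustration: the examples do not name a search algorithm, so the greedy strategy, the `LayerNode` class, and the `estimate_accuracy` callback are all assumptions. Each node stores its current precision level as its state, and the search keeps lowering node states while the estimated accuracy still meets the target. Lower levels are assumed to consume fewer computing resources.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class LayerNode:
    name: str
    level: int = 0                                        # state: index of a predetermined level
    successors: List[str] = field(default_factory=list)   # links to connected layers

Graph = Dict[str, LayerNode]

def search_precision_levels(
    graph: Graph,
    num_levels: int,
    estimate_accuracy: Callable[[Graph], float],  # hypothetical evaluator
    target_accuracy: float,
) -> Dict[str, int]:
    """Greedy stand-in for the search: start high, lower while the criterion holds."""
    for node in graph.values():
        node.level = num_levels - 1
    changed = True
    while changed:
        changed = False
        for node in graph.values():
            if node.level == 0:
                continue
            node.level -= 1                        # tentatively update the node state
            if estimate_accuracy(graph) >= target_accuracy:
                changed = True                     # criterion still met: keep the update
            else:
                node.level += 1                    # criterion violated: revert
    return {name: node.level for name, node in graph.items()}

# Toy usage: accuracy (in percentage points) drops by a per-layer penalty
# for every level below the maximum. The penalties are made-up numbers.
PENALTY = {"conv1": 5, "fc": 1}

def toy_accuracy(graph: Graph) -> float:
    return 100 - sum(PENALTY[n.name] * (2 - n.level) for n in graph.values())

graph = {"conv1": LayerNode("conv1", successors=["fc"]), "fc": LayerNode("fc")}
levels = search_precision_levels(graph, 3, toy_accuracy, target_accuracy=93)
```

In the toy run the accuracy-sensitive layer `conv1` ends at a higher level than `fc`, which tolerates the lowest level. A practical search would also weigh the estimated resource consumption mentioned in Example 9 rather than only lowering levels greedily.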

Example 10 provides the method of any of examples 1-9, wherein the layer is a convolution layer or a linear layer, the first tensor is an input tensor of the convolution layer or the linear layer, the second tensor comprises a kernel of the convolution layer or the linear layer, and the tensor multiplication operation is part of a convolution or a linear transformation.

Example 11 provides one or more non-transitory computer-readable media storing instructions executable to perform operations for performing a tensor multiplication operation in a neural network, the operations comprising: selecting a precision level for a layer in the neural network based on a target accuracy of the neural network, the layer comprising a multiplication operation on a first tensor and a second tensor of the neural network, the first tensor comprising a first data element, the second tensor comprising a second data element, the first data element and the second data element having a first precision; converting the first data element into a first pair of data elements; converting the second data element into a second pair of data elements, the first pair of data elements and the second pair of data elements having a second precision; identifying a function from a plurality of functions based on the selected precision level of the layer; calculating an output data element for the layer by applying the identified function to the first pair of data elements and the second pair of data elements, the output data element having the first precision; and performing another layer in the neural network using the output data element.

Example 12 provides the one or more non-transitory computer-readable media of example 11, wherein selecting the precision level comprises selecting the precision level from a plurality of predetermined precision levels, the plurality of predetermined precision levels corresponding to different functions of the plurality of functions.

Example 13 provides the one or more non-transitory computer-readable media of example 12, wherein the operations further comprise selecting a different one of the plurality of predetermined precision levels as a precision level of another layer in the neural network.

Example 14 provides the one or more non-transitory computer-readable media of any of examples 11-13, wherein the first precision is higher than the second precision, and converting the first data element into the first pair of data elements comprises removing one or more bits in the first data element.

Example 15 provides the one or more non-transitory computer-readable media of example 14, wherein a data element in the first pair of data elements is calculated by removing one or more bits in the first data element, and another data element in the first pair of data elements is calculated based on the data element in the first pair of data elements.

Example 16 provides the one or more non-transitory computer-readable media of any of examples 11-15, wherein selecting the precision level for the layer comprises generating a graph representing at least a portion of the neural network, the graph comprising nodes and one or more links between the nodes, the nodes representing layers in the neural network, the one or more links representing one or more connections between the layers, and performing a search using the graph to select the precision level for the layer.

Example 17 provides an apparatus comprising a computer processor to execute computer program instructions, and a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations for performing a tensor multiplication operation in a neural network, the operations comprising: selecting a precision level for a layer in the neural network based on a target accuracy of the neural network, the layer comprising a multiplication operation on a first tensor and a second tensor of the neural network, the first tensor comprising a first data element, the second tensor comprising a second data element, the first data element and the second data element having a first precision; converting the first data element into a first pair of data elements; converting the second data element into a second pair of data elements, the first pair of data elements and the second pair of data elements having a second precision; identifying a function from a plurality of functions based on the selected precision level of the layer; calculating an output data element for the layer by applying the identified function to the first pair of data elements and the second pair of data elements, the output data element having the first precision; and performing another layer in the neural network using the output data element.

Example 18 provides the apparatus of example 17, wherein selecting the precision level comprises selecting the precision level from a plurality of predetermined precision levels, the plurality of predetermined precision levels corresponding to different functions of the plurality of functions.

Example 19 provides the apparatus of example 17 or 18, wherein the first precision is higher than the second precision, and converting the first data element into the first pair of data elements comprises removing one or more bits in the first data element.

Example 20 provides the apparatus of any of examples 17-19, wherein selecting a precision level for the layer comprises generating a graph representing at least a portion of a neural network, the graph comprising nodes and one or more links between nodes, the nodes representing layers in the neural network, the one or more links representing one or more connections between layers, and performing a search using the graph to select the precision level for the layer.

The above description of illustrated implementations of the disclosure, including what is described in the abstract, is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. While specific implementations of, and examples for, the disclosure are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the disclosure, as those skilled in the relevant art will recognize. These modifications can be made to the disclosure in light of the above detailed description.

Claims

1. A method for performing tensor multiplication operations in a neural network, comprising:

selecting a precision level for a layer in the neural network based on a target accuracy of the neural network, the layer comprising a multiplication operation on a first tensor and a second tensor of the neural network, the first tensor comprising a first data element, the second tensor comprising a second data element, and the first data element and the second data element having a first precision;

converting the first data element into a first pair of data elements;

converting the second data element into a second pair of data elements, the first pair of data elements and the second pair of data elements having a second precision;

identifying a function from a plurality of functions based on the selected precision level of the layer;

calculating an output data element of the layer by applying the identified function to the first pair of data elements and the second pair of data elements, the output data element having the first precision; and

performing another layer in the neural network using the output data element.

2. The method according to claim 1, wherein selecting the precision level includes:

selecting the precision level from a plurality of predetermined precision levels, the plurality of predetermined precision levels corresponding to different functions of the plurality of functions.

3. The method according to claim 2, further comprising:

selecting a different one of the plurality of predetermined precision levels as a precision level of another layer in the neural network.

4. The method of claim 1, wherein the first precision is higher than the second precision, and converting the first data element into a first pair of data elements includes removing one or more bits from the first data element.

5. The method of claim 4, wherein a data element in the first pair of data elements is calculated by removing one or more bits from the first data element, and another data element in the first pair of data elements is calculated based on the data element in the first pair of data elements.

6. The method of claim 1, wherein selecting a precision level for the layer comprises:

generating a graph representing at least a portion of the neural network, the graph comprising nodes and one or more links between the nodes, the nodes representing layers in the neural network, and the one or more links representing one or more connections between the layers; and

performing a search using the graph to select the precision level for the layer.

7. The method of claim 6, wherein the precision level is selected from a plurality of predetermined precision levels, and the nodes in the graph have states that encode the plurality of predetermined precision levels.

8. The method of claim 7, wherein performing the search using the graph comprises:

determining one or more criteria for the search based on the target accuracy of the neural network; and

updating states of the nodes in the graph until the one or more criteria are met.

9. The method of claim 8, wherein the one or more criteria used for the search are further determined based on an estimated consumption of one or more computational resources used to perform the neural network.

10. The method of claim 1, wherein the layer is a convolutional layer or a linear layer, the first tensor is an input tensor of the convolutional layer or the linear layer, the second tensor comprises a kernel of the convolutional layer or the linear layer, and the tensor multiplication operation is part of a convolution or a linear transformation.

11. One or more non-transitory computer-readable media storing instructions executable to perform operations for performing tensor multiplication operations in a neural network, the operations including:

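selecting a precision level for a layer in the neural network based on a target accuracy of the neural network, the layer comprising a multiplication operation on a first tensor and a second tensor of the neural network, the first tensor comprising a first data element, the second tensor comprising a second data element, and the first data element and the second data element having a first precision;

converting the first data element into a first pair of data elements;

converting the second data element into a second pair of data elements, the first pair of data elements and the second pair of data elements having a second precision;

identifying a function from a plurality of functions based on the selected precision level of the layer;

calculating an output data element of the layer by applying the identified function to the first pair of data elements and the second pair of data elements, the output data element having the first precision; and

performing another layer in the neural network using the output data element.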

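12. The one or more non-transitory computer-readable media of claim 11, wherein selecting the precision level comprises:

selecting the precision level from a plurality of predetermined precision levels, the plurality of predetermined precision levels corresponding to different functions of the plurality of functions.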

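13. The one or more non-transitory computer-readable media of claim 12, wherein the operations further comprise:

selecting a different one of the plurality of predetermined precision levels as a precision level of another layer in the neural network.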

14. The one or more non-transitory computer-readable media of claim 11, wherein the first precision is higher than the second precision, and converting the first data element into a first pair of data elements includes removing one or more bits from the first data element.

15. The one or more non-transitory computer-readable media of claim 14, wherein a data element in the first pair of data elements is calculated by removing one or more bits from the first data element, and another data element in the first pair of data elements is calculated based on the data element in the first pair of data elements.

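16. The one or more non-transitory computer-readable media of claim 11, wherein selecting the precision level for the layer comprises:

generating a graph representing at least a portion of the neural network, the graph comprising nodes and one or more links between the nodes, the nodes representing layers in the neural network, and the one or more links representing one or more connections between the layers; and

performing a search using the graph to select the precision level for the layer.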

17. An apparatus comprising:

a computer processor for executing computer program instructions; and

a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations for performing tensor multiplication operations in a neural network, the operations comprising:

selecting a precision level for a layer in the neural network based on a target accuracy of the neural network, the layer comprising a multiplication operation on a first tensor and a second tensor of the neural network, the first tensor comprising a first data element, the second tensor comprising a second data element, and the first data element and the second data element having a first precision;

converting the first data element into a first pair of data elements;

converting the second data element into a second pair of data elements, the first pair of data elements and the second pair of data elements having a second precision;

identifying a function from a plurality of functions based on the selected precision level of the layer;

calculating an output data element of the layer by applying the identified function to the first pair of data elements and the second pair of data elements, the output data element having the first precision; and

performing another layer in the neural network using the output data element.

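18. The apparatus of claim 17, wherein selecting the precision level comprises:

selecting the precision level from a plurality of predetermined precision levels, the plurality of predetermined precision levels corresponding to different functions of the plurality of functions.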

19. The apparatus of claim 17, wherein the first precision is higher than the second precision, and converting the first data element into a first pair of data elements includes removing one or more bits from the first data element.

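20. The apparatus of claim 17, wherein selecting the precision level for the layer comprises:

generating a graph representing at least a portion of the neural network, the graph comprising nodes and one or more links between the nodes, the nodes representing layers in the neural network, and the one or more links representing one or more connections between the layers; and

performing a search using the graph to select the precision level for the layer.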