CN120508407B

CN120508407B - Optimization method for data type conversion operator, computer device, readable storage medium and computer program product

Info

Publication number: CN120508407B
Application number: CN202511007685.7A
Authority: CN
Inventors: Name withheld on request
Original assignee: Shanghai Bi Ren Technology Co ltd
Current assignee: Shanghai Bi Ren Technology Co ltd
Priority date: 2025-07-22
Filing date: 2025-07-22
Publication date: 2025-09-12
Anticipated expiration: 2045-07-22
Also published as: CN120508407A

Abstract

The application relates to a method for optimizing a data type conversion operator, a computer device, a readable storage medium and a computer program product. The method comprises: allocating memory resources for target data; establishing a read mapping relation between each logical index in a logical structure of source data and a memory address of the source data, and a write mapping relation between each logical index in the logical structure and a memory address of the target data; and traversing the logical structure, wherein, for any logical index traversed, a read address corresponding to the logical index is determined according to the read mapping relation, data is read from the memory resources corresponding to the source data based on the read address, the read data element is converted from a first data type to a second data type, and the converted data element is written into the memory resources corresponding to the target data according to the write mapping relation. The method can improve the computing efficiency of the data type conversion operator.

Description

Optimization method for data type conversion operator, computer device, readable storage medium and computer program product

Technical Field

The present application relates to the field of computer technology, and in particular, to a method for optimizing a data type conversion operator, a computer device, a readable storage medium, and a computer program product.

Background

In the current fields of computer graphics processing and high-performance computing, parallel computing based on the mainstream GPU (Graphics Processing Unit) architecture is widely used and has become a key support for improving the efficiency of various computing tasks.

When such a GPU architecture performs a data type conversion (typecast) operation, the related art generally maps the physical addresses of the memory directly to the physical addresses of on-chip registers. Under this direct mapping mechanism, each thread needs to independently calculate the memory address for its own operation, which generates a large amount of address calculation overhead. This additional overhead not only continuously consumes the computing resources of the chip but also reduces the overall computing efficiency.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a method for optimizing a data type conversion operator, a computer device, a readable storage medium, and a computer program product that can improve the calculation efficiency of the data type conversion operator.

In a first aspect, the present application provides a method for optimizing a data type conversion operator, including:

allocating memory resources for target data, wherein the target data is data obtained after data type conversion of source data, the source data corresponds to a first data type, and the target data corresponds to a second data type;

determining a logical structure corresponding to the source data, establishing a read mapping relation between each logical index in the logical structure and a memory address of the source data, and establishing a write mapping relation between each logical index in the logical structure and a memory address of the target data;

traversing the logical structure, determining, for any traversed logical index, a read address corresponding to the logical index based on the read mapping relation, reading data from a memory resource corresponding to the source data based on the read address, converting the read data element from the first data type to the second data type, and writing the converted data element into the memory resource corresponding to the target data according to the write mapping relation.

In one embodiment, the determining the logical structure corresponding to the source data, establishing a read mapping relationship between each logical index in the logical structure and the memory address of the source data, and establishing a write mapping relationship between each logical index in the logical structure and the memory address of the target data, includes:

determining a target conversion strategy based on the first data type and the second data type, wherein the target conversion strategy comprises at least one conversion stage, a first conversion stage comprises converting the first data type into a third data type, and a last conversion stage comprises converting a fourth data type into the second data type;

determining a logical structure corresponding to the source data, and determining a first mapping relation and a second mapping relation for each conversion stage, wherein the first mapping relation is a mapping relation between the logical structure and a memory address corresponding to first data in the conversion stage, the second mapping relation is a mapping relation between the logical structure and a memory address corresponding to second data in the conversion stage, the first data is the source data of the conversion stage, and the second data is the target data of the conversion stage.

In one embodiment, the determining the read address corresponding to the logical index based on the read mapping relationship, and performing data reading from the memory resource corresponding to the source data based on the read address, converting the read data element from the first data type to the second data type, and writing the data element after the data type conversion into the memory resource corresponding to the target data according to the write mapping relationship includes:

sequentially executing the data type conversion corresponding to each conversion stage, wherein the data type conversion process of a conversion stage comprises:

determining a read address corresponding to the logical index based on the first mapping relation, reading data from a memory resource corresponding to the first data based on the read address, converting the read data element from the data type of the first data into the data type of the second data, and writing the converted data element into the memory resource corresponding to the second data according to the second mapping relation.

In one embodiment, the method further comprises:

for any of the conversion stages, determining an iteration step size for traversing the logical structure and a burst transfer length for reading data from the memory resource during the conversion stage.

In one embodiment, the determining the target conversion policy based on the first data type and the second data type includes:

searching for a mapping relation based on the first data type and the second data type;

in the case that a mapping relation between the first data type and the second data type is found, determining that the target conversion strategy is a direct conversion strategy, wherein the direct conversion strategy comprises one conversion stage;

or, in the case that no mapping relation between the first data type and the second data type is found, determining that the target conversion strategy is a multi-stage conversion strategy, wherein the multi-stage conversion strategy comprises at least two conversion stages.

In one embodiment, the determining the iteration step size for traversing the logical structure and the burst transfer length for reading data from the memory resource during the conversion stage includes:

calculating the iteration step size and the burst transfer length according to the data type of the first data and the data type of the second data, the physical layout of the first data and the physical layout of the second data, and the maximum supported burst transfer length.

In one embodiment, the iteration step size satisfies the following conditions:

after the iteration step size is converted into a physical byte step size, it does not exceed the maximum supported burst transfer length, and the starting address is aligned according to the data type;

the burst transfer length is the minimum of the maximum supported burst transfer length and the number of physical bytes of contiguous data in the current iteration window.
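The two conditions on the iteration step size and the burst transfer length can be illustrated with a minimal sketch. The function names `iteration_step`, `burst_length` and `is_aligned` are hypothetical, and sizing the step by the larger of the two element sizes (so that both the read side and the write side fit in one burst) is an assumption made here for illustration; the text above does not fix that choice.

```python
def iteration_step(first_size, second_size, max_burst_bytes):
    """Largest element count whose byte footprint fits in one burst.

    Assumption: the step is sized by the larger element so that both the
    read side (first data) and the write side (second data) stay within
    the maximum supported burst transfer length.
    """
    step = max_burst_bytes // max(first_size, second_size)
    if step < 1:
        raise ValueError("maximum burst length is smaller than one element")
    return step


def burst_length(contiguous_bytes, max_burst_bytes):
    """Minimum of the maximum burst length and the contiguous bytes in the window."""
    return min(max_burst_bytes, contiguous_bytes)


def is_aligned(address, element_size):
    """Check that a starting address is aligned to the element size."""
    return address % element_size == 0


# float32 (4 bytes) to float16 (2 bytes) with a hypothetical 64-byte burst limit.
step = iteration_step(first_size=4, second_size=2, max_burst_bytes=64)
short_window = burst_length(contiguous_bytes=40, max_burst_bytes=64)
long_window = burst_length(contiguous_bytes=256, max_burst_bytes=64)
```

With these illustrative numbers, one iteration covers 16 elements, that is 64 bytes on the read side and 32 bytes on the write side, and a window holding only 40 contiguous bytes is transferred as a 40-byte burst.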

In a second aspect, the present application also provides an apparatus for optimizing a data type conversion operator, including:

an allocation module, configured to allocate memory resources for target data, wherein the target data is data obtained after data type conversion of source data, the source data corresponds to a first data type, and the target data corresponds to a second data type;

an establishing module, configured to determine a logical structure corresponding to the source data, establish a read mapping relation between each logical index in the logical structure and a memory address of the source data, and establish a write mapping relation between each logical index in the logical structure and a memory address of the target data;

a conversion module, configured to traverse the logical structure, determine, for any traversed logical index, a read address corresponding to the logical index based on the read mapping relation, read data from a memory resource corresponding to the source data based on the read address, convert the read data element from the first data type to the second data type, and write the converted data element into the memory resource corresponding to the target data according to the write mapping relation.

In a third aspect, the present application also provides a computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements any of the above methods for optimizing a data type conversion operator when executing the computer program.

In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of optimizing a data type conversion operator of any of the above.

In a fifth aspect, the application also provides a computer program product comprising a computer program which when executed by a processor implements the method of optimizing a data type conversion operator of any of the above.

With the above method for optimizing a data type conversion operator, computer device, readable storage medium and computer program product, memory resources are allocated for target data, wherein the target data is data obtained after data type conversion of source data, the source data corresponds to a first data type, and the target data corresponds to a second data type. A logical structure corresponding to the source data is determined, a read mapping relation between each logical index in the logical structure and a memory address of the source data is established, and a write mapping relation between each logical index in the logical structure and a memory address of the target data is established. The logical structure is traversed and, for any logical index traversed, a read address corresponding to the logical index is determined based on the read mapping relation, data is read from the memory resource corresponding to the source data based on the read address, the read data element is converted from the first data type to the second data type, and the converted data element is written into the memory resource corresponding to the target data according to the write mapping relation. Because the mapping relations between the logical indexes and the memory addresses of the source and target data are established, each thread does not need to perform address calculation during the data type conversion and instead looks up the read and write memory addresses directly from the mapping relations. This greatly reduces the amount of calculation, saves the computing resources of the chip, and improves the calculation efficiency of the data type conversion operator.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly described below. The drawings in the following description show only some embodiments of the present application, and those of ordinary skill in the art may obtain other related drawings from these drawings without inventive effort.

FIG. 1 is a schematic diagram of a data type conversion process of a conventional data type conversion operator in one embodiment;

FIG. 2 is a flow diagram of a method of optimizing a data type conversion operator in one embodiment;

FIG. 3 is a block diagram of an artificial intelligence processor in one embodiment;

FIG. 4 is a schematic diagram of step 204 in one embodiment;

FIG. 5 is a schematic diagram of a data type conversion process for a data type conversion operator in one embodiment;

FIG. 6 is a block diagram of an apparatus for optimizing data type conversion operators in one embodiment;

FIG. 7 is an internal structural diagram of a computer device in one embodiment.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

In the related art, during data type conversion, the source data (src for short) is physically mapped into on-chip registers, the typecast operation is executed in the on-chip registers, and the target data obtained after the data type conversion is stored back into the memory through physical mapping.

FIG. 1 is a schematic diagram of mapping memory physical addresses directly to on-chip register physical addresses. For example, in the physical layout, the labels 0-15 denote storage units at different physical addresses, in which the source data to be converted is stored. For the data element stored in each storage unit, a mapping between the physical address of the storage unit and an on-chip register address is calculated to determine the register into which the element is written. In FIG. 1, each mapping between a physical address and a register address is identified by the same color, where R is the identifier of a register, T is the identifier of a thread, R0T0 represents register R0 of thread T0, and so on. For example, physical address one (labeled 0 in FIG. 1) corresponds to R0T0, physical address two (labeled 1 in FIG. 1) corresponds to R1T0, physical address three (labeled 2 in FIG. 1) corresponds to R0T1, physical address four (labeled 3 in FIG. 1) corresponds to R1T1, and so on.

The source data stored in the memory can be written into the on-chip registers based on the mapping between memory physical addresses and on-chip registers, and the data is written back into the memory after the data type conversion is carried out in the on-chip registers. When the converted data is written from the on-chip registers to the memory, the mapping between the address of the on-chip register storing the converted data and the physical address is calculated to determine the physical address to be written. In FIG. 1, taking conversion from f32 to f16 as an example, the converted data is stored in R3T0, R3T1, R3T2 and R3T3, and the calculation gives R3T0 corresponding to physical address one, R3T1 to physical address two, R3T2 to physical address five (labeled 8 in FIG. 1), and R3T3 to physical address six (labeled 9 in FIG. 1). The target data stored in the registers after the data type conversion can then be written into the physical memory based on this mapping.

Therefore, for every data type conversion, each thread must perform mapping-address calculations, such as physical address to register and register to physical address, and each of these calculations is carried out independently within the flow. This consumes a large amount of computing resources and lowers the conversion efficiency.

The embodiment of the application provides a method for optimizing a data type conversion operator. By establishing mapping relations between the logical indexes and the memory addresses of the source and target data, each thread does not need to perform address calculation during the data type conversion and instead looks up the read and write memory addresses directly from the mapping relations. This greatly reduces the amount of calculation, saves the computing resources of the chip, and improves the calculation efficiency of the data type conversion operator.

In an exemplary embodiment, as shown in FIG. 2, a method for optimizing a data type conversion operator is provided and applied to an artificial intelligence processor. The artificial intelligence processor in the embodiment of the present application may be any one of a GPU (Graphics Processing Unit), a TPU (Tensor Processing Unit), an NPU (Neural network Processing Unit), a DPU (Deep learning Processing Unit), an APU (Accelerated Processing Unit), and a GPGPU (General-Purpose Graphics Processing Unit); in this embodiment, a GPU is taken as an example.

Referring to FIG. 3, a schematic diagram of an artificial intelligence processor is shown. The artificial intelligence processor architecture includes multiple parallel, independent CUs (Computing Units) and a global memory (GLM) that provides shared storage support for all CUs. Each CU is an independent computation submodule and comprises threads, a TLR (Thread Level Register) and a GSM (Group Shared Memory). A thread is the finest-grained parallel execution unit in the CU, and multiple threads can be scheduled simultaneously to split parallel computation tasks (such as matrix operations and convolution operations). The TLR is located between the threads and the GSM, and allocates computing resources for the threads or prepares and buffers the data to be operated on. The GSM is high-speed local storage specific to a CU, and is used for storing data that the current CU accesses frequently (such as model parameter fragments and intermediate calculation results), so as to reduce the latency of accessing the global memory. The global memory serves as globally shared storage for complete model parameters, input data sets and intermediate results that need to be exchanged among CUs.

The global memory distributes data (e.g., model weights, input features) to the GSM of each CU for local computation by that CU. In each CU, the threads rely on TLR scheduling and use the data in the GSM to execute AI tasks (such as the calculation of a neural network layer), and multiple CUs work in parallel. The calculation results of a CU can be temporarily stored in the GSM or written back to the GLM for other CUs to use, supporting collaborative calculation across CUs.

Referring to FIG. 2, the method for optimizing a data type conversion operator according to an embodiment of the present application may include the following steps 202 to 206:

Step 202, memory resources are allocated to target data, wherein the target data is data after data type conversion of source data, the source data corresponds to a first data type, and the target data corresponds to a second data type.

In the embodiment of the application, memory resources can be allocated for the target data based on the logical shape and the data type of the target data. The target data is obtained by performing the data type conversion operation on the source data: assuming that the source data corresponds to the first data type, the data type conversion operation converts the source data from the first data type to the second data type, and the converted result is the target data.

For example, if the memory resource is global memory, the source data and the target data are both matrices with a logical shape of 64×32, and the second data type corresponding to the target data is float32, the required memory size of the target data can be calculated as 64 × 32 × 4 B = 8192 B, and global memory of the corresponding size can then be allocated for the target data.
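The size calculation in this example can be sketched as follows. The table `ELEMENT_SIZE_BYTES` and the function `required_bytes` are hypothetical names introduced for illustration, and the `bytearray` is only a host-side stand-in for a global memory allocation on the device.

```python
from math import prod

# Bytes per element for a few data types (illustrative, not exhaustive).
ELEMENT_SIZE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def required_bytes(logical_shape, dtype):
    """Return the number of bytes needed for a densely packed tensor."""
    return prod(logical_shape) * ELEMENT_SIZE_BYTES[dtype]


# Target data: a 64 x 32 matrix whose second data type is float32.
target_size = required_bytes((64, 32), "float32")

# Stand-in for allocating global memory of the corresponding size.
target_buffer = bytearray(target_size)
```

Because the size depends only on the logical shape and the element size, the same helper gives 4096 bytes if the second data type is float16 instead.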

Step 204, determining a logical structure corresponding to the source data, establishing a read mapping relationship between each logical index in the logical structure and the memory address of the source data, and establishing a write mapping relationship between each logical index in the logical structure and the memory address of the target data.

In the embodiment of the application, the logical structure of the source data is constructed from the coordinates of the data elements in the source data, and the coordinates of each data element are used directly as its logical index. In this way, by mapping physical storage locations into a mathematical coordinate space, an abstract representation of data storage and access is achieved. Specifically, for matrix data of arbitrary size m×n, the logical index of the data element located at position (i, j) is directly defined as the coordinate value (i, j). Taking a 64×32 matrix as an example, the logical index of the data element at position (32, 16) in the matrix is (32, 16).

After the logical structure corresponding to the source data is determined, the memory addresses of the source data can be mapped onto the logical structure based on a preset mapping rule to obtain the read mapping relation between each logical index in the logical structure and the memory address of the source data, and the memory addresses of the target data can be mapped onto the logical structure based on the preset mapping rule to obtain the write mapping relation between each logical index in the logical structure and the memory address of the target data. In the embodiment of the application, the target data is the result of data type conversion of the source data; that is, the source data and the target data are identical in structure, and their corresponding logical structures match exactly. This consistency allows the construction of the logical structure to be completely decoupled from the data type of the data (fp16, bf16, int8, etc.).

That is, in the embodiment of the present application, after the logical structure of the source data is established, bidirectional mapping relations between the source data memory addresses and the logical structure, and between the target data memory addresses and the logical structure, are respectively established through a two-dimensional mapping rule system, which is highly flexible and extensible. It should be noted that the present application is not limited to a specific mapping rule and allows various implementations, including but not limited to linear mapping, hash mapping, and piecewise mapping. Any mapping rule is applicable to the embodiments of the present application as long as it achieves an efficient mapping between physical addresses and the logical structure.
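As an illustration of one permitted mapping rule, the following minimal sketch builds read and write mapping relations with a linear (row-major) rule. The function name `build_linear_mapping` and the base addresses `0x1000` and `0x8000` are assumptions made for this example, as is the choice of float32 source data and float16 target data.

```python
from itertools import product


def build_linear_mapping(shape, base_address, element_size):
    """Map every logical index (i, j) to a byte address, row-major."""
    rows, cols = shape
    return {
        (i, j): base_address + (i * cols + j) * element_size
        for i, j in product(range(rows), range(cols))
    }


shape = (64, 32)

# Read mapping: logical index -> address of the float32 source element.
read_map = build_linear_mapping(shape, base_address=0x1000, element_size=4)

# Write mapping: logical index -> address of the float16 target element.
write_map = build_linear_mapping(shape, base_address=0x8000, element_size=2)
```

Both tables share the same key set because the source and target data have the same logical structure; only the base address and the element size differ. A hash or piecewise rule would replace the body of `build_linear_mapping` without changing how the tables are used.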

Step 206, traversing the logic structure, determining a read address corresponding to the logic index based on the read mapping relation for any one of the traversed logic indexes, reading data from the memory resource corresponding to the source data based on the read address, converting the read data element from the first data type to the second data type, and writing the data element after the data type conversion into the memory resource corresponding to the target data according to the write mapping relation.

In the embodiment of the application, during execution of the instruction corresponding to the data type conversion operator, the logical structure can be traversed to read data. For the logical index of each traversed data element, the corresponding read address and write address are determined from the read mapping relation and the write mapping relation respectively; the data is then read from the read address, converted from the first data type to the second data type, and written to the write address. Once the logical structure has been fully traversed, the conversion of the source data from the first data type to the second data type is complete.
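The traversal can be sketched on the host with Python's standard `struct` module, which supports IEEE 754 single precision (`f`) and half precision (`e`). This is a sequential illustration only, not device code: the 2 × 4 shape, the sample values, and the use of byte offsets into `bytearray` buffers in place of device memory addresses are all assumptions.

```python
import struct
from itertools import product

ROWS, COLS = 2, 4
SRC_FMT, DST_FMT = "<f", "<e"  # float32 source, float16 target
SRC_SIZE = struct.calcsize(SRC_FMT)
DST_SIZE = struct.calcsize(DST_FMT)

# Source buffer holding 0.5, 1.5, ..., 7.5 in row-major order.
values = [k + 0.5 for k in range(ROWS * COLS)]
src = bytearray(struct.pack(f"<{ROWS * COLS}f", *values))

# Memory allocated for the target data.
dst = bytearray(ROWS * COLS * DST_SIZE)

# Read and write mapping relations: logical index -> byte offset.
indexes = list(product(range(ROWS), range(COLS)))
read_map = {(i, j): (i * COLS + j) * SRC_SIZE for i, j in indexes}
write_map = {(i, j): (i * COLS + j) * DST_SIZE for i, j in indexes}

# Traverse the logical structure: look up, read, convert, write.
for index in indexes:
    (element,) = struct.unpack_from(SRC_FMT, src, read_map[index])
    struct.pack_into(DST_FMT, dst, write_map[index], element)

converted = list(struct.unpack(f"<{ROWS * COLS}e", dst))
```

Inside the loop, the addresses come from table lookups rather than from per-element address arithmetic, which is the point of the read and write mapping relations. The sample values are exactly representable in float16, so `converted` equals `values` here; other values would be rounded.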

With the above method for optimizing a data type conversion operator, memory resources are allocated for target data, wherein the target data is data obtained after data type conversion of source data, the source data corresponds to a first data type, and the target data corresponds to a second data type. A logical structure corresponding to the source data is determined, a read mapping relation between each logical index in the logical structure and a memory address of the source data is established, and a write mapping relation between each logical index in the logical structure and a memory address of the target data is established. The logical structure is traversed and, for any logical index traversed, a read address corresponding to the logical index is determined based on the read mapping relation, data is read from the memory resource corresponding to the source data based on the read address, the read data element is converted from the first data type to the second data type, and the converted data element is written into the memory resource corresponding to the target data according to the write mapping relation. Because the mapping relations between the logical indexes and the memory addresses of the source and target data are established, each thread does not need to perform address calculation during the data type conversion and instead looks up the read and write memory addresses directly from the mapping relations. This greatly reduces the amount of calculation, saves the computing resources of the chip, and improves the calculation efficiency of the data type conversion operator.

In an exemplary embodiment, referring to FIG. 4, step 204 of determining a logical structure corresponding to the source data, establishing a read mapping relation between each logical index in the logical structure and a memory address of the source data, and establishing a write mapping relation between each logical index in the logical structure and a memory address of the target data may include the following steps 402 to 404:

Step 402, determining a target conversion strategy based on the first data type and the second data type, wherein the target conversion strategy comprises at least one conversion stage, the first conversion stage comprises converting the first data type into a third data type, and the last conversion stage comprises converting a fourth data type into the second data type;

Step 404, determining a logical structure corresponding to the source data, and determining a first mapping relation and a second mapping relation for each conversion stage, wherein the first mapping relation is a mapping relation between the logical structure and a memory address corresponding to the first data in the conversion stage, the second mapping relation is a mapping relation between the logical structure and a memory address corresponding to the second data in the conversion stage, the first data is the source data of the conversion stage, and the second data is the target data of the conversion stage.

In the embodiment of the application, the conversion strategy can be direct conversion or multi-stage conversion. Direct conversion performs a single conversion operation from the first data type to the second data type on the source data without any intermediate data type; that is, the whole conversion process comprises only one conversion stage. For example, if the first data type is float32 and the second data type is float16, the target conversion strategy comprises one conversion stage, namely the conversion from float32 to float16.

Multi-stage conversion divides the conversion of the source data from the first data type to the second data type into a plurality of successive conversion stages. For example, if the first data type is float32 and the second data type is float8, the overall conversion operation may include two conversion stages: the first conversion stage converts float32 to float16, and the second conversion stage converts float16 to float8.

That is, in an embodiment of the present application, the target conversion policy may include at least one conversion stage, wherein the first conversion stage includes converting the first data type into the third data type, and the last conversion stage includes converting the fourth data type into the second data type. When the target conversion strategy is direct conversion, the target conversion strategy comprises a conversion stage, wherein the first conversion stage is the last conversion stage, the third data type is the second data type, and the fourth data type is the first data type.

When the target conversion strategy is multi-stage conversion, it may include at least two conversion stages. For clarity, the source data of each conversion stage is referred to as the first data and the target data of each conversion stage is referred to as the second data, where the data type of the second data in one conversion stage is the data type of the first data in the next conversion stage, forming a chain conversion path. The data type of the first data in the first conversion stage is the first data type, and the data type of the second data in the last conversion stage is the second data type. It should be noted that when the target conversion strategy includes exactly two conversion stages, the third data type of the second data in the first conversion stage is the same as the fourth data type of the first data in the last conversion stage.
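The chain property of the conversion stages can be made concrete with a short sketch. The helper names `build_stages` and `is_chained` are hypothetical, and each stage is represented simply as a pair (data type of the first data, data type of the second data).

```python
def build_stages(type_path):
    """Turn a type path such as [T1, T3, T2] into per-stage type pairs."""
    return list(zip(type_path, type_path[1:]))


def is_chained(stages):
    """Check that each stage's second type is the next stage's first type."""
    return all(prev[1] == nxt[0] for prev, nxt in zip(stages, stages[1:]))


# Direct conversion: one stage, which is both the first and the last stage.
direct = build_stages(["float32", "float16"])

# Two-stage conversion: float16 is both the third and the fourth data type.
multi = build_stages(["float32", "float16", "float8"])
```

Here `direct` is `[("float32", "float16")]` and `multi` is `[("float32", "float16"), ("float16", "float8")]`; the first element of the first pair is the first data type and the second element of the last pair is the second data type in both cases.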

In an exemplary embodiment, determining the target conversion policy based on the first data type and the second data type in step 402 may include the steps of:

Determining that the target conversion strategy is a direct conversion strategy comprising one conversion stage in the case that a mapping relationship between the first data type and the second data type is found; or determining that the target conversion strategy is a multi-stage conversion strategy comprising at least two conversion stages in the case that no mapping relationship between the first data type and the second data type is found.

In the embodiment of the application, a preset mapping relationship may be searched for a conversion mapping between the first data type and the second data type. If such a mapping is found, the conversion operation is not split, and the target conversion strategy is determined to be the direct conversion strategy. If no such mapping is found, the conversion operation between the first data type and the second data type is split through one or more intermediate transition data types, yielding the multi-stage conversion strategy.

Illustratively, assuming that the first data type is T1 and the second data type is T2, a direct conversion relationship T1 -> T2 is looked up in a predefined mapping table. If the direct conversion relationship T1 -> T2 exists, the direct conversion strategy is selected; if it does not exist, an intermediate transition data type derivation flow is entered. The derivation flow first tries to find an intermediate transition data type T3 such that the conversion relationships T1 -> T3 and T3 -> T2 both exist in the mapping table. If T3 is found, a two-stage conversion T1 -> T3 -> T2 is selected. If T3 is not found, the flow tries to find an intermediate transition data type sequence T3, T4 such that the conversion relationships T1 -> T3, T3 -> T4 and T4 -> T2 all exist in the mapping table; if T3 and T4 are found, a three-stage conversion T1 -> T3 -> T4 -> T2 is selected. These steps are executed recursively until a feasible intermediate transition data type sequence is found, and the multi-stage conversion strategy is constructed based on that sequence.
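The lookup and derivation flow described above can be sketched as a breadth-first search over the mapping table, which tries one-stage paths first, then two-stage paths, and so on. This is only an illustrative sketch: the table contents `DIRECT` and the function name `plan_conversion` are assumptions made for the example and are not taken from the application.

```python
from collections import deque

# Hypothetical table of directly supported conversions (assumed for illustration).
DIRECT = {
    ("float32", "float16"),
    ("float16", "float32"),
    ("float16", "float8"),
    ("float16", "int8"),
}

def plan_conversion(src, dst, direct=DIRECT):
    """Return the shortest list of (from, to) stages, or None if no path exists."""
    if src == dst:
        return []
    # Breadth-first search: shorter chains are always examined before longer ones.
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        cur, path = queue.popleft()
        for a, b in sorted(direct):
            if a != cur or b in seen:
                continue
            new_path = path + [(a, b)]
            if b == dst:
                return new_path
            seen.add(b)
            queue.append((b, new_path))
    return None

direct_plan = plan_conversion("float32", "float16")  # one stage: direct conversion
multi_plan = plan_conversion("float32", "float8")    # two stages via float16
```

With this assumed table, `direct_plan` holds a single stage, while `multi_plan` holds the chain float32 -> float16 -> float8, in which the output type of each stage is the input type of the next.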

After the target conversion strategy is determined, for each conversion stage in the target conversion strategy, a first mapping relationship between the logical structure of the source data and the memory addresses storing the first data, and a second mapping relationship between the logical structure of the source data and the memory addresses storing the second data, may be determined. That is, a mapping relationship between the logical structure and the memory addresses storing the source data, a mapping relationship between the logical structure and the memory addresses storing the target data, and a mapping relationship between the logical structure and the memory addresses storing the data of each intermediate transition data type are determined respectively. For example, the source data and the target data may reside in global memory, and the data of the intermediate transition data type may reside in shared memory. The specific mapping relationships may be established with reference to the related description of the foregoing embodiments, which is not repeated here.

By adopting the optimization method of the data type conversion operator provided by the embodiment of the application, the high-efficiency data type conversion can be realized by flexibly dividing the conversion stage.

In an exemplary embodiment, determining the read address corresponding to the logical index based on the read mapping relationship, reading data from the memory resource corresponding to the source data based on the read address, converting the read data element from the first data type to the second data type, and writing the converted data element into the memory resource corresponding to the target data according to the write mapping relationship includes:

Sequentially executing the data type conversion corresponding to each conversion stage, wherein the data type conversion of a stage includes: determining the read address corresponding to the logical index based on the first mapping relationship, reading data from the memory resource corresponding to the first data based on the read address, converting the read data element from the data type of the first data to the data type of the second data, and writing the converted data element into the memory resource corresponding to the second data according to the second mapping relationship.

In the embodiment of the application, the data type conversion corresponding to each stage can be executed sequentially, starting from the source data. During the execution of any conversion stage, the logical structure is traversed; for the logical index of each traversed data element, the corresponding read address and write address are determined from the first mapping relationship and the second mapping relationship respectively. The data element is then read from the read address, converted from the data type of the first data to the data type of the second data, and written to the write address. Once the traversal of the logical structure is complete, the data type conversion of the current conversion stage is complete.
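As a minimal sketch of one conversion stage, the following example walks a two-dimensional logical index space and uses two address functions as stand-ins for the first and second mapping relationships. The names `run_stage`, `read_map` and `write_map`, the byte buffers standing in for memory resources, and the use of Python's `struct` module for the element conversion are all assumptions made for illustration.

```python
import struct

FMT = {"float32": "f", "float16": "e"}   # struct format codes
SIZE = {"float32": 4, "float16": 2}      # element sizes in bytes

def run_stage(src_buf, src_type, dst_type, shape, read_map, write_map):
    """Convert every element of one stage by walking the logical index space."""
    rows, cols = shape
    dst_buf = bytearray(rows * cols * SIZE[dst_type])
    for i in range(rows):
        for j in range(cols):
            read_addr = read_map(i, j)     # logical index -> source byte address
            write_addr = write_map(i, j)   # logical index -> target byte address
            (value,) = struct.unpack_from("<" + FMT[src_type], src_buf, read_addr)
            struct.pack_into("<" + FMT[dst_type], dst_buf, write_addr, value)
    return dst_buf

# A 2 x 3 float32 source stored row-major, converted to a column-major float16 target.
src = struct.pack("<6f", 0.5, 1.0, 1.5, 2.0, 2.5, 3.0)
dst = run_stage(
    src, "float32", "float16", (2, 3),
    read_map=lambda i, j: (i * 3 + j) * 4,
    write_map=lambda i, j: (j * 2 + i) * 2,
)
result = struct.unpack("<6e", dst)
```

The loop itself never needs to know either physical layout; the layout difference between source and target is carried entirely by the two mapping functions.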

In an exemplary embodiment, during the execution of each conversion stage, stepped processing may be combined with a burst transfer mechanism for reading, converting and writing data, so that system performance is improved by optimizing the data read/write pattern. Illustratively, the above method may further comprise:

For any conversion stage, determining an iteration step for traversing the logical structure in that conversion stage and a burst transfer length for reading data from the memory resource.

In the embodiment of the application, the data access strategy can be adjusted dynamically based on hardware characteristics. For any conversion stage, an optimal iteration step may be determined based on the current hardware environment and data characteristics. For example, for a memory system supporting burst transfers, the step may be set to an integer multiple of the maximum burst transfer length. During data traversal, data is processed in batches of one fixed step at a time, which reduces the number of memory accesses.

Illustratively, the data is traversed in units of the iteration step. For the logical index (i, j) of a traversed data element, the first mapping relationship may be queried to determine the memory address corresponding to that logical index. Then, with that memory address as the start address, the corresponding number of data elements is read contiguously from physical memory according to the maximum burst transfer length supported by the current hardware. For example, with a maximum burst transfer length of 64 bytes and first data of type float32, 16 float32 data elements may be read starting from the start address.

In the current conversion stage, the second mapping relationship can be queried based on the logical index of the read data elements to determine the corresponding target memory address; the read data elements are then converted, and the converted data elements are written to the target memory address. In this way, data is read and written contiguously, the number of accesses to physical resources is reduced, and read/write efficiency is improved.
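The burst-based traversal can be sketched as follows for the 64-byte, float32 case mentioned above. The sketch assumes that both source and target are contiguous one-dimensional buffers, so that each burst needs only one start address; `convert_in_bursts` and `MAX_BURST_BYTES` are names chosen for this example only.

```python
import struct

MAX_BURST_BYTES = 64  # example maximum burst transfer length from the text

def convert_in_bursts(src_buf, count):
    """Convert `count` contiguous float32 elements to float16, one burst at a time."""
    step = MAX_BURST_BYTES // 4            # 16 float32 elements per burst
    dst_buf = bytearray(count * 2)
    bursts = 0
    for start in range(0, count, step):
        n = min(step, count - start)       # the last burst may be shorter
        values = struct.unpack_from(f"<{n}f", src_buf, start * 4)  # one read per burst
        struct.pack_into(f"<{n}e", dst_buf, start * 2, *values)    # one write per burst
        bursts += 1
    return dst_buf, bursts

samples = [k * 0.25 for k in range(40)]
src = struct.pack("<40f", *samples)
dst, bursts = convert_in_bursts(src, 40)
```

Forty elements are handled in three bursts (16, 16 and 8 elements) rather than forty individual accesses, which is the reduction in access count that the paragraph above describes.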

In an exemplary embodiment, the method may further include:

Calculating the iteration step and the burst transfer length according to the data type of the first data and the data type of the second data, the physical layout of the first data and the physical layout of the second data, and the maximum supported burst transfer length.

In the embodiment of the application, the maximum burst transfer length supported by the hardware, the data type of the first data and the physical layout of the first data may be determined. Based on the physical layout of the first data, the number of data elements whose physical addresses are contiguous starting from the currently traversed data element is determined; based on the data type of the first data, this number is converted into the number of physical bytes that can be read contiguously. The smaller of the maximum burst transfer length and this number of physical bytes is taken as the burst transfer length. After the burst transfer length is determined, the iteration step can be determined from the burst transfer length and the data type of the first data, such that the iteration step, once converted into a step in physical bytes, does not exceed the maximum supported burst transfer length, and the start address is aligned to the data type.

Illustratively, assume that the first data is a 1024 × 1024 float16 matrix and that the maximum burst transfer length supported by the GPU (Graphics Processing Unit) hardware is 256 bytes. Each data element occupies 2 bytes, and the first data is stored in memory in row-major order.

Since the first data is stored in row-major order and occupies contiguous memory, the number of elements with contiguous physical addresses starting from the currently traversed element is the number of elements remaining in the row. For example, if the traversal has reached the element in row i and column j (counting from 0), the number of physically contiguous elements is 1024 - j. Each float16 element occupies 2 bytes, so the number of physical bytes that can be read contiguously is (1024 - j) × 2.

The hardware maximum burst transfer length of 256 bytes is compared with the number of bytes that can be read contiguously, and the smaller value is taken. For example, if 128 elements remain in the current row, 256 bytes can be read contiguously, which equals the hardware maximum, so the burst transfer length is 256 bytes; if 64 elements remain (that is, 128 bytes can be read contiguously), the burst transfer length is 128 bytes.

The iteration step is then derived from the burst transfer length and the data type: iteration step = burst transfer length / element size. For example, when the burst transfer length is 256 bytes, the iteration step is 256 / 2 = 128 elements, which ensures that the data of each transfer is contiguous in physical address and does not exceed the hardware limit. Iteration proceeds in units of 128 elements: each time, 256 bytes of data are read from memory, the read data is converted, and the result is written directly to the corresponding target memory address through the second mapping relationship.
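The calculation in this worked example can be written out as a short sketch. The function name `burst_plan` and its parameters are illustrative assumptions, and the sketch covers only the row-major case discussed above.

```python
def burst_plan(row_len, j, elem_bytes, max_burst_bytes):
    """Return (burst length in bytes, iteration step in elements) at column j."""
    remaining = row_len - j                    # elements left in the current row
    contiguous_bytes = remaining * elem_bytes  # bytes readable without a gap
    burst_bytes = min(max_burst_bytes, contiguous_bytes)
    step = burst_bytes // elem_bytes           # iteration step = burst length / element size
    return burst_bytes, step

full = burst_plan(1024, 896, 2, 256)   # 128 float16 elements remain in the row
tail = burst_plan(1024, 960, 2, 256)   # 64 float16 elements remain in the row
```

Here `full` evaluates to (256, 128) and `tail` to (128, 64), matching the two cases in the text.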

In the embodiment of the application, the iteration step and the burst transfer length for the second data can likewise be determined in the write phase based on its data type and physical layout, or a longer burst transfer can be split directly into several shorter burst transfers based on the physical layout and data type of the second data.

Illustratively, assume that the first data is a 256 × 256 matrix of type float32 and the second data is of type float16. In the read phase, 64 float32 elements (256 bytes) are read per burst, and in the write phase the same 64 elements can be written as float16 (128 bytes). After the data type conversion is completed in on-chip registers, the data is written to the target memory in batches with the optimized step and burst length, which greatly reduces the overhead of the data type conversion.
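One possible way to split a write into smaller bursts according to the target layout is sketched below. The grouping rule (merge consecutive write addresses while they stay contiguous and within the burst limit) and the name `split_write_bursts` are assumptions made for illustration; the application does not specify the splitting procedure.

```python
def split_write_bursts(write_addrs, elem_bytes, max_burst_bytes):
    """Group byte addresses into (start, length) runs that are contiguous and within the limit."""
    runs = []
    for addr in write_addrs:
        if runs:
            start, length = runs[-1]
            # Extend the current run only if this element follows it directly
            # and the extended run still fits in one burst.
            if addr == start + length and length + elem_bytes <= max_burst_bytes:
                runs[-1] = (start, length + elem_bytes)
                continue
        runs.append((addr, elem_bytes))
    return runs

# 64 float16 elements written to contiguous addresses: one 128-byte burst.
contiguous = split_write_bursts([k * 2 for k in range(64)], 2, 256)
# The same elements written with a 512-byte stride (for example, down a column):
strided = split_write_bursts([k * 512 for k in range(64)], 2, 256)
```

With a contiguous target the 64 converted elements form a single write burst, whereas a strided target degenerates into 64 single-element writes, which is why the write-side layout has to be taken into account separately from the read side.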

In the optimization method of the data type conversion operator provided by the embodiment of the application, the source data and the target data of a data type conversion operation generally have different physical memory layouts (physical layout). As shown by the pink blocks in FIG. 5, the feature blocks (tracks) of the source data and the target data differ significantly in their distribution in physical memory, the hardware resources they occupy, and the way threads are mapped to them. To solve this problem, the embodiment of the present application introduces a logical layout (logical layout) abstraction and implements efficient, general type conversion operations by establishing a physical-to-logical mapping relationship.

The implementation idea of the optimization method of the data type conversion operator provided by the embodiment of the application is as follows. Whether the data type conversion operation (typecast) converts from high precision to low precision or from low precision to high precision, the logical shapes (logical shape) of the source data and the target data are always consistent, so different physical layouts can be mapped to a unified logical structure. By defining feature descriptions (track) that accurately express the mapping relationship between physical addresses and logical coordinates, a physical memory layout of any dimension can be converted into a one-dimensional or multi-dimensional structure in logical space. When the data type conversion is executed, the complete physical data space is covered simply by traversing the logical structure, so that complex physical address calculation is reduced to simple logical coordinate operations.
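The idea of mapping different physical layouts onto one logical structure can be illustrated with per-dimension strides. This is a simplified stand-in for the feature description mentioned above: the stride representation and the helper `make_mapping` are assumptions made for this sketch, not the format used in the application.

```python
from itertools import product

def make_mapping(strides, elem_bytes, base=0):
    """Build a function from a logical index to a physical byte address."""
    def address(*index):
        offset = sum(i * s for i, s in zip(index, strides))  # offset in elements
        return base + offset * elem_bytes
    return address

shape = (4, 8)
row_major = make_mapping((8, 1), elem_bytes=4)   # float32 matrix, rows contiguous
col_major = make_mapping((1, 4), elem_bytes=4)   # same logical shape, columns contiguous

# Traversing the logical index space reaches every physical element exactly once,
# whichever layout sits behind the mapping.
row_addrs = sorted(row_major(i, j) for i, j in product(range(4), range(8)))
col_addrs = sorted(col_major(i, j) for i, j in product(range(4), range(8)))
```

Both address lists cover the same 32 element slots, so a single traversal loop over (i, j) serves either layout without layout-specific address arithmetic in the loop body.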

In the implementation level, the mapping relation between the memory and the logic structure is explicitly defined through the layout information (layout) in the feature description, so that the data type conversion operation can use a feature block (track block) as a basic processing unit in combination with a stepping processing mechanism, and perform batch data processing by taking burst transmission granularity (burst granularity) as a unit. The method only needs to calculate the starting address of the current burst block, and does not need to calculate the memory address for each thread independently, so that the address calculation overhead is obviously reduced. Meanwhile, the memory access mode based on burst transmission can fully utilize a hardware caching mechanism, so that the data transmission efficiency is improved.

For different data type conversion requirements, the system can automatically select an optimal conversion strategy: when a direct conversion path exists (such as float32 -> float16), single-stage direct conversion is adopted, and when direct conversion is not possible (such as float32 -> float8), an intermediate transition data type (such as float16) is automatically introduced to perform multi-stage conversion. In each conversion stage, the system dynamically adjusts the memory access pattern according to the contiguity of the physical memory, thereby maximizing hardware bandwidth utilization and realizing a high-performance operator.

The optimization method of the data type conversion operator provided by the embodiment of the application converts the complex memory rearrangement problem into unified logic space operation through logic abstraction shielding physical layout difference, realizes the generalization capability of the data type conversion operator based on the design of characteristic description (track) and burst transmission (burst), supports arbitrary data type combination and memory layout, and remarkably improves the calculation efficiency while ensuring the conversion accuracy by an automatic conversion strategy selection and dynamic memory access optimization mechanism. Through the physical-to-logic mapping concept, programming complexity is reduced, parallel computing capacity of a modern artificial intelligent chip architecture can be fully exerted, and a high-performance type conversion operator is realized.

It should be noted that the optimization method of the data type conversion operator provided by the embodiment of the application may be applied at least to the fields of speech processing, image processing, text processing and video processing, for example in scenarios such as deep learning inference, scientific computing and graphics rendering. In a speech processing scenario, the source data and the target data are audio data of different data types; in an image processing scenario, the source data and the target data are image data of different data types; in a text processing scenario, the source data and the target data are text data of different data types; and in a video processing scenario, the source data and the target data are video data of different data types.

In one example, in the inference phase of a deep learning model (e.g., a CNN), image data of a high-precision floating point type (e.g., float32) often needs to be converted to a low-precision type (e.g., float16 or int8) to reduce memory consumption and speed up computation. Taking an RGB image of dimensions H × W × 3 as an example, the data type conversion process when it is quantized from float32 to int8 can be optimized as follows:

A contiguous memory space of H × W × 3 × 1 bytes is allocated for the target data (the int8 image), since int8 occupies 1 byte, and a three-dimensional logical index space (h, w, c) is defined, whose coordinates correspond to the height, width and channel of the image respectively.

When the hardware does not support direct conversion from float32 to int8, a two-stage conversion scheme is adopted, namely float32 -> float16 -> int8. The float16 intermediate results are stored in shared memory, whose low latency is used to reduce global memory accesses. Mapping relationships from the physical addresses of the input data, the intermediate results and the target data to the logical index are established respectively. During the execution of each conversion stage, the logical structure is traversed, the read and write addresses of the pixel currently being processed are determined from the traversed logical index through the mapping relationships, and multiple pixels are read in a batch by burst transfer. For example, 16 float32 pixels (64 bytes) are read in a batch, the type conversion operations (float32 -> float16 -> int8) are executed, and the converted int8 data (16 bytes) is written in a batch to the corresponding write address.
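The two-stage handling of one burst of 16 pixels can be sketched as follows. The text does not specify how float16 values are turned into int8, so the rounding and clamping to [-128, 127] used here is an assumed rule for illustration only; the staged byte string merely stands in for the shared-memory intermediate, and `f32_burst_to_int8` is a hypothetical name.

```python
import struct

def f32_burst_to_int8(src_buf, offset, n=16):
    """Convert one burst of n float32 values to int8 via a float16 intermediate."""
    values = struct.unpack_from(f"<{n}f", src_buf, offset)  # 64-byte read when n = 16
    staged = struct.pack(f"<{n}e", *values)                 # stage 1: float32 -> float16
    halves = struct.unpack(f"<{n}e", staged)
    # Stage 2: float16 -> int8, using an ASSUMED round-and-clamp rule.
    ints = [max(-128, min(127, round(v))) for v in halves]
    return struct.pack(f"<{n}b", *ints)                     # 16-byte write when n = 16

pixels = [float(k * 20 - 150) for k in range(16)]           # -150.0, -130.0, ..., 150.0
src = struct.pack("<16f", *pixels)
out = f32_burst_to_int8(src, 0)
quantized = list(struct.unpack("<16b", out))
```

One 64-byte read yields one 16-byte write, which is the 4:1 reduction in written bytes per burst described above. Values outside the int8 range are clamped under the assumed rule.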

By adopting the optimization method of the data type conversion operator provided by the embodiment of the application, the bandwidth utilization rate and the data conversion efficiency can be effectively improved, so as to meet the requirements of real-time image classification (such as target detection).

In another example, in speech signal processing (e.g., Fourier transform, feature extraction), it is often necessary to perform type conversion on a time-domain signal (float32 sample points), for example converting it to fixed point (int16) to suit embedded hardware, so as to reduce the conversion delay of a single-channel speech signal and support real-time speech interaction (e.g., speech recognition front-end preprocessing). Alternatively, in a video processing scenario, vertex data (e.g., float32 coordinates) may be converted to a normalized integer (e.g., uint16) in the GPU graphics rendering pipeline and written to the frame buffer to support real-time rendering. Alternatively, in a natural language processing model in a text processing scenario, word embedding vectors (float32) often need to be quantized (e.g., to bf16) to accelerate inference, or attention weights (float32) often need to be converted to a smaller storage format (e.g., float16) to support real-time inference of trillion-parameter models, which allows more layers of weights to be loaded simultaneously and improves the parallelism of the model.

As can be seen from the above scenario, the optimization method of the data type conversion operator provided by the embodiment of the application can significantly improve the data conversion efficiency in different fields by flexibly constructing the logic structure and the mapping relation and combining the hardware characteristics to adjust the data conversion strategy, thereby reducing the consumption of computing resources and having wide applicability and technical advantages.

It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiment of the application also provides an optimization device of the data type conversion operator for realizing the optimization method of the data type conversion operator. The implementation of the solution provided by the apparatus is similar to the implementation described in the above method, so the specific limitation in the embodiments of the optimizing apparatus for one or more data type conversion operators provided below may refer to the limitation of the optimizing method for a data type conversion operator hereinabove, and will not be described herein.

In an exemplary embodiment, as shown in FIG. 6, an optimization apparatus 600 of a data type conversion operator is provided, including an allocation module 602, a setup module 604, and a conversion module 606, wherein:

The allocation module 602 is configured to allocate memory resources for target data, where the target data is data obtained by performing data type conversion on source data, the source data corresponds to a first data type, and the target data corresponds to a second data type;

The establishing module 604 is configured to determine a logical structure corresponding to the source data, establish a read mapping relationship between each logical index in the logical structure and a memory address of the source data, and establish a write mapping relationship between each logical index in the logical structure and a memory address of the target data;

The conversion module 606 is configured to traverse the logical structure, determine, for any logical index traversed, a read address corresponding to the logical index based on the read mapping relationship, read data from a memory resource corresponding to the source data based on the read address, convert the read data element from the first data type to the second data type, and write the data element after the data type conversion into the memory resource corresponding to the target data according to the write mapping relationship.

The optimizing device of the data type conversion operator can allocate memory resources for target data, wherein the target data is data after data type conversion of source data, the source data corresponds to a first data type, and the target data corresponds to a second data type. Determining a logic structure corresponding to source data, establishing a read mapping relation between each logic index in the logic structure and a memory address of the source data, establishing a write mapping relation between each logic index in the logic structure and a memory address of target data, traversing the logic structure, determining a read address corresponding to the logic index for any logic index traversed based on the mapping relation, reading data from a memory resource corresponding to the source data based on the read address, converting the read data element from a first data type to a second data type, and writing the data element after the data type conversion into the memory resource corresponding to the target data according to the write mapping relation. By adopting the optimizing device of the data type conversion operator provided by the embodiment of the application, through establishing the mapping relation between the logic index and the source and target data memory addresses, each process does not need to perform address calculation in the data type conversion process, and directly queries the read-write memory address based on the mapping relation. The method greatly reduces the calculated amount, saves the chip calculation resources and effectively improves the calculation efficiency of the data type conversion operator.

In one embodiment, the method further comprises:

In one embodiment, the iterative step satisfies the following condition:

The respective modules in the optimization means of the above-described data type conversion operator may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.

In one exemplary embodiment, a computer device is provided, which may be a terminal, and an internal structure diagram thereof may be as shown in FIG. 7. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, a mobile cellular network, near field communication (Near Field Communication, NFC) or other technologies. The computer program is executed by the processor to implement a method of optimizing a data type conversion operator. The display unit of the computer device is used for forming a visual picture, and can be a display screen, a projection device or a virtual reality imaging device. The display screen can be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device can be a touch layer covering the display screen, a key, a track ball or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad or mouse.

It will be appreciated by those skilled in the art that the structure shown in FIG. 7 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.

In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.

It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are both information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to meet the related regulations.

Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile memory and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high density embedded non-volatile memory, resistive random access memory (Resistive Random Access Memory, ReRAM), magneto-resistive memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in various forms such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computation, an artificial intelligence (Artificial Intelligence, AI) processor, or the like, but is not limited thereto.

The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the present application.

The foregoing embodiments represent only several implementations of the application. Although they are described specifically and in detail, they should not be construed as limiting the scope of the application. It should be noted that those skilled in the art may make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of the application shall be determined by the appended claims.

Claims

1. A method for optimizing a data type conversion operator, characterized in that the method comprises:

Allocating memory resources for target data, wherein the target data is data obtained by converting the data type of source data, the source data corresponds to a first data type, and the target data corresponds to a second data type;

Determining a logical structure corresponding to the source data, mapping the memory address of the source data to the logical structure based on a preset mapping rule to obtain a read mapping relationship between each logical index in the logical structure and the memory address of the source data, and mapping the memory address of the target data to the logical structure based on a preset mapping rule to obtain a write mapping relationship between each logical index in the logical structure and the memory address of the target data, wherein the logical structure of the source data is constructed from the coordinates of each data element within the source data;

Traversing the logical structure and, for any traversed logical index, determining the read address corresponding to the logical index based on the read mapping relationship, reading data from the memory resource corresponding to the source data based on the read address, converting the read data element from the first data type to the second data type, and writing the data element after the data type conversion into the memory resource corresponding to the target data according to the write mapping relationship.
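By way of illustration only, the steps of claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the mapping rule is assumed to be a simple stride-based layout, the first data type is assumed to be 32-bit float and the second 32-bit integer, and the names `make_mapping` and `convert_layout` are hypothetical.

```python
import itertools
import struct


def make_mapping(strides, itemsize):
    """Build a mapping from a logical index (coordinate tuple) to a byte offset.

    Assumption: the "preset mapping rule" is a plain stride-based layout.
    """
    def to_address(index):
        return sum(i * s for i, s in zip(index, strides)) * itemsize
    return to_address


def convert_layout(src, shape, src_strides, dst_strides, src_fmt="<f", dst_fmt="<i"):
    """Traverse the logical structure, read, convert, and write each element."""
    src_size = struct.calcsize(src_fmt)
    dst_size = struct.calcsize(dst_fmt)
    count = 1
    for dim in shape:
        count *= dim

    # Allocate memory for the target data (assumes a dense target layout).
    dst = bytearray(count * dst_size)

    read_map = make_mapping(src_strides, src_size)    # read mapping relationship
    write_map = make_mapping(dst_strides, dst_size)   # write mapping relationship

    # The logical structure is the set of element coordinates.
    for index in itertools.product(*(range(d) for d in shape)):
        (value,) = struct.unpack_from(src_fmt, src, read_map(index))
        # int() truncates toward zero; the rounding mode is an assumption.
        struct.pack_into(dst_fmt, dst, write_map(index), int(value))
    return dst
```

For example, a 2x3 row-major float source (strides `(3, 1)`) can be written to a column-major integer target (strides `(1, 2)`) in a single traversal, because the read and write mappings share the same logical index.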

2. The method according to claim 1, wherein determining the logical structure corresponding to the source data, mapping the memory address of the source data to the logical structure based on a preset mapping rule to obtain a read mapping relationship between each logical index in the logical structure and the memory address of the source data, and mapping the memory address of the target data to the logical structure based on a preset mapping rule to obtain a write mapping relationship between each logical index in the logical structure and the memory address of the target data comprises:

Determining a target conversion strategy based on the first data type and the second data type, the target conversion strategy comprising at least one conversion stage, wherein a first conversion stage comprises converting the first data type into a third data type, and a last conversion stage comprises converting a fourth data type into the second data type;

Determining the logical structure corresponding to the source data and, for any of the conversion stages, determining a first mapping relationship and a second mapping relationship respectively, wherein the first mapping relationship is a mapping relationship between the logical structure and a memory address corresponding to first data in the conversion stage, the second mapping relationship is a mapping relationship between the logical structure and a memory address corresponding to second data in the conversion stage, the first data is the source data of the conversion stage, and the second data is the target data of the conversion stage.

3. The method according to claim 2, wherein determining the read address corresponding to the logical index based on the read mapping relationship, reading data from the memory resource corresponding to the source data based on the read address, converting the read data element from the first data type to the second data type, and then writing the data element after the data type conversion into the memory resource corresponding to the target data according to the write mapping relationship comprises:

Performing the data type conversion corresponding to each conversion stage in sequence, wherein, for any conversion stage, the data type conversion process comprises:

Determining a read address corresponding to the logical index based on the first mapping relationship, reading data from the memory resource corresponding to the first data based on the read address, converting the read data element from the data type corresponding to the first data to the data type corresponding to the second data, and writing the data element after the data type conversion into the memory resource corresponding to the second data according to the second mapping relationship.
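As a hedged illustration of the stage-by-stage processing in claim 3, the sketch below chains conversion stages so that the second data of one stage becomes the first data of the next. It assumes a one-dimensional logical structure for brevity; the names `run_stage` and `run_stages` and the tuple layout of a stage are hypothetical and not taken from the application.

```python
import struct


def run_stage(first, count, first_fmt, second_fmt, read_map, write_map, cast):
    """Run one conversion stage over `count` logical indices.

    read_map and write_map play the role of the first and second mapping
    relationships: each maps a logical index to a byte offset.
    """
    second = bytearray(count * struct.calcsize(second_fmt))
    for index in range(count):
        (value,) = struct.unpack_from(first_fmt, first, read_map(index))
        struct.pack_into(second_fmt, second, write_map(index), cast(value))
    return second


def run_stages(source, count, stages):
    """Perform the stages in sequence; each stage reads the previous output."""
    data = source
    for stage in stages:
        data = run_stage(data, count, *stage)
    return data
```

A two-stage int8 to int32 to float32 conversion would pass two stage tuples, each carrying its own pair of mappings, so the intermediate buffer may use a different physical layout from both the source and the final target.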

4. The method according to claim 3, further comprising:

For any of the conversion stages, determining an iteration step size for traversing the logical structure and a burst transmission length for reading data from a memory resource in the conversion stage.

5. The method according to any one of claims 2 to 4, wherein determining a target conversion strategy based on the first data type and the second data type comprises:

Searching for a mapping relationship based on the first data type and the second data type;

In the case where a mapping relationship from the first data type to the second data type is found, determining that the target conversion strategy is a direct conversion strategy, wherein the direct conversion strategy includes one conversion stage;

Or, in the case where no mapping relationship from the first data type to the second data type is found, determining that the target conversion strategy is a multi-stage conversion strategy, wherein the multi-stage conversion strategy includes at least two conversion stages.
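The strategy selection in claim 5 can be pictured with the following sketch. It is illustrative only: the table of directly supported conversions is invented for the example, and the breadth-first search used to choose intermediate data types is an assumption, since the claim does not specify how the stages of a multi-stage strategy are selected. The name `plan_conversion` is hypothetical.

```python
from collections import deque

# Hypothetical table of directly supported conversions (not from the application).
DIRECT = {
    ("int8", "int32"),
    ("int32", "float32"),
    ("float32", "float16"),
    ("float16", "float32"),
    ("float32", "int32"),
}


def plan_conversion(src_type, dst_type, direct=DIRECT):
    """Return the list of (from_type, to_type) conversion stages."""
    # Direct conversion strategy: a mapping is found, so one stage suffices.
    if (src_type, dst_type) in direct:
        return [(src_type, dst_type)]

    # Multi-stage strategy: search for the shortest chain of direct conversions.
    queue = deque([(src_type, [])])
    seen = {src_type}
    while queue:
        current, stages = queue.popleft()
        for a, b in sorted(direct):
            if a != current or b in seen:
                continue
            path = stages + [(a, b)]
            if b == dst_type:
                return path
            seen.add(b)
            queue.append((b, path))
    raise ValueError(f"no conversion path from {src_type} to {dst_type}")
```

With this table, `plan_conversion("int32", "float32")` yields a single stage, while `plan_conversion("int8", "float16")` yields three stages passing through `int32` and `float32`.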

6. The method according to claim 4, wherein determining the iteration step size for traversing the logical structure and the burst transmission length for reading data from the memory resource in the conversion stage comprises:

Calculating the iteration step size and the burst transmission length according to the data type of the first data and the data type of the second data, the physical layout of the first data and the physical layout of the second data, and the maximum length supported by the burst transmission.

7. The method according to claim 6, wherein the iteration step size satisfies the following conditions:

The iteration step size, once converted into a physical byte step, does not exceed the maximum length supported by the burst transmission, and the starting address is aligned according to the data type;

And the burst transmission length is the minimum of the maximum length supported by the burst transmission and the number of physical bytes of contiguous data in the current iteration window.
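The conditions of claim 7 amount to simple arithmetic, sketched below. The function names are hypothetical, and `choose_step` encodes one possible reading: it assumes the byte step is measured against the wider of the two element types so that a step fits within one burst on both the read and the write side. The claim itself only states the byte-step bound, the alignment requirement, and the minimum rule.

```python
def choose_step(max_burst_bytes, src_itemsize, dst_itemsize, contiguous_elems):
    """Pick an iteration step (in elements) whose byte footprint fits one burst.

    Assumption: the wider of the source and target element sizes is the limit.
    """
    widest = max(src_itemsize, dst_itemsize)
    step = min(contiguous_elems, max_burst_bytes // widest)
    if step < 1:
        raise ValueError("element is larger than the maximum burst length")
    return step


def burst_length(max_burst_bytes, contiguous_bytes):
    """Burst length: the smaller of the supported maximum and the number of
    contiguous physical bytes in the current iteration window."""
    return min(max_burst_bytes, contiguous_bytes)


def is_aligned(address, itemsize):
    """Check that a starting address is aligned to the element size."""
    return address % itemsize == 0
```

For instance, with a 64-byte maximum burst, a 2-byte source type, and a 4-byte target type, the step is 16 elements: reading 16 source elements takes a 32-byte burst, and the 64 bytes written on the target side still fit within one burst.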

8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, wherein the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.

9. A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 7 are implemented.

10. A computer program product, comprising a computer program, wherein when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 7 are implemented.