Detailed Description
In order that those skilled in the art may better understand the technical solution of the present invention, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments. The terms "first," "second," "third," "fourth," and the like in the description and in the above figures are used to distinguish between different objects and not necessarily to describe a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," and any variations thereof, are intended to cover a non-exclusive inclusion. The term "exemplary" means "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
With the rapid iterative development of multimodal artificial intelligence applications, the computing paradigm is undergoing structural change. Massive data processing requirements and the growing scale of neural networks drive each other, giving rise to computing hardware systems with diversified architectures, and heterogeneous computing systems that integrate multiple kinds of computing units are widely used in intelligent computing scenarios.
A heterogeneous computing system achieves complementary advantages by integrating computing units of different architectures. For example, a general-purpose CPU (Central Processing Unit) performs logic control and task scheduling; parallel computing units such as a GPU (Graphics Processing Unit) or a TPU (Tensor Processing Unit) perform intensive tensor operations; a reconfigurable FPGA (Field Programmable Gate Array) adapts to dynamic algorithm requirements; and an ASIC (Application Specific Integrated Circuit) is customized for a specific scenario to achieve an optimal energy efficiency ratio. Through a dynamic resource allocation mechanism, such a system can meet the differing requirements of scenarios such as image recognition and semantic understanding, and can balance power consumption against performance according to the workload. As the demand for cooperation between edge computing and cloud computing grows, distributed computing nodes can be allocated dynamically by building processor-level interconnects and virtualized resource pools, and such systems are widely applied in fields such as the deployment of ultra-large-scale pre-trained models and real-time video analysis. Taking a multi-element heterogeneous computing system as an example, such a system integrates a variety of heterogeneous processors, accelerators and memory components, and is generally composed of computing units that differ significantly in performance, power consumption, programming model, memory consistency model and the like. It is subject to the "memory wall" effect, in which the data throughput demanded by the compute cores far exceeds the bandwidth that the memory can supply.
To avoid a serious performance imbalance in the multi-element heterogeneous computing system, heterogeneous computing systems are optimized at the physical layer, the logic layer and the control layer. For example, a hierarchical storage topology comprising a register file, a shared buffer and a global memory is built at the physical layer; an asymmetric memory access model based on a NUMA (Non-Uniform Memory Access) architecture is implemented at the logic layer; and an intelligent prefetching algorithm and a dynamic bandwidth allocation mechanism are deployed at the control layer. However, because the memory architecture semantics and communication modes of the computing units differ greatly, a heterogeneous computing system suffers from inconsistent cache data states. That is, when several computing units simultaneously hold copies of the same data in their private caches, a modification by any one computing unit can cause a logical fault in the global storage space: the data in the caches of different computing units become inconsistent with each other, or the cached data become inconsistent with the memory data. This is particularly serious under concurrent execution. For example, if a GPU streaming multiprocessor updates a shared parameter and the update is not synchronized to the last-level cache of the CPU in time, the computed results of the algorithm may deviate, and in severe cases system-level execution errors may occur.
To ensure stable operation of the heterogeneous computing system, the related art achieves cache consistency through protocol stack innovation and hardware co-design. At the software protocol layer, the related art uses a bus snooping protocol to maintain cache consistency under read-intensive workloads with infrequent writes, and a directory protocol to maintain cache consistency under workloads with frequent data updates. In the bus snooping protocol, the private cache of each processing unit continuously snoops bus transactions. When a consistency request broadcast on the bus is detected, predefined response logic is triggered according to the state of the local cache line. Because the physical characteristics of the bus medium guarantee the atomicity and ordering of transactions, the state transition logic can be reduced to a deterministic finite state machine, and cache states can be synchronized with relatively little difficulty through global broadcast. However, as multi-core processors scale up, the bus load increases, and because only one CPU can occupy the bus at a time, memory access requests are frequently blocked, which significantly increases access latency and wastes resources. The bus snooping protocol also scales poorly and is difficult to adapt to large-scale processor systems. In other words, the bus snooping protocol is effective, at relatively low difficulty, for maintaining the cache consistency of small-scale heterogeneous computing systems.
The directory protocol achieves cache consistency through a global common directory that records the state information of all cache lines, the global common directory comprising a consistency state and a list of copy owners. In the directory protocol, all consistency messages are forwarded through the directory structure, and efficient point-to-point message transmission is possible by querying the copy owner list, which reduces network communication overhead. In addition, the directory protocol manages copies of shared data effectively: by establishing a directory that tracks where data are stored and in what state, it makes it convenient for multiple processors to access shared data and improves data availability and access efficiency. Through centralized directory management, a large-scale multi-core processor system can thus coordinate cache interactions effectively and maintain cache consistency efficiently and reliably. However, both storing and updating the directory consume resources, which can affect overall system performance. The directory protocol may also introduce additional communication delay that slows data access, which is unfavorable for large-scale heterogeneous computing systems that have strict real-time requirements and limited resources.
For cooperative maintenance of cache consistency across the software protocol layer and the hardware layer, one related technique maintains the cache consistency of a multi-element heterogeneous computing system based on a snoop-update strategy. A translator module converts the memory consistency model within each cluster into a unified consistency protocol based on the C11 memory consistency model, and a consistency controller module acts as the consistency protocol manager: it receives the messages sent by each translator and performs unified reordering and routing of those messages, and the relaxed ordering behavior of the heterogeneous clusters is used to make the various heterogeneous memory consistency models compatible across clusters. Another related technique is a compound consistency protocol fusion method for heterogeneous clusters, which follows a global compound consistency model while preserving the consistency model of each original cluster, and uses proxy caches to fuse the consistency models of different clusters so as to guarantee the consistency of the whole system. A further related technique provides a litmus test scheme for the compound consistency model, which uses formal verification tools to check whether a heterogeneous protocol satisfies the compound consistency model. These three methods may introduce additional consistency maintenance overhead, which can affect system performance. Yet another related technique uses CXL (Compute Express Link, a high-speed interconnect standard) to support atomic operations across device caches by unifying memory semantics at the protocol level, but this approach remains at the level of technical planning and standardization, and its industrial adoption lags behind.
Although the related art can alleviate memory access latency to some extent, it still has several problems. The update protocol requires a fixed global directory, and the protocol strategy cannot be optimized dynamically, so flexibility is poor and performance suffers under complex workloads. Protocol synthesis relies on static cluster configuration and lacks runtime adaptability; the protocol cannot be adjusted dynamically according to the system state, so performance under different workloads is less than ideal and diverse heterogeneous architectures cannot be accommodated effectively. Where protocols are switched to follow the dynamic characteristics of the task load, the protocol is usually selected by a simple read-write ratio threshold, which cannot accurately capture complex cross-device data dependencies and makes effective protocol optimization difficult.
In view of this, the present invention abstracts the cache consistency protocol into three modes, namely a metadata index protocol, a multicast synchronization protocol and a hybrid response protocol, which correspond to a high-write scenario, a high-read scenario and a dynamically balanced scenario, respectively. Entropy weights of the different feature dimensions are determined from the feature values of the data blocks of the heterogeneous computing system in the current scenario, and whether the current scenario is a high-read or a high-write load scenario is sensed from the degree to which the entropy weights discriminate between protocols, so that the consistency protocol that best matches the current task load state of the heterogeneous computing system can be selected accurately. The specific application environment or hardware architecture on which execution of the cache consistency maintenance method depends is described here. Some possible application scenarios of the technical solution of the present invention are described below by way of example with reference to FIG. 1, and may include the following:
In this embodiment, the heterogeneous computing system includes a plurality of servers 1, each server 1 being a computing node of the heterogeneous computing system. Each server includes at least two computing units 10: in addition to its own computing unit, such as a CPU, it contains at least one computing unit whose type differs from the CPU architecture, such as a GPU, an FPGA or an ASIC. The communication structure of the heterogeneous computing system is as follows. All computing units of all computing nodes are divided by computing unit type, and the computing units of the same type form one cluster. The computing units within a cluster may be connected in a tree topology, with one computing unit selected as the master computing unit. Different clusters interact through their master computing units, and the master computing units connect all clusters in a ring topology.
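By way of non-limiting illustration, the communication structure described above may be sketched as follows. The function name `build_topology`, the choice of the first unit in each cluster as the master computing unit, and the use of a star as the simplest tree are assumptions made for this sketch only and are not taken from the embodiment.

```python
from collections import defaultdict

def build_topology(units):
    """Group computing units by type and link the clusters.

    units: list of (unit_name, unit_type) pairs.
    Returns (clusters, masters, tree_links, ring_links).
    """
    clusters = defaultdict(list)
    for name, kind in units:
        clusters[kind].append(name)

    # Assumption: the first unit seen in each cluster acts as its master.
    masters = {kind: members[0] for kind, members in clusters.items()}

    # Assumption: a star rooted at the master is used as the simplest tree.
    tree_links = {
        kind: [(masters[kind], member) for member in members[1:]]
        for kind, members in clusters.items()
    }

    # The masters of all clusters are joined in a ring (sorted for determinism).
    kinds = sorted(clusters)
    ring_links = [
        (masters[kind], masters[kinds[(i + 1) % len(kinds)]])
        for i, kind in enumerate(kinds)
    ]
    return dict(clusters), masters, tree_links, ring_links
```

With a single cluster the ring degenerates to a self-link on that cluster's master, which a real implementation would presumably omit.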
To solve the problems in the related art, such as the difficulty of maintaining the consistency of a heterogeneous computing system and the high complexity of the protocols, and to achieve dynamic cache consistency for the heterogeneous computing system while keeping the maintenance process continuously efficient, the present invention provides a dynamic consistency maintenance method based on a hybrid of bus snooping and directory protocols. A cache consistency controller containing the computer program code that implements the following steps may be built into any one of the servers or into a cloud server, and a data collection component is built into each server to collect data such as the number of read and write operations and the read and write times. Through the cache consistency controller, the cache consistency protocol of the heterogeneous computing system is redefined into a metadata index protocol, a multicast synchronization protocol and a hybrid response protocol. The metadata index protocol is a protocol in which each computing unit contains directory information storing its local metadata, and a target data block whose write frequency is greater than a preset frequency threshold has central directory information. The multicast synchronization protocol is a protocol in which each computing unit performs the corresponding operation according to what it hears on a multicast channel, with the multicast range adjusted dynamically according to the cache hit rate. The hybrid response protocol is a fusion of the directory protocol and the broadcast protocol.
Entropy weights of the different feature dimensions are determined from the feature values of each data block of the heterogeneous computing system, and the protocol that best matches the task load state at the current moment is determined according to those entropy weights. The consistency maintenance protocol is thereby switched effectively on the basis of the dynamic characteristics of the task load, ensuring that the heterogeneous computing system maintains high performance.
It should be noted that the above application scenario is shown only to facilitate understanding of the idea and principle of the present invention, and the embodiments of the present invention are not limited in this respect. Rather, embodiments of the invention may be applied to any applicable scenario. Having described the technical solution of the invention, various non-limiting embodiments of the invention are described in detail below with reference to the drawings and specific embodiments.
The communication modes and consistency models of computing units of different architectures in a heterogeneous computing system differ fundamentally. Because a conventional CPU requires strict memory ordering and low-latency access, it adopts an invalidation mechanism triggered by read and write operations, such as MESI (Modified, Exclusive, Shared, Invalid; a classical cache consistency protocol), which guarantees data consistency by strictly enforcing the Single-Writer Multiple-Reader (SWMR) invariant. Massively parallel architectures such as GPUs adopt a more relaxed consistency model. Their self-invalidation protocols do not enforce SWMR and allow cache states to be maintained asynchronously among compute cores; relaxing the consistency constraints increases data throughput, but it also opens a semantic gap between their communication mechanism and that of the CPU. The protocol specifications of the different computing units are mutually incompatible, which increases the complexity of synchronizing cache states across architectures, and the tension between strict consistency guarantees and parallel computing efficiency makes the cache consistency protocol of a heterogeneous computing system harder to maintain.
In addition, the multi-level heterogeneity, differing communication mechanisms and asymmetric concurrency control models of a heterogeneous computing system make its data conflict and deadlock problems more complex. First, the CPU adopts a strong consistency memory model and relies on a fine-grained cache consistency protocol to maintain data visibility, whereas accelerators such as the GPU adopt a weak consistency model and coordinate data through batch synchronization or explicit barriers. This split between memory models makes it difficult to build globally uniform data synchronization logic for cross-device access; for example, an atomic operation by the CPU on a shared variable may not be correctly perceived by the GPU, forming an implicit data race. Second, the heterogeneity of concurrency control mechanisms aggravates the risk of conflicts. The CPU relies on fine-grained synchronization primitives such as locks and semaphores, while the GPU, under the SIMT (Single Instruction, Multiple Threads) architecture, uses thread group synchronization (for example, warp-level synchronization) or global fences, and the interaction of the two may create asymmetric wait relationships. For example, a CPU thread may be preempted by a GPU kernel while holding a lock, and a GPU thread group may then deadlock while waiting for the CPU to release the resource. Furthermore, heterogeneous computing systems often use non-uniform memory architectures; the latency differences of data transfers across levels render conventional deadlock detection algorithms ineffective, and contention for the communication links between devices may trigger distributed deadlocks.
In addition, fragmented debugging tools make problems hard to trace: the debug interfaces and performance counters of the CPU and the GPU are incompatible, so abnormal states in a cross-device dependency chain are difficult to capture. In summary, a heterogeneous computing system exhibits complex cross-protocol, cross-level and cross-space characteristics, and conventional single-architecture solutions cannot be migrated to it directly. This embodiment therefore provides a method for achieving dynamic cache consistency in a heterogeneous computing system while keeping the maintenance process continuously efficient. Referring to FIG. 2, which is a schematic flowchart of the cache consistency maintenance method provided in this embodiment, the method may include the following:
S201: redefining the cache consistency protocol of the heterogeneous computing system into a metadata index protocol, a multicast synchronization protocol and a hybrid response protocol.
The heterogeneous computing system may be a multi-element heterogeneous computing system, a heterogeneous computing system that is not multi-element, or any other type of heterogeneous computing system, as long as it meets the definition of a heterogeneous computing system; no limitation is imposed in this respect. Heterogeneous and homogeneous computing systems differ in how cache consistency is maintained. A homogeneous computing system is a system composed of computing units that use the same type of instruction set and architecture. Taking a homogeneous computing system represented by a multi-core CPU as an example, as shown in FIG. 3, the homogeneous computing system relies on a shared memory model to share a unified memory address space, and uses cache consistency protocols such as MESI and MOESI (Modified, Owned, Exclusive, Shared, Invalid; a cache consistency protocol) to keep the cached data of different processor cores consistent. The homogeneous computing system guarantees the consistency of cached data by means of a hardware-supported bus snooping consistency protocol, in which the cache controller of each core monitors all memory access requests on the bus. When one core, for example core 1, modifies data in its cache, the modification request is sent to the bus, and on detecting the request the cache controllers of the other cores update or invalidate their own cached copies as required. This cache consistency maintenance method is simple to implement and has a low hardware cost, but as the number of cores increases, bus bandwidth becomes the system performance bottleneck, so the method is not suitable for large-scale multi-core computing systems.
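By way of illustration only, the write path of a bus snooping protocol described above may be sketched as follows. The sketch uses a simplified three-state (Modified, Shared, Invalid) write-invalidate scheme rather than full MESI or MOESI, and all class and function names (`Bus`, `SnoopingCache`, `demo_snooping`) are hypothetical.

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"

class Bus:
    """Shared bus: every attached cache observes every broadcast write request."""
    def __init__(self):
        self.caches = []
        self.broadcasts = 0

    def broadcast_write(self, sender, addr):
        self.broadcasts += 1
        for cache in self.caches:
            if cache is not sender:
                cache.snoop_write(addr)

class SnoopingCache:
    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.lines = {}  # addr -> (State, value)
        bus.caches.append(self)

    def load(self, addr, value):
        # Install a shared copy; the fetch from main memory is not modeled.
        self.lines[addr] = (State.SHARED, value)

    def write(self, addr, value):
        # Announce the modification on the bus, then take the line as Modified.
        self.bus.broadcast_write(self, addr)
        self.lines[addr] = (State.MODIFIED, value)

    def snoop_write(self, addr):
        # Another core modified this address: invalidate the local copy.
        if addr in self.lines:
            _, stale = self.lines[addr]
            self.lines[addr] = (State.INVALID, stale)

    def state(self, addr):
        return self.lines[addr][0] if addr in self.lines else State.INVALID

def demo_snooping():
    bus = Bus()
    core1, core2, core3 = (SnoopingCache(f"core{i}", bus) for i in (1, 2, 3))
    for core in (core1, core2, core3):
        core.load(0x40, 7)
    core1.write(0x40, 8)
    return core1.state(0x40), core2.state(0x40), core3.state(0x40), bus.broadcasts
```

Every write reaches every attached cache through the single bus, which is why the broadcast count grows with write traffic and the bus becomes the bottleneck as cores are added.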
When a homogeneous computing system processes a read request from a core, the cache consistency maintenance flow comprises three stages. In the local cache check stage, the data are returned directly if the local cache holds a valid copy, and a read request is sent to the bus if the local cache misses. In the bus broadcast snooping stage, the bus broadcasts the read request to all cores, and the cache controllers of the other cores check the state of their own local caches. In the data acquisition stage, the main memory or the cache holding the latest data returns the data over the bus, and the core that originally issued the request receives the data and marks the state of the cache line, completing the read request. For a heterogeneous computing system, as shown in FIG. 4, a local consistency directory controller (local coherence controller, LCC) and a protocol converter are deployed inside each computing unit of the computing nodes of the heterogeneous computing system, such as computing node A and computing node B, and at the system level a global consistency directory controller (global coherence controller, GCC) is relied upon for cross-device coordination. In the local request processing stage, when the LCC receives a consistency request, request parsing and response are completed within the computing node to which the LCC belongs; when local resources cannot satisfy the request, for example because of a cache line state conflict or missing data, the request forwarding mechanism is triggered. In the global coordination stage, a request that could not be completed locally is routed to the GCC in accordance with the cross-device routing rules, and the GCC distributes the request to the LCC of the target computing node according to the topology protocol.
In the cross-device response stage, after the target computing node finishes processing the request, a consistency response message is generated and returned to the GCC, and the GCC forwards the response to the LCC of the original requester according to the transaction identifier, closing the transaction. The cache consistency of a heterogeneous computing system therefore depends on the functions of the GCC interface and the protocol converter. The GCC interface needs to support instruction set mapping for heterogeneous consistency protocols and to achieve multi-protocol compatibility through a standardized transaction interface. The protocol converter needs to convert between local and global protocol states, for example by converting local protocol primitives into the global transaction format, performing semantic parsing of global response messages and updating the local state accordingly, and maintaining a protocol conversion state machine that guarantees the atomicity of the conversion. Thus the bus-based consistency maintenance protocol of the homogeneous computing system has a high communication overhead and suits small-scale clusters, whereas the directory protocol of the heterogeneous computing system scales well, has a low communication overhead and suits large-scale clusters; each of the two cache consistency protocols has its own strengths and weaknesses. The present method therefore combines the advantages of the bus snooping protocol and the directory protocol, and abstracts the cache consistency protocol according to the read-write ratio, that is, according to the task load, and the characteristics of the heterogeneous computing system, to obtain three protocol modes, each suited to a different task load scenario. As shown in Table 1, the protocol modes are switched flexibly and cooperate with one another to carry out the cache consistency maintenance of the heterogeneous computing system.
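The local request processing, global coordination and cross-device response stages may be sketched, purely as an illustration, as follows. The classes `LCC` and `GCC`, the `home` routing table and the dictionary-shaped response are assumptions of this sketch; protocol conversion and cache line states are not modeled.

```python
from dataclasses import dataclass, field
from itertools import count

@dataclass
class LCC:
    """Local controller of one computing node (hypothetical model)."""
    node: str
    lines: dict = field(default_factory=dict)  # addr -> data held locally

    def handle(self, addr):
        # Local request processing: return the data, or None when the
        # request cannot be satisfied locally and must be forwarded.
        return self.lines.get(addr)

class GCC:
    """Global controller: routes unresolved requests and matches each
    response to its requester by transaction identifier."""
    def __init__(self, lccs, home):
        self.lccs = {lcc.node: lcc for lcc in lccs}
        self.home = home          # assumed routing table: addr -> node name
        self.pending = {}         # transaction id -> requesting node
        self._ids = count(1)

    def request(self, requester, addr):
        txn = next(self._ids)
        self.pending[txn] = requester
        target = self.lccs[self.home[addr]]   # global coordination stage
        data = target.handle(addr)            # cross-device response stage
        origin = self.pending.pop(txn)        # close the transaction
        return {"txn": txn, "to": origin, "from": target.node, "data": data}

def read(lcc, gcc, addr):
    local = lcc.handle(addr)
    if local is not None:
        return {"txn": None, "to": lcc.node, "from": lcc.node, "data": local}
    return gcc.request(lcc.node, addr)
```

A request that hits locally never reaches the GCC, so only misses pay the cost of cross-device routing.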
The metadata index protocol is implemented in directory mode. It is based on the directory protocol and has a lightweight directory tree that adopts a hierarchical directory structure rather than a global directory. Each computing unit maintains its own metadata, such as cache state and version number; that is, each computing unit stores the directory information of its local metadata in its own directory, where local metadata means the metadata of that computing unit itself. The directory information may take the form of a directory table, which records, for the local private cache of the computing unit, the state of each cache line, whether reads and writes hit the cache, the version number and the like. The central directory information may be deployed in any one of the computing units, in the computing unit with the best performance, or in a computing unit designated by the user, such as the computing unit holding the data blocks with a high write frequency; that is, a target data block whose write frequency is greater than the preset frequency threshold has central directory information. The central directory information records at least the correspondence between each piece of directory information and its computing unit and the storage address of the data, in other words information that reflects the global caching situation. The directory information of each computing unit and the central directory information together form the lightweight directory tree of the metadata index protocol. Through the lightweight directory tree, the metadata index protocol queries the directory only when necessary and so avoids broadcast storms, and the hierarchical directory reduces global lock contention, reducing write latency by 15% to 20%. Compared with a conventional broadcast protocol, it generates no broadcast traffic, so the network load is more stable.
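A minimal sketch of such a lightweight directory tree is given below for illustration. It assumes that write frequency is approximated by a running write count per data block, that the threshold value is 3, and that the central entry simply lists the computing units whose local directories track the block; these choices, and the names `LocalDirectory` and `DirectoryTree`, are hypothetical.

```python
from collections import defaultdict

WRITE_FREQ_THRESHOLD = 3  # hypothetical preset frequency threshold

class LocalDirectory:
    """Per-unit directory table holding that unit's own metadata."""
    def __init__(self, unit):
        self.unit = unit
        self.table = {}  # block -> {"state": str, "version": int}

    def record_write(self, block):
        entry = self.table.setdefault(block, {"state": "I", "version": 0})
        entry["state"] = "M"
        entry["version"] += 1
        return entry["version"]

class DirectoryTree:
    def __init__(self, units):
        self.local = {u: LocalDirectory(u) for u in units}
        self.write_counts = defaultdict(int)
        self.central = {}  # block -> set of units tracking the block

    def write(self, unit, block):
        self.local[unit].record_write(block)
        self.write_counts[block] += 1
        # Only blocks written more often than the threshold get a central entry.
        if self.write_counts[block] > WRITE_FREQ_THRESHOLD:
            self.central[block] = {
                u for u, d in self.local.items() if block in d.table
            }

    def lookup(self, block):
        # Use the central entry when one exists; otherwise fall back to
        # scanning the local directories.
        if block in self.central:
            return sorted(self.central[block]), "central"
        holders = sorted(u for u, d in self.local.items() if block in d.table)
        return holders, "local"
```

Keeping central entries only for frequently written blocks is what keeps the tree lightweight in this sketch: rarely written blocks never consume central directory storage.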
The multicast synchronization protocol is suited to high-read load scenarios. By maximizing the cache hit rate, it lets the local cache respond directly to read requests, which reduces the number of metadata queries and the metadata verification delay and so effectively reduces read latency. The multicast synchronization protocol is a protocol in which each computing unit performs the corresponding operation according to what it hears on the multicast channel, with the multicast range adjusted dynamically according to the cache hit rate. First, the protocol uses probabilistic multicast, meaning that the multicast range is adjusted dynamically according to the cache hit rate of the computing units. Second, it uses snooping-based cache verification, meaning that the computing units within the multicast range listen on the multicast channel and perform the corresponding operation after receiving an update notification. Compared with a conventional directory protocol, a read operation under the multicast synchronization protocol requires no directory query and so has lower delay, reducing read latency by 25% to 30%. Compared with a conventional full broadcast protocol, probabilistic multicast reduces redundant traffic, lowering the network load by 10% to 15%, and optimizes bandwidth, raising bandwidth utilization by 20%. The hybrid response protocol is a fusion of the directory protocol and the broadcast protocol that dynamically balances mixed read-write loads and adapts to rapid load changes. Compared with a conventional directory protocol, it fuses the directory and broadcast mechanisms dynamically and avoids the limitations of a single protocol; compared with a conventional broadcast protocol, it supports switching on demand and has stronger compatibility.
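The probabilistic multicast and snooping-based verification described above may be illustrated by the following sketch. The embodiment does not state how the multicast range is derived from the cache hit rate; the rule used here, in which a computing unit joins the range with probability equal to its hit rate, is purely an assumption, as are the names `multicast_range`, `ListeningUnit` and `publish_update`.

```python
import random

def multicast_range(hit_rates, rng):
    """Choose which units receive an update notification.

    Assumption: a unit joins the range with probability equal to its
    cache hit rate (a value in [0, 1]).
    """
    return [unit for unit, rate in sorted(hit_rates.items()) if rng.random() < rate]

class ListeningUnit:
    """Computing unit that listens on the multicast channel."""
    def __init__(self, name):
        self.name = name
        self.cache = {}  # block -> version held locally

    def on_update(self, block, version):
        # Snooping-style verification: drop the local copy if it is stale.
        if block in self.cache and self.cache[block] < version:
            del self.cache[block]

def publish_update(units, hit_rates, block, version, rng):
    """Send an update notification to the units inside the multicast range."""
    targets = multicast_range(hit_rates, rng)
    for name in targets:
        units[name].on_update(block, version)
    return targets
```

A seeded `random.Random` instance can be passed as `rng` to make the selection reproducible when experimenting with different hit rates.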
Table 1. Comparison of the three protocols with conventional protocols
S202: determining, from among the metadata index protocol, the multicast synchronization protocol and the hybrid response protocol, the protocol that best matches the task load state at the current moment, according to entropy weights of different feature dimensions determined from the feature values of each data block of the heterogeneous computing system.
The feature values include at least values relating to read-write operation features and values relating to data heat features, and the feature dimensions include at least the read-write operation features and the data heat features. Weighting multiple feature dimensions avoids the bias of any single feature index and improves adaptability to the complex loads of heterogeneous computing systems. A person skilled in the art may use any objective weighting method based on information entropy theory to measure how strongly the read-write operation features and the data heat features differ among the metadata index protocol, the multicast synchronization protocol and the hybrid response protocol, and to assign different weights to different features. The lower the entropy value, the more strongly the feature dimension discriminates between protocols; that is, the magnitudes of the entropy weights of the different features distinguish different task load states. Because the three protocols are defined by task load state, once the entropy weights of the different feature dimensions are determined, the optimal consistency protocol is the protocol corresponding to the task load state so identified. If the cache consistency protocol currently used by the heterogeneous computing system is the same as the selected optimal cache consistency protocol, no protocol adjustment is needed; if it is different, the system switches to the optimal cache consistency protocol selected in this step, so that the cache consistency protocol mode is adjusted dynamically and effectively.
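The final comparison and switching step may be sketched as follows, for illustration only. How the optimal protocol is derived from the entropy weights is outside this sketch; the optimal protocol is simply passed in, and the names `Protocol` and `ConsistencyController` are hypothetical.

```python
from enum import Enum

class Protocol(Enum):
    METADATA_INDEX = "metadata index"    # high-write scenario
    MULTICAST_SYNC = "multicast sync"    # high-read scenario
    HYBRID_RESPONSE = "hybrid response"  # dynamically balanced scenario

class ConsistencyController:
    """Keeps the active protocol and switches only when the selection changes."""
    def __init__(self, initial):
        self.current = initial
        self.switches = 0

    def apply(self, best):
        # No adjustment is needed when the selected protocol is already active.
        if best is self.current:
            return False
        self.current = best
        self.switches += 1
        return True
```

Returning a flag from `apply` lets a caller trigger any migration work only when a switch actually occurred.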
In the technical solution provided by this embodiment, the cache consistency protocol is abstracted into three modes, namely a metadata index protocol, a multicast synchronization protocol and a hybrid response protocol, which correspond to a high-write scenario, a high-read scenario and a dynamically balanced scenario, respectively. Entropy weights of the different feature dimensions are determined from the feature values of the data blocks of the heterogeneous computing system in the current scenario, and the uncertainty of the load features is quantified through the entropy weights: the lower the entropy value, the more strongly the feature dimension discriminates between protocols. Whether the current scenario is a high-read or a high-write load scenario can be sensed from the entropy weights of the read-write operation features, and the hot spot distribution and time sensitivity of the data blocks can be evaluated dynamically through the entropy weights of the data heat features. Multidimensional load sensing is thus achieved for the heterogeneous computing system, the consistency protocol that best matches its current task load state can be selected accurately, dynamic cache consistency is achieved, and the maintenance process remains continuously efficient, ensuring efficient operation of the heterogeneous computing system.
The above embodiment does not limit how the entropy weights of the different feature dimensions are determined. On that basis, this embodiment further provides a process for calculating the entropy weights of the different feature dimensions from the feature values of the data blocks of the heterogeneous computing system, which may include the following steps: determining the feature probability distribution information of the current data block on the current feature as the ratio of the original feature value of the current data block on the current feature to the sum of the original feature values of all data blocks of the heterogeneous computing system on the current feature; determining the information entropy value of the current feature from the feature probability distribution information of all data blocks of the heterogeneous computing system on the current feature; and determining the entropy weight of each feature from the proportional relation between the information entropy value of that feature and the information entropy values of all features.
The features may be, for example, a read operation feature, a write operation feature, an access frequency feature and a local attenuation factor feature. The current feature may be any feature whose entropy weight needs to be calculated, and the current data block is any one of the data blocks of the heterogeneous computing system. The original feature value is the specific value of the current feature for a data block at the current time; for example, if the current feature is the read operation feature and data block A has been read 8 times at the current time, the original feature value of the read operation feature of data block A is 8. The feature probability distribution information measures the normalized share of a given data block in a given feature dimension, and ensures that the shares of all data blocks of the heterogeneous computing system in the same feature dimension sum to 1. For example, the feature probability distribution information $p_{ij}$ of the current data block $B_i$ on the current feature j can be expressed as:
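$$p_{ij} = \frac{x_{ij}}{\sum_{k=1}^{n} x_{kj}};$$
wherein $x_{ij}$ is the original feature value of data block $B_i$ on the current feature j, $p_{ij}$ is the feature probability distribution information of data block $B_i$ on the current feature j, and n represents the total number of data blocks.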
An exemplary way of calculating the information entropy value of each feature is to calculate, for each data block, the product of its feature probability distribution information on the current feature and the logarithm of that feature probability distribution information, and to take the negative of the sum of these products over all data blocks of the heterogeneous computing system as the information entropy value of the current feature. As an efficient calculation method, an information entropy value calculation relation may be stored in advance and called to calculate the information entropy value of each feature in turn, where the information entropy value calculation relation may be expressed as:
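$$E_j = -\sum_{i=1}^{n} p_{ij} \log p_{ij};$$
wherein $E_j$ is the information entropy value of the feature dimension j, $p_{ij}$ is the feature probability distribution information of the i-th data block on the feature j, and n is the total number of data blocks.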
An exemplary way of calculating the entropy weight of the current feature is to calculate, for each feature, the difference between a target value and the information entropy value of that feature, and to take the ratio of the difference corresponding to the current feature to the sum of the differences corresponding to all features as the entropy weight of the current feature. As an efficient calculation method, an entropy weight calculation relation may be stored in advance and called to calculate the entropy weight of each feature in turn, where the entropy weight calculation relation may be expressed as:
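$$w_j = \frac{1 - E_j}{\sum_{m=1}^{M} (1 - E_m)};$$
wherein $w_j$ is the entropy weight of the feature dimension j, $E_j$ is the information entropy value of the feature dimension j, M is the total number of features, m indexes the M features, and the target value is taken as 1.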
As can be seen from the foregoing, this embodiment first calculates the feature probability distribution information, which ensures that the shares of all data blocks of the heterogeneous computing system in the same feature dimension sum to 1, then calculates the information entropy value of each feature, and finally determines the entropy weight of each feature from the proportion of the information entropy values of the different features. The normalized weights reflect the contribution of each feature to protocol selection, so the uncertainty of the features is quantified more accurately and the degree to which each feature dimension discriminates between protocols is improved.
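The three steps above (normalized share, information entropy, entropy weight) can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `entropy_weights`, the list-of-rows input layout, the normalization of the entropy by log n (so that it lies in [0, 1] and the target value 1 is meaningful), and the handling of an all-zero feature column are assumptions made for the example.

```python
import math


def entropy_weights(values):
    """Return (entropies, weights) for a table of raw feature values.

    values[i][j] is the original (non-negative) value of feature j for data
    block i. Layout and name are hypothetical, chosen for this sketch.
    """
    n = len(values)
    m = len(values[0])
    # Assumption: divide by log(n) so every entropy lies in [0, 1].
    log_n = math.log(n) if n > 1 else 1.0

    entropies = []
    for j in range(m):
        column = [row[j] for row in values]
        total = sum(column)
        # Share of each data block in feature j; the shares sum to 1.
        # Assumption: an all-zero column is treated as uniform.
        probs = [x / total for x in column] if total > 0 else [1.0 / n] * n
        raw = -sum(p * math.log(p) for p in probs if p > 0)
        entropies.append(raw / log_n)

    # Difference between the target value 1 and each entropy, clamped at 0
    # to absorb floating-point rounding.
    diffs = [max(0.0, 1.0 - e) for e in entropies]
    spread = sum(diffs)
    if spread == 0:
        # Assumption: no feature discriminates, so weight them equally.
        return entropies, [1.0 / m] * m
    return entropies, [d / spread for d in diffs]


# Two data blocks, two features (read count, write count).
entropies, weights = entropy_weights([[8, 1], [8, 9]])
print(entropies, weights)
```

In this example the read counts are identical across blocks, so the read feature carries no discriminating information and nearly all of the weight falls on the write feature.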
The above embodiment does not limit how a protocol is selected according to the entropy weights. On that basis, the present invention further defines, from several angles, how the entropy weights of different features correspond to the three protocols, which may include the following:
If the entropy weight of the read operation feature at the current moment is greater than the entropy weights of the other feature dimensions, the multicast synchronization protocol is selected as the protocol that best matches the task load running state of the heterogeneous computing system at the current moment. If the entropy weight of the write operation feature at the current moment is greater than the entropy weights of the other feature dimensions, the metadata index protocol is selected as the protocol that best matches the task load running state of the heterogeneous computing system at the current moment. If the entropy weight of the data heat feature at the current moment is greater than the entropy weights of the other feature dimensions, the protocol that best matches the task load running state at the current moment is determined from the metadata index protocol, the multicast synchronization protocol and the hybrid response protocol according to the trend of the data heat.
In this embodiment, the feature dimensions include a read operation feature, a write operation feature and a data heat feature. The read operation feature may be represented by the number of read operations, which reflects the read intensity of the data block, and the write operation feature may be represented by the number of write operations, which reflects the write intensity of the data block. After the entropy weights of the read operation feature, the write operation feature and the data heat feature are determined by the above method or by a method recorded in the related art, the entropy weights are compared. If the entropy weight of the read operation feature is the largest, read operations dominate in the current heterogeneous computing system, and, in order to maximize system performance, the multicast synchronization protocol intended for high read load scenes may be selected as the protocol that best fits the current task load state. If the entropy weight of the write operation feature is the largest, write operations dominate in the current heterogeneous computing system, and, in order to maximize system performance, the metadata index protocol intended for high write load scenes may be selected as the protocol that best fits the current task load state. For the data heat feature, high-frequency hot data triggers a more aggressive caching strategy, whereas if the data heat decays quickly, protocol switching is suppressed to avoid over-optimizing cold data; the protocol may therefore be selected according to the trend of the data heat.
By way of example, this embodiment may measure the data heat using an access frequency feature and a local attenuation factor feature. The access frequency feature may be expressed as a counted data access frequency value (times/second); high-frequency data may trigger a more aggressive caching strategy. The local attenuation factor feature quantifies the stability of the access pattern through the coefficient of variation of the access time intervals, that is, the ratio of their standard deviation to their mean, which captures the time sensitivity of data access; the larger the value, the faster the data heat decays. The local attenuation factor reduces misjudgment of cold data, avoids the protocol switching overhead caused by sporadic accesses, enhances dynamic adaptability, and improves the robustness of the cache consistency mechanism under load fluctuation. If the entropy weight of the local attenuation factor feature is greater than the entropy weight of the access frequency feature, the local attenuation factor feature may be measured through the access time distribution feature of the data block; that is, the protocol that best matches the task load running state at the current moment may be determined according to the access time distribution feature. When the access time distribution feature indicates that the data heat is stable, the multicast synchronization protocol may be selected; when it indicates that access fluctuates drastically, protocol switching is suppressed; and when it indicates that access fluctuates to some extent but medium heat is maintained overall, the hybrid response protocol may be adopted.
As can be seen from the above, this embodiment integrates the read-write proportion, the access frequency and the access time distribution through the entropy weights, so that the dynamic characteristics of the task load are captured more accurately, the precision of protocol selection is improved, and flexible switching of the consistency maintenance protocol is realized.
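The comparison of entropy weights described above can be sketched as a small dispatch function. This is an illustrative sketch only: the `Protocol` enumeration, the dictionary keys `"read"`, `"write"` and `"heat"`, and the `heat_choice` argument (standing in for the result of the separate data heat trend analysis) are names assumed for the example, and ties are resolved by dictionary order, which the description does not specify.

```python
from enum import Enum


class Protocol(Enum):
    METADATA_INDEX = "metadata index protocol"
    MULTICAST_SYNC = "multicast synchronization protocol"
    HYBRID_RESPONSE = "hybrid response protocol"


def select_by_entropy_weight(weights, heat_choice):
    """Pick a protocol from the dominant entropy weight.

    weights: dict with hypothetical keys "read", "write" and "heat".
    heat_choice: protocol already chosen from the data heat trend; it is
    used only when the data heat feature has the largest entropy weight.
    """
    dominant = max(weights, key=weights.get)
    if dominant == "read":
        # Read operations dominate: high read load scene.
        return Protocol.MULTICAST_SYNC
    if dominant == "write":
        # Write operations dominate: high write load scene.
        return Protocol.METADATA_INDEX
    # Data heat dominates: defer to the heat trend decision.
    return heat_choice


chosen = select_by_entropy_weight(
    {"read": 0.6, "write": 0.3, "heat": 0.1}, Protocol.HYBRID_RESPONSE
)
print(chosen.value)
```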
The data heat feature may be represented by the local attenuation factor feature, which in turn may be represented by the access time distribution feature. A plurality of access time interval values, such as $\Delta t_1$, $\Delta t_2$ and $\Delta t_3$, may be determined from several consecutive access times of the current data block. If the access time interval values meet a preset burst access limit condition, for example one access interval deviates greatly from the other access intervals, protocol switching is suppressed. If the access time interval values meet a preset medium fluctuation access condition, for example the differences between the access interval values fluctuate only slightly, the hybrid response protocol is selected as the protocol that best matches the task load running state of the heterogeneous computing system at the current moment.
Furthermore, in order to measure the local attenuation factor feature accurately, this embodiment also provides a quantized representation of it: at least four consecutive access times of each data block are obtained, the access time interval value between every two adjacent access times is calculated, and the ratio of the standard deviation of the access time interval values to their mean is taken as the local attenuation factor feature. As an efficient implementation, a local attenuation factor feature calculation relation may be stored in advance and called to calculate the local attenuation factor feature of the i-th data block, where the local attenuation factor feature calculation relation may be expressed as:
$$\lambda_i = \frac{\sigma_i}{\mu_i};$$
wherein $\mu_i = \frac{1}{3}\left(\Delta t_1 + \Delta t_2 + \Delta t_3\right)$, $\sigma_i = \sqrt{\frac{1}{3}\sum_{k=1}^{3}\left(\Delta t_k - \mu_i\right)^2}$.
Wherein $\lambda_i$ is the local attenuation factor feature, and the time interval variables $\Delta t_1$, $\Delta t_2$ and $\Delta t_3$ represent the three time intervals between the most recent consecutive accesses of data block $B_i$, i.e. $\Delta t_1$ is the time difference between the 1st and 2nd accesses, $\Delta t_2$ is the time difference between the 2nd and 3rd accesses, and $\Delta t_3$ is the time difference between the 3rd and 4th accesses. The physical meaning of the formula is to reflect the access time distribution feature of the data block. For example, a short interval means that the data is accessed frequently and has high heat, while a long interval indicates that the data may be entering a cold state. $\mu_i$ represents the mean of the three time intervals $\Delta t_1$, $\Delta t_2$ and $\Delta t_3$, and $\sigma_i$ represents their standard deviation, which measures their fluctuation. A larger standard deviation means that the time intervals differ more and the access pattern is unstable; conversely, a low standard deviation means that the time intervals are even and the access pattern is regular.
After the local attenuation factor feature is quantized as in the above embodiment, the corresponding protocol may be selected through a simple numerical comparison. If the value of the local attenuation factor feature is approximately equal to 0, that is, within a small value of 0 where the small value does not exceed 0.1, the multicast synchronization protocol is selected as the protocol that best matches the task load running state of the heterogeneous computing system at the current moment. If the value of the local attenuation factor feature is greater than 1, a protocol switching suppression prompt message is generated. If the value of the local attenuation factor feature is approximately equal to 1, that is, within a small value of 1 where the small value does not exceed 0.1, the hybrid response protocol is selected as the protocol that best matches the task load running state of the heterogeneous computing system at the current moment.
In this embodiment, a local attenuation factor feature value approximately equal to 0 indicates that the access intervals are highly uniform, as in periodic access, and that the data heat is stable. A value greater than 1 indicates that the access intervals fluctuate drastically, as when a long idle period follows a burst of accesses; the data heat decays quickly, so protocol switching should be suppressed to avoid over-optimizing cold data. A value approximately equal to 1 indicates that the access intervals fluctuate to some extent but medium heat is maintained overall and the heat decays slowly.
As can be seen from the above, this embodiment quantifies the volatility of the access times through the local attenuation factor and thereby distinguishes short-term from long-term hot and cold data, realizing dynamic perception of data heat. When the data heat decays quickly, the current protocol tends to be kept, the switching overhead for cold data is suppressed, and excessive response to sporadic accesses is avoided; the efficient protocol is preferentially allocated to stable hot spot data, which improves the cache hit rate and the resource utilization rate.
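The quantization and the numerical comparison described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the use of the population standard deviation, the tolerance default of 0.1, the return strings, and the choice to test "approximately 1" before "greater than 1" (the two ranges overlap between 1 and 1.1) are all assumptions made for illustration. Values that fall in none of the described ranges return `None` here, meaning no decision is made.

```python
import statistics


def local_decay_factor(access_times):
    """Standard deviation over mean of the last three access intervals.

    access_times: ascending timestamps of one data block; at least four
    consecutive accesses are needed to form three intervals.
    """
    if len(access_times) < 4:
        raise ValueError("at least four consecutive access times are required")
    recent = access_times[-4:]
    intervals = [later - earlier for earlier, later in zip(recent, recent[1:])]
    mean = statistics.fmean(intervals)
    if mean == 0:
        # Assumption: identical timestamps are treated as perfectly uniform.
        return 0.0
    # Assumption: population standard deviation (divide by 3).
    return statistics.pstdev(intervals) / mean


def classify_decay_factor(factor, tolerance=0.1):
    """Map the factor onto the decisions described in the text."""
    if abs(factor) <= tolerance:
        return "multicast synchronization protocol"
    if abs(factor - 1.0) <= tolerance:
        return "hybrid response protocol"
    if factor > 1.0:
        return "suppress protocol switching"
    return None  # range not covered by the description


# Illustrative timestamps only: periodic, moderately uneven, bursty.
for times in ([0, 10, 20, 30], [0, 1, 2, 10], [0, 1, 2, 50]):
    factor = local_decay_factor(times)
    print(times, round(factor, 3), classify_decay_factor(factor))
```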
In addition to the protocol selection method provided by the above embodiment, in order to cope with abrupt load changes and long-term fluctuation of the heterogeneous computing system and to reduce its performance jitter, the present invention also provides another protocol selection method, which may include the following steps:
Calculating, based on historical threshold information and attenuation adjustment information, the thresholds at different moments within the time window of the current moment, and selecting the maximum value and the minimum value from these thresholds; updating the existing maximum threshold and minimum threshold corresponding to the current moment based on the maximum value and the minimum value to obtain a maximum new threshold and a minimum new threshold; determining a protocol matching degree value according to the entropy weights of the different features and the corresponding adjustment factors; and selecting, from the metadata index protocol, the multicast synchronization protocol and the hybrid response protocol, the protocol that best matches the task load running state at the current moment according to the numerical relation between the protocol matching degree value and the maximum new threshold and minimum new threshold.
In this embodiment, the historical threshold information is the value data of the maximum threshold and the minimum threshold in a specified period of time before the current moment, and may be the average of all the thresholds in that period. The attenuation adjustment information may be a fixed attenuation constant, a value adjusted flexibly according to the actual situation, or a value that changes linearly with time, none of which affects the implementation of the present invention. In addition, the maximum value and the minimum value may be further adjusted by delaying the rate at which the threshold decreases, for example with the value decreasing from 5 to 4.8, to obtain the maximum threshold and the minimum threshold. The adjustment factors are adjustment coefficients set for the different features empirically according to the actual situation, and the protocol matching degree value is a value generated by combining the entropy weights of the different features and the corresponding adjustment coefficients according to an operation rule.
This embodiment also provides a way of adaptively adjusting the maximum new threshold and the minimum new threshold using a sliding window. Let t represent a time value in the current time window T. The threshold at each time t within the time window is calculated from the historical average threshold and the attenuation constant (0.1 by default), and the maximum value and the minimum value of the calculated thresholds are selected as the maximum new threshold and the minimum new threshold respectively.
An exemplary method of calculating the protocol matching degree value may include: determining read proportion data according to the entropy weight of the read operation feature, the read adjustment factor, the number of read operations and the number of write operations; determining write proportion data according to the entropy weight of the write operation feature, the write adjustment factor, the number of read operations and the number of write operations; determining access distribution data according to the entropy weight of the local attenuation factor feature, the value of the local attenuation factor feature and the heat adjustment factor; and determining the protocol matching degree value according to the read proportion data, the write proportion data and the access distribution data. As an efficient calculation method, for the case where the feature dimensions include a read operation feature, a write operation feature and a local attenuation factor feature, a protocol matching degree value calculation relation may be stored in advance and called to calculate the protocol matching degree value of the i-th data block Bi. The quantities related by the protocol matching degree value calculation relation are: the protocol matching degree value of the i-th data block Bi; the number of read operations of the i-th data block Bi; the number of write operations of the i-th data block Bi; the local attenuation factor feature value of the i-th data block Bi; the respective entropy weights of the read operation feature, the write operation feature and the local attenuation factor feature; and the adjustment factors, whose default values may be 1.2, 0.8 and 0.5. The adjustment factors and the entropy weights act as penalty terms that suppress protocol switching for data whose access times are dispersed, thereby reducing misjudgment of cold data.
The protocol is determined by comparing the protocol matching degree value with the thresholds as follows. If the protocol matching degree value is greater than the maximum new threshold, the protocol that best matches the task load running state at the current moment is the multicast synchronization protocol. If the protocol matching degree value is smaller than the minimum new threshold, the protocol that best matches the task load running state at the current moment is the metadata index protocol. If the protocol matching degree value is greater than or equal to the minimum new threshold and smaller than or equal to the maximum new threshold, the protocol that best matches the task load running state at the current moment is the hybrid response protocol.
For example, when the protocol matching degree value is greater than the maximum new threshold, the multicast synchronization protocol, which suits a high-read, low-frequency scene, is started, and the metadata query cost is reduced through multicast updating. When the protocol matching degree value is smaller than the minimum new threshold, the metadata index protocol, which suits a high-write, high-locality scene, is started, and the risk of broadcast storms is reduced through the central directory. For all other values, the hybrid response protocol is used to combine the two mechanisms dynamically: write operations use the metadata index protocol to maintain the directory state, and read operations choose, by probability and according to the real-time load, between responding from the local cache and performing metadata verification.
As can be seen from the above, this embodiment selects the protocol dynamically by means of adaptive thresholds and gradient smoothing. Adjusting the thresholds dynamically avoids the overfitting caused by static thresholds: under a sudden load, such as transient write-intensive tasks, the system converges quickly to the metadata index protocol and the directory query delay is reduced, while in a stable read scene the multicast synchronization protocol is maintained over a long period to reduce metadata overhead. On the basis of the dynamic characteristics of the task load, flexible switching of the consistency maintenance protocol is thus realized, abrupt load changes and long-term fluctuation are handled, and performance jitter is reduced.
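The final comparison step can be sketched as below. The sketch deliberately takes the protocol matching degree value and the per-moment thresholds of the sliding window as inputs, because how they are computed is given by the stored relations and is not reproduced here. The function names and return strings are hypothetical, and the sketch simplifies the update of the existing thresholds by taking the window maximum and minimum directly as the new thresholds.

```python
def window_thresholds(thresholds_in_window):
    """Return (maximum new threshold, minimum new threshold).

    thresholds_in_window: thresholds computed for the moments inside the
    current time window (assumed to be supplied by the caller).
    """
    if not thresholds_in_window:
        raise ValueError("the time window contains no thresholds")
    return max(thresholds_in_window), min(thresholds_in_window)


def select_by_matching_degree(matching_degree, thresholds_in_window):
    """Compare the matching degree value with the two new thresholds."""
    upper, lower = window_thresholds(thresholds_in_window)
    if matching_degree > upper:
        return "multicast synchronization protocol"
    if matching_degree < lower:
        return "metadata index protocol"
    # lower <= matching_degree <= upper
    return "hybrid response protocol"


# Illustrative numbers only.
window = [0.50, 0.47, 0.45, 0.42]
for value in (0.60, 0.45, 0.30):
    print(value, select_by_matching_degree(value, window))
```

Keeping the threshold calculation outside this function means the same comparison can be reused whichever attenuation rule (fixed constant, adjusted value, or linear in time) produces the window thresholds.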
Based on the above embodiment, in order to further reduce the write latency, this embodiment further provides an implementation of dynamic directory compression, which may include the following: if the cache consistency protocol of the heterogeneous computing system has been switched to the metadata index protocol, the access frequency of each data block of the heterogeneous computing system at the current moment is counted, and the directory entries of the target data blocks whose access frequency is lower than a preset access frequency threshold are compressed and stored.
The metadata index protocol of this embodiment queries the directory only when necessary through a lightweight directory tree, which avoids broadcast storms, and reduces global lock contention through hierarchical directories. To optimize the protocol further, dynamic directory compression may be performed: the directory entries of cold data, that is, data whose access frequency is lower than the preset access frequency threshold, are compressed and stored, which reduces memory occupation and reduces metadata overhead by 30%. Compared with a traditional directory protocol, hierarchical directory compression reduces write operation delay by 15% to 20%, which effectively reduces write operation delay in high write load scenes and improves system performance.
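The dynamic directory compression step can be illustrated with the sketch below. The representation of a directory entry as a byte string, the dictionary layout, the function name, and the use of `zlib` as the compressor are assumptions made for the example; the description does not prescribe a compression algorithm.

```python
import zlib


def compress_cold_entries(directory, access_frequency, frequency_threshold):
    """Compress the directory entries of cold data blocks.

    directory: dict mapping block id -> serialized directory entry (bytes).
    access_frequency: dict mapping block id -> accesses per second.
    Returns a dict mapping block id -> (is_compressed, payload).
    """
    stored = {}
    for block_id, entry in directory.items():
        frequency = access_frequency.get(block_id, 0.0)
        if frequency < frequency_threshold:
            # Cold block: store the entry in compressed form.
            stored[block_id] = (True, zlib.compress(entry))
        else:
            # Hot block: keep the entry directly accessible.
            stored[block_id] = (False, entry)
    return stored


def read_entry(stored, block_id):
    """Return the original directory entry, decompressing if needed."""
    is_compressed, payload = stored[block_id]
    return zlib.decompress(payload) if is_compressed else payload


directory = {
    "blk0": b"state=S;sharers=cpu0,gpu1;" * 8,
    "blk1": b"state=M;owner=gpu0;" * 8,
}
stored = compress_cold_entries(directory, {"blk0": 0.2, "blk1": 35.0}, 1.0)
print({block: flag for block, (flag, _) in stored.items()})
```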
Based on the above embodiment, this embodiment further provides request processing based on the multicast synchronization protocol, which may include the following: if the cache consistency protocol of the heterogeneous computing system has been switched to the multicast synchronization protocol, then when a target computing unit receives a read operation request by listening to the multicast channel, it reads the corresponding target data from its private local cache. After the target data is read, if the cache state of the private local cache is a non-dirty state, the private local cache is marked as expired and an asynchronous refresh operation is performed; if the cache state of the private local cache is a dirty state, a conflict detection task is triggered.
The multicast synchronization protocol is suitable for high read load scenes. By maximizing the cache hit rate, the local cache responds to read requests directly, which reduces the number of metadata queries and the metadata verification delay and thus effectively reduces the read delay. The multicast synchronization protocol is a protocol in which each computing unit executes the corresponding operation according to what it hears on the multicast channel, and in which the multicast range is adjusted dynamically according to the cache hit rate. First, the protocol multicasts by probability, that is, the multicast range depends on the cache hit rate of the computing units and is adjusted dynamically. Second, the protocol uses listening-based cache verification, that is, the computing units within the multicast range listen to the multicast channel and execute the corresponding operation after receiving an update notification. For example, if the local cache is in the "non-dirty" state, meaning that the data in the cache line of the local cache has not been modified, the data is marked as expired and refreshed asynchronously; if the local cache is in the "dirty" state, meaning that the data in the cache line has been modified but not yet written back to main memory, a conflict detection mechanism such as version number comparison may be triggered so that the latest version is preferentially preserved. Compared with a traditional directory protocol, a read operation under the multicast synchronization protocol does not need a directory query, so its delay is lower and the read delay is reduced by 25% to 30%. Compared with a traditional full-broadcast protocol, probabilistic multicast reduces redundant traffic, reduces network load by 10% to 15%, and improves bandwidth utilization by 20%.
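The handling of a cache line under the multicast synchronization protocol can be sketched as follows. The `CacheLine` record, the `refresh_queue` list standing in for the asynchronous refresh mechanism, the `remote_version` argument, and the returned status strings are hypothetical names introduced for this example; version number comparison is used as the conflict detection mechanism because the text names it as one option.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    data: bytes
    version: int
    dirty: bool = False
    expired: bool = False


def handle_multicast_read(line, remote_version, refresh_queue):
    """Serve a read from the private local cache, then reconcile state.

    Returns (data, status). refresh_queue collects lines that need an
    asynchronous refresh; a real system would hand them to a worker.
    """
    data = line.data  # the read is answered from the local cache
    if not line.dirty:
        # Non-dirty: mark as expired and schedule an asynchronous refresh.
        line.expired = True
        refresh_queue.append(line)
        return data, "refresh scheduled"
    # Dirty: conflict detection by version number comparison,
    # preferring the latest version.
    if line.version >= remote_version:
        return data, "conflict: local version kept"
    return data, "conflict: remote version newer"


queue = []
clean = CacheLine(data=b"weights-v3", version=3)
modified = CacheLine(data=b"weights-v5", version=5, dirty=True)
print(handle_multicast_read(clean, 4, queue))
print(handle_multicast_read(modified, 4, queue))
```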
Based on the above embodiment, the present embodiment further provides request processing based on a hybrid response protocol, which may include the following:
If the cache consistency protocol of the heterogeneous computing system has been switched to the hybrid response protocol, then when a write operation request is received, the consistency of the private local caches of all computing units of the heterogeneous computing system is maintained according to the metadata index protocol while the write operation request is processed. When a read operation request is received, if the task load of the heterogeneous computing system is greater than a preset load threshold, the consistency of the private local caches of all computing units is maintained according to the metadata index protocol while the read operation request is processed; if the task load of the heterogeneous computing system is less than or equal to the preset load threshold, the consistency of the private local caches of all computing units is maintained according to the multicast synchronization protocol while the read operation request is processed.
In this embodiment, the hybrid response protocol is a fusion of a directory protocol and a broadcast protocol, and it decouples the read policy from the write policy. Write operations use the directory lock mechanism of the metadata index protocol to ensure consistency. For read operations, the policy is selected dynamically according to the real-time load: if the current load is low, the local cache of the multicast synchronization protocol responds directly; if the load is high, metadata verification is triggered to avoid multicast congestion. Whether the load is high or low can be measured against a preset threshold. A protocol state machine for the data blocks may be maintained by a state migration engine. The protocol state machine records the cache states of all data blocks, such as the modified state, exclusive state, dirty state, invalid state and shared state, and uses these cache states to ensure the validity of the data, thereby ensuring data consistency and keeping the copies of the same data cached in different local private caches of the heterogeneous computing system consistent. The protocol dynamically balances mixed read-write loads and adapts to rapid load changes, and the state migration engine avoids cache invalidation during protocol switching and avoids performance oscillation by making the transition smooth. Compared with a traditional directory protocol, the directory and broadcast mechanisms are fused dynamically, which avoids the limitations of a single protocol; compared with a traditional broadcast protocol, on-demand switching is supported and compatibility is stronger.
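The decoupled read and write policy of the hybrid response protocol can be sketched as a small dispatch function. The function name, the request type strings, and the numeric load scale are hypothetical; only the branching rule (writes always use the metadata index protocol, reads depend on whether the load exceeds the preset threshold) is taken from the description.

```python
def hybrid_dispatch(request_type, task_load, load_threshold):
    """Choose the consistency mechanism for one request.

    request_type: "read" or "write".
    task_load / load_threshold: load measure and preset threshold, in any
    consistent unit (assumed here to be plain numbers).
    """
    if request_type == "write":
        # Writes always go through the directory lock mechanism.
        return "metadata index protocol"
    if request_type == "read":
        if task_load > load_threshold:
            # High load: verify against metadata to avoid multicast congestion.
            return "metadata index protocol"
        # Low load: the local cache responds directly.
        return "multicast synchronization protocol"
    raise ValueError(f"unknown request type: {request_type!r}")


for request, load in (("write", 0.2), ("read", 0.9), ("read", 0.3)):
    print(request, load, hybrid_dispatch(request, load, load_threshold=0.7))
```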
It should be noted that, in the present invention, the steps are not strictly executed sequentially, so long as they conform to the logic sequence, the steps may be executed simultaneously, or may be executed according to a certain preset sequence, and fig. 2 is only a schematic manner, and is not meant to represent only such an execution sequence.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, or by means of hardware, although in many cases the former is the preferred implementation. The invention also provides a corresponding device for the cache consistency maintenance method, which makes the method more practical. The device may be described from the perspective of functional modules and from the perspective of hardware. The functions of each program module of this embodiment are introduced below; the cache consistency maintenance device described below and the cache consistency maintenance method described above may be referred to in correspondence with each other.
From the perspective of functional modules, referring to fig. 5, fig. 5 is a block diagram of a cache consistency maintenance device provided by this embodiment in a specific implementation, where the device may include:
The protocol abstraction module 501 is configured to redefine a cache coherence protocol of a heterogeneous computing system into a metadata index protocol, a multicast synchronization protocol and a hybrid response protocol, where the metadata index protocol is a protocol that each computing unit includes directory information for storing local metadata, a target data block meeting a condition that a writing frequency is greater than a preset frequency threshold has central directory information, the multicast synchronization protocol is a protocol that each computing unit executes a corresponding operation according to monitoring information of a multicast channel, and a multicast range is dynamically adjusted according to a cache hit rate, and the hybrid response protocol is a fusion protocol of the directory protocol and a broadcast protocol.
The protocol dynamic selection module 502 is configured to determine, according to entropy weights of different feature dimensions determined from feature values of the data blocks of the heterogeneous computing system, the protocol that best matches the task load running state at the current time from among the metadata index protocol, the multicast synchronization protocol and the hybrid response protocol, where the feature dimensions include at least a read-write operation feature and a data heat feature.
In some implementations of the present embodiment, the protocol dynamic selection module 502 may be further configured to: if the entropy weight of the read operation feature at the current time is greater than the entropy weights of the other feature dimensions, select the multicast synchronization protocol as the protocol that best matches the task load running state of the heterogeneous computing system at the current time; if the entropy weight of the write operation feature at the current time is greater than the entropy weights of the other feature dimensions, select the metadata index protocol as the protocol that best matches the task load running state of the heterogeneous computing system at the current time; and if the entropy weight of the data heat feature at the current time is greater than the entropy weights of the other feature dimensions, determine the protocol that best matches the task load running state at the current time from among the metadata index protocol, the multicast synchronization protocol and the hybrid response protocol according to the trend of change in data heat.
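The selection rule above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the names `Protocol`, `select_by_dominant_weight` and `heat_fallback` are hypothetical, and delegating the data-heat case to a caller-supplied function is an assumption made for illustration.

```python
from enum import Enum
from typing import Callable, Dict


class Protocol(Enum):
    METADATA_INDEX = "metadata_index"
    MULTICAST_SYNC = "multicast_sync"
    HYBRID_RESPONSE = "hybrid_response"


def select_by_dominant_weight(
    weights: Dict[str, float],
    heat_fallback: Callable[[], Protocol],
) -> Protocol:
    """Pick a protocol from the feature dimension with the largest entropy weight.

    `weights` must contain the keys "read", "write" and "heat" (hypothetical
    names). Ties are resolved in dictionary order, which the source does not
    specify.
    """
    dominant = max(weights, key=weights.get)
    if dominant == "read":
        return Protocol.MULTICAST_SYNC   # read-dominated load
    if dominant == "write":
        return Protocol.METADATA_INDEX   # write-dominated load
    # Data heat dominates: the decision depends on the heat trend,
    # which this sketch leaves to the caller.
    return heat_fallback()


if __name__ == "__main__":
    choice = select_by_dominant_weight(
        {"read": 0.5, "write": 0.3, "heat": 0.2},
        heat_fallback=lambda: Protocol.HYBRID_RESPONSE,
    )
    print(choice)
```

Keeping the data-heat branch as a callback separates the simple read/write cases from the finer-grained heat analysis described in the following implementations.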
As an exemplary implementation of the foregoing embodiment, the protocol dynamic selection module 502 may be further configured to handle the case where the data heat feature includes an access frequency feature and a local attenuation factor feature, the local attenuation factor feature being determined according to an access time distribution feature of the data block: if the entropy weight of the access frequency feature is greater than the entropy weight of the local attenuation factor feature, select the metadata index protocol as the protocol that best matches the task load running state of the heterogeneous computing system at the current time; and if the entropy weight of the local attenuation factor feature is greater than the entropy weight of the access frequency feature, determine the protocol that best matches the task load running state at the current time according to the access time distribution feature.
As another exemplary implementation manner of the foregoing embodiment, the protocol dynamic selection module 502 may be further configured to determine a plurality of access time interval values according to a plurality of consecutive access times of the current data block, select a multicast synchronization protocol as a protocol that is most matched with a task load running state of the heterogeneous computing system at the current time if each access time interval value meets a preset same similar condition, generate a protocol switching suppression prompt message if each access time interval value meets a preset burst access post-constraint condition, and select a hybrid response protocol as a protocol that is most matched with the task load running state of the heterogeneous computing system at the current time if each access time interval value meets a preset medium fluctuation access condition.
As another exemplary implementation of the foregoing embodiment, the protocol dynamic selection module 502 may be further configured to obtain at least four consecutive access times of each data block, calculate the access time interval value between every two adjacent access times, and use the ratio of the mean value and the standard deviation of the access time interval values as the local attenuation factor feature.
As an exemplary implementation manner of the foregoing embodiment, the foregoing protocol dynamic selection module 502 may be further configured to select the multicast synchronization protocol as a protocol that is most matched with the task load running state of the heterogeneous computing system at the current time if the value of the local attenuation factor characteristic is approximately equal to 0, generate the protocol switching suppression prompt message if the value of the local attenuation factor characteristic is greater than 1, and select the hybrid response protocol as the protocol that is most matched with the task load running state of the heterogeneous computing system at the current time if the value of the local attenuation factor characteristic is approximately equal to 1.
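As an illustration of the local attenuation factor and the three-way decision above, the following Python sketch computes the factor from consecutive access times. The text does not state the order of the ratio; the sketch assumes standard deviation divided by mean, since a value near 0 then corresponds to nearly identical intervals. The tolerance `tol` used for "approximately equal", the use of the population standard deviation, and all function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def access_intervals(access_times: List[float]) -> List[float]:
    """Intervals between adjacent access times (at least four accesses required)."""
    if len(access_times) < 4:
        raise ValueError("at least four consecutive access times are required")
    return [later - earlier for earlier, later in zip(access_times, access_times[1:])]


def local_attenuation_factor(access_times: List[float]) -> float:
    """Assumed form: population standard deviation of the intervals over their mean."""
    intervals = access_intervals(access_times)
    avg = mean(intervals)
    if avg == 0:
        raise ValueError("all access times are identical; factor is undefined")
    return pstdev(intervals) / avg


def classify(factor: float, tol: float = 0.1) -> str:
    """Map the factor to an action; `tol` is an assumed tolerance."""
    if factor <= tol:
        return "multicast_sync"      # factor approximately 0: regular accesses
    if abs(factor - 1.0) <= tol:
        return "hybrid_response"     # factor approximately 1: medium fluctuation
    if factor > 1.0:
        return "suppress_switch"     # bursty accesses: suppress protocol switching
    return "undetermined"            # range not covered by the description


if __name__ == "__main__":
    regular = [0, 10, 20, 30, 40]
    bursty = [0, 1, 2, 3, 100]
    print(classify(local_attenuation_factor(regular)))
    print(classify(local_attenuation_factor(bursty)))
```

Values between the tolerance bands (for example 0.5) are reported as undetermined because the description only names the three cases.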
In other implementations of the present embodiment, the protocol dynamic selection module 502 may be further configured to determine, for each data block, feature probability distribution information of the current feature of the current data block according to a ratio of an original feature value of the current feature of the current data block to a sum of original feature values of current features of all data blocks of the heterogeneous computing system, determine, for each data block, an information entropy value of the current feature according to the feature probability distribution information of the current feature of each data block of the heterogeneous computing system, and determine, for each data block, an entropy weight of each feature according to a proportional relationship between the information entropy value of the current feature and the information entropy values of all features.
As an exemplary implementation of the foregoing embodiment, the protocol dynamic selection module 502 may be further configured to calculate, for each data block, the product of the feature probability distribution information of the current data block for the current feature and the logarithm of that feature probability distribution information, and use the negative of the sum of the products over the data blocks of the heterogeneous computing system as the information entropy value of the current feature.
As another exemplary implementation manner of the foregoing embodiment, the foregoing protocol dynamic selection module 502 may be further configured to calculate a difference value between an information entropy value of each feature and a target value, and use a ratio of a difference value corresponding to the current feature to a sum of corresponding difference values of other features as an entropy weight of the current feature.
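The entropy-weight computation described in the preceding implementations can be sketched as follows. This is only an illustration under stated assumptions: the entropy is normalized by the logarithm of the number of data blocks, the target value is taken to be 1, and the denominator sums the differences over all features, as in the conventional entropy weight method. The description does not fix these details, and the function name `entropy_weights` is hypothetical.

```python
import math
from typing import Dict, List


def entropy_weights(features: Dict[str, List[float]]) -> Dict[str, float]:
    """Compute an entropy weight per feature.

    `features` maps a feature name to its non-negative raw values,
    one value per data block.
    """
    diffs: Dict[str, float] = {}
    for name, values in features.items():
        n = len(values)
        total = sum(values)
        if n < 2 or total <= 0:
            raise ValueError(f"feature {name!r} needs >= 2 blocks and a positive sum")
        # Probability of each block: raw value over the sum across all blocks.
        probs = [v / total for v in values]
        # Information entropy: negative sum of p * log(p); zero terms are skipped.
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        # Assumption: normalize to [0, 1] so that the target value can be 1.
        diffs[name] = 1.0 - entropy / math.log(n)
    denom = sum(diffs.values())
    if denom == 0:
        raise ValueError("all features are uniformly distributed; weights undefined")
    return {name: d / denom for name, d in diffs.items()}


if __name__ == "__main__":
    weights = entropy_weights({
        "read": [10, 12, 11, 9],
        "write": [30, 1, 1, 0],
    })
    print(weights)
```

A feature whose values are spread evenly across data blocks carries little discriminating information and therefore receives a small weight, while a feature concentrated on a few blocks receives a large one.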
In other implementations of the present embodiment, the protocol abstraction module 501 may be further configured to, if the cache coherence protocol of the heterogeneous computing system is switched to the metadata index protocol, count the access frequency of each data block of the heterogeneous computing system at the current time, and compress and store the directory entries of target data blocks whose access frequency is below a preset access frequency threshold.
In other implementations of the present embodiment, the protocol abstraction module 501 may be further configured to, if the cache coherence protocol of the heterogeneous computing system is switched to the multicast synchronization protocol: when a target computing unit receives a read operation request by listening on the multicast channel, read the corresponding target data from the private local cache of the target computing unit; after the target data is read, if the cache state of the private local cache is a non-dirty state, mark the private local cache as expired and execute an asynchronous refresh operation; and if the cache state of the private local cache is a dirty state, trigger a conflict detection task.
In other implementations of the present embodiment, the protocol abstraction module 501 may be further configured to, if the cache coherence protocol of the heterogeneous computing system is switched to the hybrid response protocol, maintain the coherence of the private local caches of the computing units of the heterogeneous computing system according to the metadata index protocol during processing the write operation request when the write operation request is received, maintain the coherence of the private local caches of the computing units of the heterogeneous computing system according to the metadata index protocol during processing the read operation request when the read operation request is received, and maintain the coherence of the private local caches of the computing units of the heterogeneous computing system according to the metadata index protocol during processing the read operation request when the task load of the heterogeneous computing system is less than or equal to the preset load threshold.
In other implementations of the present embodiment, the protocol dynamic selection module 502 may be further configured to calculate thresholds at different times within a time window where the current time is located based on the historical threshold information and the attenuation adjustment information, and select a maximum value and a minimum value from the thresholds, update the maximum threshold at the current time according to the maximum value, update the minimum threshold at the current time according to the minimum value, update the maximum threshold and the minimum threshold at the current time to obtain a maximum new threshold and a minimum new threshold, determine a protocol matching degree value according to entropy weights of different features and corresponding adjustment factors thereof, and select a protocol that is most matched with a task load running state at the current time from among a metadata index protocol, a multicast synchronization protocol, and a hybrid response protocol according to a numerical relation between the protocol matching degree value and the maximum new threshold and the minimum new threshold.
As an exemplary implementation of the above embodiment, the protocol dynamic selection module 502 may be further configured to determine read proportion data according to the entropy weight of the read operation feature, the read adjustment factor, the number of read operations and the number of write operations; determine write proportion data according to the entropy weight of the write operation feature, the write adjustment factor, the number of read operations and the number of write operations; determine access distribution data according to the entropy weight of the local attenuation factor feature, the value of the local attenuation factor feature and the heat adjustment factor; and determine the protocol matching degree value according to the read proportion data, the write proportion data and the access distribution data.
As another exemplary implementation manner of the foregoing embodiment, the foregoing protocol dynamic selection module 502 may be further configured to, if the protocol matching degree value is greater than the maximum new threshold, determine that the protocol that is most matched with the task load running state at the current time is a multicast synchronization protocol, if the protocol matching degree value is less than the minimum new threshold, determine that the protocol that is most matched with the task load running state at the current time is a metadata index protocol, and if the protocol matching degree value is greater than or equal to the minimum new threshold and less than or equal to the maximum new threshold, determine that the protocol that is most matched with the task load running state at the current time is a hybrid response protocol.
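The threshold comparison above amounts to a three-way range check, sketched below in Python. The function and parameter names are hypothetical, and the matching degree value and the two updated thresholds are assumed to have been computed beforehand as described.

```python
def select_by_matching_degree(score: float, min_new: float, max_new: float) -> str:
    """Choose a protocol from the protocol matching degree value and updated thresholds."""
    if min_new > max_new:
        raise ValueError("minimum threshold must not exceed maximum threshold")
    if score > max_new:
        return "multicast_sync"    # above the maximum new threshold
    if score < min_new:
        return "metadata_index"    # below the minimum new threshold
    return "hybrid_response"       # within [min_new, max_new], bounds included


if __name__ == "__main__":
    for value in (0.9, 0.5, 0.1):
        print(value, select_by_matching_degree(value, min_new=0.3, max_new=0.7))
```

The hybrid response protocol thus acts as the default for the middle band, including both boundary values.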
The cache consistency maintenance device has been described above from the perspective of functional modules. The invention further provides an electronic device, which is described from the perspective of hardware. Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device comprises a memory 601 and a processor 602, the memory 601 having stored therein a computer program, and the processor 602 being arranged to run the computer program to perform the steps of any of the cache consistency maintenance method embodiments described above.
Embodiments of the present application also provide a non-volatile storage medium having a computer program stored therein, wherein the computer program is configured to perform the steps of any of the cache coherence maintenance method embodiments described above when run.
In an exemplary embodiment, the computer readable storage medium may include, but is not limited to, various media in which a computer program may be stored, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the cache coherency maintenance method embodiments described above.
Embodiments of the present application also provide another computer program product comprising a non-volatile computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the cache coherence maintenance method embodiments described above.
Finally, referring to fig. 7, the present invention further provides a heterogeneous computing system, where the heterogeneous computing system includes at least a first computing node 701, a second computing node 702 and a cache consistency controller 703. The cache consistency controller 703 is connected to the first computing node 701 and the second computing node 702, and the first computing node 701 and the second computing node 702 include at least two computing units. The cache consistency controller 703 is configured to implement, when executing a computer program, the cache consistency maintenance method described in any one of the cache consistency maintenance method embodiments, so that the cache consistency protocol can be dynamically optimized, the performance and reliability of the heterogeneous computing system are improved, and efficient technical support is provided for fields such as intelligent driving, cloud computing and edge computing.
Furthermore, to overcome the problems of difficult consistency maintenance, high protocol complexity and poor scalability of heterogeneous computing systems in the related art, the invention, building on the dynamic consistency maintenance method that mixes bus snooping and directory protocols, correspondingly improves the communication structure of the computing units in the system. This embodiment defines how the computing units or nodes in the heterogeneous computing system are connected and builds a communication network cluster, which may include the following:
Information exchange among the computing nodes and computing units of a heterogeneous computing system is unavoidable, and both the network topology among them and the way information transmission is synchronized affect communication performance. At present, the general communication architecture in the related art is centralized: the node with the strongest computing performance is selected as the central processing node, the remaining heterogeneous computing nodes act as edge nodes, every edge node communicates individually with the central processing node, and edge nodes do not communicate with each other directly. Although this is easy to implement, as the number of computing nodes or computing units grows, the central node needs to process exponentially growing communication requests, which intensifies contention for communication bandwidth and readily causes system throughput to drop and transmission delay to surge. In addition, when physical bandwidth is limited, concurrent access by multiple nodes easily triggers packet collisions and queue overflow, so that critical gradient synchronization requests are lost or their responses time out. Furthermore, a single point of failure at the central node directly brings down the whole system, which harms task continuity and disrupts normal service operation.
In view of this, the present embodiment partitions the computing units of all computing nodes of the heterogeneous computing system. Taking two computing nodes as an example, the computing units of the first computing node and the computing units of the second computing node are divided into a plurality of clusters according to computing unit type. The computing units in the same cluster are of the same type, each cluster includes one master computing unit, and different clusters are connected in a ring topology through their respective master computing units. For ease of description, the leader node selected within a cluster is referred to as the master computing unit. Since the computing units in the same cluster are of the same type and differ little in performance, any node may be selected at random as the master computing unit, and the other nodes in the cluster are slave computing units. The terms master and slave are used only for distinction and do not limit the relationship between the two to a master-slave relationship.
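The clustering and ring connection described above can be sketched in Python as follows. The unit identifiers, the type labels and the choice of the first member of each cluster as the master computing unit are assumptions for illustration; the description allows the master to be chosen at random.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_clusters(units: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group computing units by type; `units` holds (unit_id, unit_type) pairs."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for unit_id, unit_type in units:
        clusters[unit_type].append(unit_id)
    return dict(clusters)


def ring_of_masters(clusters: Dict[str, List[str]]) -> Dict[str, Tuple[str, str]]:
    """Connect one master per cluster in a ring.

    Returns, for each master, its (previous, next) neighbours on the ring.
    The first member of each cluster is used as master for determinism.
    """
    masters = [members[0] for members in clusters.values()]
    count = len(masters)
    return {
        master: (masters[(i - 1) % count], masters[(i + 1) % count])
        for i, master in enumerate(masters)
    }


if __name__ == "__main__":
    # Units from two hypothetical computing nodes, mixed together.
    units = [("cpu0", "CPU"), ("gpu0", "GPU"), ("cpu1", "CPU"),
             ("fpga0", "FPGA"), ("gpu1", "GPU")]
    clusters = build_clusters(units)
    print(clusters)
    print(ring_of_masters(clusters))
```

Each master only needs to know its two ring neighbours, which is what keeps inter-cluster synchronization local.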
As can be seen from the above, the heterogeneous computing system of the present embodiment uses a simple and scalable ring topology for communication between clusters. Data is passed around the ring in a fixed order without large-scale data exchange or synchronization operations, so the communication overhead is low. Furthermore, the ring topology suits heterogeneous computing systems of different scales: the computing nodes are organized into a ring, and when applied to artificial intelligence tasks, ring communication and aggregation operations allow global updates of model parameters to be carried out efficiently, which accelerates training and improves parallel computing efficiency.
In the above embodiment, computing units of the same type are grouped into a cluster, and the clusters are connected in a ring topology. To further improve communication performance, the present invention also compares the influence of different network topologies among the computing units within each cluster on overall performance and determines the optimal intra-cluster topology, which may include the following:
In this embodiment, consider a scenario in which the communication bandwidth of every computing unit of the heterogeneous computing system is 2, where a connection between any two nodes counts as a communication bandwidth of 1. For the tree topology shown in fig. 8 and the star topology shown in fig. 9 within a computing group, the delay for sending the information of the edge computing units to the local head computing unit, that is, the master computing unit, is compared as follows.
For the tree topology, at time 1, computing units B and C send their information to computing unit A, and computing units F and G send their information to computing unit E; this takes 1 unit of time. At time 2, computing units A and E simultaneously send their information to computing unit 1; this also takes 1 unit of time. In total, the tree topology requires 2 units of time to deliver the information of the edge computing units to the local head computing unit. For the star topology, since the total communication bandwidth is 2, not all computing units can send at the same time, so a sending rule is needed; without loss of generality, the computing units send in order of their labels. At time 1, computing units A and B simultaneously send their information to computing unit 1 while the other computing units remain idle; this takes 1 unit of time. At time 2, computing units C and E simultaneously send their information to computing unit 1 while the other computing units remain idle; this takes 1 unit of time. At time 3, computing unit G sends its information to computing unit 1, which takes 1 unit of time. The star topology therefore requires 3 units of time to deliver the information of the edge computing units to the local head computing unit. Consequently, with the same number of edge computing units and limited bandwidth, the tree topology has a lower communication delay than the star topology.
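The delay comparison above can be reproduced with a small Python model. It is a simplification under stated assumptions: every receiving unit accepts at most `bandwidth` messages per unit of time, and a parent forwards only after it has received from all of its children. The function name `gather_time` and the dictionary encoding of the topologies are hypothetical, and both topologies use the six edge units A, B, C, E, F and G that appear in the tree walkthrough.

```python
import math
from typing import Dict, List


def gather_time(children: Dict[str, List[str]], root: str, bandwidth: int = 2) -> int:
    """Units of time until `root` has received information from its whole subtree."""
    kids = children.get(root, [])
    if not kids:
        return 0  # a leaf has nothing to wait for
    # Wait for the slowest child subtree, then receive from the children
    # in batches of at most `bandwidth` per unit of time.
    slowest = max(gather_time(children, kid, bandwidth) for kid in kids)
    return slowest + math.ceil(len(kids) / bandwidth)


# Tree: head unit "1" with intermediate units A and E.
tree_topology = {"1": ["A", "E"], "A": ["B", "C"], "E": ["F", "G"]}
# Star: all edge units attach directly to head unit "1".
star_topology = {"1": ["A", "B", "C", "E", "F", "G"]}

if __name__ == "__main__":
    print("tree:", gather_time(tree_topology, "1"))  # tree: 2
    print("star:", gather_time(star_topology, "1"))  # star: 3
```

With the five units listed in the star walkthrough (A, B, C, E and G) the model also yields 3 units of time, since the head unit still needs three receive rounds.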
In addition, the tree topology has the following advantages. In terms of cross-domain communication and information transfer, it can be used to organize and manage nodes and data in cross-domain scenarios and provides an effective mechanism for exchanging and passing information. In terms of fault tolerance and redundancy, it can increase the fault tolerance of the system through redundancy and backup: when one computing unit fails, traffic can be automatically routed to other available computing units, which safeguards the stability and availability of the system. In terms of cross-domain data consistency and synchronization, which are important problems in cross-domain scenarios, it can be used to synchronize and update data among different domains and thereby ensure data consistency. In terms of scalability and maintainability, new computing units can be added conveniently through its hierarchical structure, which enables the heterogeneous computing system to be expanded.
Based on the above, in the embodiment of the invention, the heterogeneous computing system at least comprises a first cluster and a second cluster, wherein the first cluster at least comprises a first master computing unit and a plurality of first slave computing units, the second cluster at least comprises a second master computing unit and a plurality of second slave computing units, the first cluster and the second cluster are communicated through the first master computing unit and the second master computing unit which are mutually connected, the first master computing unit is connected with each first slave computing unit according to a tree topology structure, and the second master computing unit is connected with each second slave computing unit according to the tree topology structure.
To make the technical solution of the present invention clearer to those skilled in the art, the present invention further provides the communication architecture of an exemplary heterogeneous computing system, as shown in fig. 10. The computing units of the heterogeneous computing system are divided into 5 clusters, and the master computing units of the clusters are computing unit 1, computing unit 2, computing unit 3, computing unit 4 and computing unit 5, respectively. Computing unit 1 in group 1, computing unit 2 in group 2, computing unit 3 in group 3, computing unit 4 in group 4 and computing unit 5 in group 5 are connected to form a ring topology, while the computing nodes within the same group are connected in a multi-layer tree topology with one master computing unit selected. For example, the computing units A-G in group 1 are arranged in multiple layers as a binary tree and computing unit 1 is selected as the leader node, which effectively alleviates communication congestion and delay among the nodes.
As can be seen from the above, the heterogeneous computing system of the embodiment adopts a communication architecture with a local tree structure and a global ring structure. By decoupling physical links from logical paths, the tree topology provides hierarchical route optimization and reduces cross-level communication load, while the ring topology provides redundant links and enables load balancing while improving topology robustness. The traffic bottleneck and single-point failure risk of a central node are avoided, and the multipath transmission mechanism improves the congestion resistance of the system, so that a balance is achieved between data communication efficiency and system reliability. In addition, because each cluster consists of homogeneous devices, the intra-cluster consistency maintenance protocol is easy to implement and protocol conversion and translation problems are avoided, which reduces the additional performance cost of the protocol. Each group selects a master computing unit that is responsible for receiving and transmitting global information, which significantly reduces the topology complexity and communication traffic compared with a fully interconnected mode. Furthermore, the inter-group ring connection is easy to expand, and protocol synchronization only needs to consider adjacent computing units rather than all other computing units, which reduces the difficulty of consistency maintenance. In conclusion, the topology design of the heterogeneous computing system combines the ring network and the tree network to exploit their respective advantages and meet the requirements of the system in various computing task scenarios.
Through reasonable design and configuration of the network structure, the flexibility, expandability, safety, performance and efficiency of the system can be improved, and communication, cooperation and data consistency among heterogeneous nodes and load balancing are promoted.
The above describes in detail a heterogeneous computing system, a cache consistency maintenance method, a device, an electronic apparatus, a non-volatile storage medium and a computer program product. In this specification, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to each other. Whether the example units and algorithm steps described in the disclosed embodiments are implemented in electronic hardware or in computer software depends on the particular application and design constraints of the solution, and one skilled in the art may use different methods for each particular application to implement the described functions without departing from the scope of the invention. The present invention is capable of numerous modifications and adaptations without departing from its principles, and such modifications and adaptations are intended to be within the scope of the present invention.