CN120123108A

CN120123108A - Kafka storage and computing separation data processing method and device

Info

Publication number: CN120123108A
Application number: CN202510616765.6A
Authority: CN
Inventors: 王珣; 陆阳; 袁娟
Original assignee: Fullsee Technology Co ltd
Current assignee: Fullsee Technology Co ltd
Priority date: 2025-05-14
Filing date: 2025-05-14
Publication date: 2025-06-10
Anticipated expiration: 2045-05-14
Also published as: CN120123108B

Abstract

The embodiment of the present application provides a data processing method and device for Kafka storage and computing separation, by constructing a multi-layer data processing architecture, including four levels: network communication, task distribution, processing scheduling and storage. The separation of hot and cold data is achieved through the hierarchical design of the pre-write log area and the main storage area, and the intelligent scheduling of system resources is achieved by combining the replica manager, coordinator and load predictor. An adaptive compression algorithm and verification mechanism are used to ensure the efficiency and integrity of data transmission, create a message cache area to store hot spot data, and realize flexible access to hot and cold data based on the message site. This method effectively solves the shortcomings of traditional technologies in storage and computing coupling, storage efficiency and load balancing, and significantly improves the performance and scalability of the message queue system.

Description

Data processing method and device for kafka memory calculation separation

Technical Field

The application relates to the field of data processing, in particular to a data processing method and device for kafka calculation separation.

Background

The existing Kafka data processing method has obvious defects. The traditional architecture has high coupling degree between storage and calculation, is difficult to realize the elastic expansion of resources, and has the problem of unbalanced data access hot spots.

Furthermore, the prior art has a bottleneck in data storage efficiency. Most systems fail to effectively distinguish cold and hot data, and lack of an intelligent data compression and caching mechanism results in low storage space utilization and large access delay.

Existing systems have technology shortboards in terms of load balancing and fault tolerance mechanisms. The lack of dynamic partition adjustment and prediction capabilities makes it difficult to cope with bursty traffic and system failures. The resolution of these problems is of great importance to improve the performance and reliability of the message queuing system.

Disclosure of Invention

Aiming at the problems in the prior art, the application provides a data processing method and device for kafka calculation separation, which can effectively solve the defects in the aspects of calculation coupling, storage efficiency, load balancing and the like in the traditional technology and remarkably improve the performance and expandability of a message queue system.

In order to solve at least one of the problems, the application provides the following technical scheme:

in a first aspect, the present application provides a data processing method of kafka memory separation, comprising:

The method comprises the steps of constructing a multi-layer data processing architecture, wherein the multi-layer data processing architecture comprises a network communication layer, a task distribution layer, a processing scheduling layer and a storage layer, the storage layer comprises a stream storage library, the stream storage library comprises a pre-write log area and a main storage area, a plurality of log fragments are divided in the pre-write log area, the main storage area realizes data persistence through object storage service, the processing scheduling layer comprises a copy manager, a coordinator and a load predictor, the copy manager is responsible for data copy synchronization, the coordinator is responsible for distribution and group management, and the load predictor is used for establishing a prediction model based on the utilization rate of system resources;

Receiving batch message data sent by a producer, calculating data characteristics of the batch message data, selecting a compression algorithm according to the data characteristics, executing compression coding on the batch message data to generate a check code, writing the compressed message data into log fragments in the pre-written log area, storing the log fragments by adopting a solid state disk, establishing a master-slave copy relationship among a plurality of log fragments according to configuration of the copy manager, and asynchronously uploading the message data in the log fragments to the main storage area;

Creating a message buffer area, wherein the message buffer area is used for storing hot message data, the number of message partitions is adjusted according to the prediction result of the load predictor, a consumer reads the hot message data from the message buffer area based on a message position point, when target message data does not exist in the message buffer area, the consumer acquires cold data from the main storage area according to the message position point and decompresses the cold data, the coordinator records the consumption progress of the consumer, and the data integrity is verified based on the check code.

Dividing a data processing architecture into call links, setting a socket monitor at a network communication layer to process a client connection request, creating a message transmission channel and configuring transmission parameters, constructing a routing table at a task distribution layer to map the corresponding relation between a processor and a message type, establishing a task queue, setting a task priority scheduling mechanism, and distributing tasks to the corresponding processors according to the message type;

The method comprises the steps of deploying a copy manager in a processing scheduling layer to realize partition master-slave copy, managing partition and consumption groups by a configuration coordinator, collecting system resource use data by a deployment load predictor, creating a stream storage library in a storage layer, dividing a pre-write log area and a main storage area, wherein the pre-write log area adopts a solid state disk storage medium, and the main storage area adopts object storage service as persistent storage.

Calculating the number of log fragments based on the total capacity of a pre-written log area, distributing a unique identifier for each log fragment, organizing the log fragments into a linked list structure according to time sequence, setting a message offset and a time stamp index in the log fragments, configuring a reserved space threshold of the log fragments, establishing a mapping relation between the log fragments and physical storage equipment, dividing storage barrels by a main storage area according to a time range, and establishing a metadata index for each storage barrel;

Configuring master-slave replication parameters of a replication manager, establishing a synchronization mechanism of partition replicas, constructing a partition allocation table and a consumption group management table in a coordinator, recording consumption group member information and consumption progress, acquiring system resource indexes such as processor utilization rate, memory occupancy rate, disk read-write rate and the like by a load predictor, constructing a time sequence prediction model to calculate resource utilization trend, and generating a load balancing strategy based on a prediction result.

Analyzing message heads and message bodies of batch message data sent by a producer, counting data distribution characteristics of the message bodies, calculating data characteristic indexes such as entropy values, repeatability, numerical ranges and the like of the message bodies, establishing a compression algorithm evaluation matrix, calculating compression rates and processing time delays of different compression algorithms according to the data characteristic indexes, and selecting an optimal compression algorithm in the evaluation matrix;

The selected compression algorithm is applied to batch message data, a compressed byte stream is generated, CRC32 check codes of the compressed byte stream are calculated, the compressed data length, the compression algorithm identification and the check codes are packed into metadata heads, the metadata heads and the compressed byte stream are assembled into data blocks, and available log fragments are selected in a pre-write log area to write the data blocks.

Further, the method further comprises the steps of distributing a solid state disk storage space for log fragments, configuring storage parameters of the log fragments, setting a storage buffer area size and a disk brushing strategy, selecting slave nodes based on configuration information of a copy manager, establishing a copy channel between the master node and the slave node, sending log fragment data to the slave nodes according to batch size, and returning confirmation information to the master node after the slave nodes verify data integrity;

Establishing a message data uploading task queue, scanning the confirmed copied data area in the log partition according to a preset time interval, grouping and packaging the messages in the data area according to a time range, calculating a storage path of a data packet, asynchronously uploading the data packet to a storage barrel corresponding to a main storage area through an object storage interface, and updating a storage position index of the data packet.

Further, the method further comprises the steps of distributing a memory space for a message cache region, constructing a cache index table to record the mapping relation between message sites and storage addresses, setting the upper limit of capacity and elimination strategy of the cache region, establishing a hot message scoring mechanism, calculating a message hotness score based on the access frequency and time attenuation factor of the message, and loading message data with the hotness score exceeding a preset threshold value into the cache region;

And monitoring read-write load data of the message partition, wherein the load predictor trains a prediction model based on historical load data, calculates a load trend in a future time window, triggers partition capacity expansion when the predicted load exceeds a preset threshold, redistributes the message data in the existing partition to a newly added partition according to a consistent hash algorithm, and updates a partition routing table.

Further, the method comprises the steps that a consumer sends a site query request to a coordinator, the coordinator returns storage position information corresponding to a message site, the consumer searches target message data in a message cache area according to the storage position information, if the target message data does not exist in the cache area, a corresponding data packet is obtained from a main storage area, a metadata header in the data packet is read to analyze a compression algorithm identifier, and a corresponding decompression method is called to restore the message data;

And after the verification is passed, the message data is returned to the consumer, the consumer submits the consumption site to the coordinator after processing the message, the coordinator records the consumption site to the site management table, and the site management table is regularly persisted to the storage system.

In a second aspect, the present application provides a kafka deposit separation data processing apparatus comprising:

The architecture building module is used for building a multi-layer data processing architecture, the multi-layer data processing architecture comprises a network communication layer, a task distribution layer, a processing scheduling layer and a storage layer, the storage layer comprises a stream storage library, the stream storage library comprises a pre-write log area and a main storage area, the pre-write log area is divided into a plurality of log fragments, the main storage area realizes data persistence through object storage service, the processing scheduling layer comprises a copy manager, a coordinator and a load predictor, the copy manager is responsible for data copy synchronization, the coordinator is responsible for distribution and group management, and the load predictor is used for building a prediction model based on the utilization rate of system resources;

the message storage module is used for receiving batch message data sent by a producer, calculating the data characteristics of the batch message data, selecting a compression algorithm according to the data characteristics, executing compression coding on the batch message data to generate a check code, writing the compressed message data into log fragments in the pre-written log area, storing the log fragments by adopting a solid state disk, establishing a master-slave copy relationship among a plurality of log fragments according to the configuration of the copy manager, and asynchronously uploading the message data in the log fragments to the main storage area;

And the message cache module is used for creating a message cache area, the message cache area is used for storing hot message data, the number of message partitions is adjusted according to the prediction result of the load predictor, a consumer reads the hot message data from the message cache area based on a message position point, when target message data does not exist in the message cache area, the consumer acquires cold data from the main storage area according to the message position point and decompresses the cold data, the coordinator records the consumption progress of the consumer, and the data integrity is verified based on the check code.

In a third aspect, the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the kafka separate data processing method when the program is executed.

In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the kafka calculation split data processing method.

In a fifth aspect, the application provides a computer program product comprising computer programs/instructions which when executed by a processor implement the steps of the kafka memory separation data processing method.

According to the technical scheme, the application provides a data processing method and device with kafka calculation separation, and the data processing method and device comprises four layers of network communication, task distribution, processing scheduling and storage through construction of a multi-layer data processing architecture. And the cold and hot data separation is realized through the layered design of the pre-written log area and the main storage area, and the intelligent scheduling of system resources is realized by combining a copy manager, a coordinator and a load predictor. And the self-adaptive compression algorithm and the checking mechanism are adopted to ensure the data transmission efficiency and integrity, a message buffer area is created to store hot spot data, and flexible access of cold and hot data is realized based on message sites. The method effectively solves the defects of the traditional technology in terms of memory coupling, storage efficiency, load balancing and the like, and remarkably improves the performance and expandability of the message queue system.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a data processing method of kafka memory separation in an embodiment of the application;

FIG. 2 is a block diagram of a kafka memory separation data processing apparatus in an embodiment of the present application;

fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Reference numerals:

An electronic device 9600, a central processor 9100, a memory 9140, a communication module 9110, an input unit 9120, an audio processor 9130, a display 9160, a power supply 9170, a buffer memory 9141, an application/function storage portion 9142, a data storage portion 9143, a driver storage portion 9144, an antenna 9111, a speaker 9131, and a microphone 9132.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The technical scheme of the application obtains, stores, uses, processes and the like the data, which all meet the relevant regulations of national laws and regulations.

In view of the problems existing in the prior art, the application provides a data processing method and device for kafka memory separation, which are used for constructing a multi-layer data processing architecture, including four layers of network communication, task distribution, processing scheduling and storage. And the cold and hot data separation is realized through the layered design of the pre-written log area and the main storage area, and the intelligent scheduling of system resources is realized by combining a copy manager, a coordinator and a load predictor. And the self-adaptive compression algorithm and the checking mechanism are adopted to ensure the data transmission efficiency and integrity, a message buffer area is created to store hot spot data, and flexible access of cold and hot data is realized based on message sites. The method effectively solves the defects of the traditional technology in terms of memory coupling, storage efficiency, load balancing and the like, and remarkably improves the performance and expandability of the message queue system.

In order to effectively solve the defects of the traditional technology in terms of memory coupling, memory efficiency, load balancing and the like, the application provides an embodiment of a data processing method for kafka memory separation, which specifically comprises the following steps:

Step S101, constructing a multi-layer data processing architecture, wherein the multi-layer data processing architecture comprises a network communication layer, a task distribution layer, a processing scheduling layer and a storage layer, the storage layer comprises a stream storage library, the stream storage library comprises a pre-write log area and a main storage area, a plurality of log fragments are divided in the pre-write log area, the main storage area realizes data persistence through object storage service, the processing scheduling layer comprises a copy manager, a coordinator and a load predictor, the copy manager is responsible for data copy synchronization, the coordinator is responsible for distribution and group management, and the load predictor is used for establishing a prediction model based on the utilization rate of system resources;

optionally, the embodiment constructs a data processing architecture with explicit hierarchical division based on the architecture features of Kafka. In the network communication layer, high-performance network communication is realized by adopting an asynchronous IO framework based on Netty, and a plurality of processors, including a coder-decoder, an idle detector and a compression processor, are organized through a Pipeline (Pipeline) mechanism. Each processor independently completes specific functions, and the processors work cooperatively through an event transmission mechanism, so that high concurrency processing capacity of a network layer is realized.

The embodiment realizes an intelligent routing mechanism at a task distribution layer. And constructing a message partition routing table through a consistent hash algorithm to ensure that messages are uniformly distributed among partitions. The routing table adopts a double-layer structure, wherein the first layer is the mapping from the theme to the partition, and the second layer is the mapping from the partition to the processor. When partition expansion or contraction is needed, the affected messages are rerouted through the rebalancing operation of the consistency hash ring, and the data migration cost is minimized.

The present embodiment designs an innovative processing scheduling layer structure. The copy manager employs a Raft protocol based consistency replication mechanism to ensure data consistency through a pre-written log (WAL). Each partition maintains an independent replication group, members in the group detect node states through a heartbeat mechanism, and when a main node fails, a new main node is selected through an election algorithm. The coordinator is responsible for the management of the consumption groups, realizes a consumer balancing algorithm, and ensures that the partition load is balanced and distributed to consumers.

The present embodiment implements an LSTM (Long Short-Term Memory) -based deep learning model in the load predictor. The model receives multidimensional time sequence data as input, and comprises indexes such as CPU utilization rate, memory occupancy rate, disk IO rate, network bandwidth and the like. The time dependence of these indicators is captured by a multi-layer LSTM network, predicting the resource usage trend within a future time window. The prediction result is used for guiding the partitioning expansion and contraction capacity decision, and effectively preventing the overload of the system.

The present embodiment creates a dual layer memory structure in the memory layer. The pre-writing log area is stored by a solid state disk, and high-speed writing is realized by a zero-copy technology. The size of the log slices is dynamically adjusted according to expected throughput, each slice containing an index file and a data file. The index file records a mapping of message offsets to physical locations, supporting fast message locating. The data file adopts an additional writing mode to ensure sequential IO performance. The main storage area realizes data persistence through the object storage service, and the hot data is stored in the storage layer with higher performance preferentially by adopting a layered storage strategy.

The embodiment realizes an efficient data synchronization mechanism. And the main storage area adopts an asynchronous uploading strategy, sets a data uploading task queue, and uploads the data in the pre-written log area to the object storage in batches according to a predefined time interval. And in the uploading process, a multi-level caching strategy is used, frequently accessed data blocks are cached in a memory, and the access frequency of object storage is reduced. Meanwhile, an intelligent data preheating mechanism is realized, and the data blocks which are possibly accessed are predicted according to the access mode and are loaded into the cache in advance.

The embodiment constructs a complete monitoring and alarming system. Standardized measurement index acquisition points are established among layers, and the running state of the system is monitored in real time. When an abnormal condition is detected, such as network delay exceeding a threshold value, insufficient storage space and the like, operation and maintenance personnel are notified through a multi-level alarm mechanism. Meanwhile, an automatic fault recovery mechanism is realized, and recovery operation can be automatically executed for common fault scenes.

Through the technical innovation, the key problems of poor expansibility, insufficient data reliability, low resource utilization rate and the like caused by storage computing coupling in the traditional Kafka architecture are effectively solved. In practical application, the scheme can support a large-scale distributed message processing scene, and the expandability and the resource utilization rate of the system are obviously improved through a memory and calculation separation framework. The method is particularly suitable for scenes with higher requirements on data reliability and processing performance, such as finance, electronic commerce and the like, and the overall performance of the message processing system is improved through multi-layer architecture design and intelligent scheduling strategies.

Step S102, receiving batch message data sent by a producer, calculating data characteristics of the batch message data, selecting a compression algorithm according to the data characteristics, executing compression coding on the batch message data to generate a check code, writing the compressed message data into log fragments in the pre-written log area, storing the log fragments by adopting a solid state disk, establishing a master-slave copy relationship among a plurality of log fragments according to configuration of a copy manager, and asynchronously uploading the message data in the log fragments to the main storage area;

Optionally, the embodiment adopts a batch processing mechanism on the message receiving process, and the producer organizes the messages by adopting ProducerBatch objects, and controls the data volume of single batch messages through a configurable batch size parameter. The message receiving buffer zone adopts a ring buffer zone design, the writing and the extraction of data are controlled through a read-write pointer, and when the data volume of the buffer zone reaches a batch threshold or the waiting time exceeds a configured delay threshold, the processing flow of batch messages is triggered.

The embodiment realizes a self-adaptive data characteristic analysis mechanism. And carrying out multidimensional feature extraction on the batch message data, wherein the multidimensional feature extraction comprises repeated pattern analysis, entropy calculation and numerical distribution statistics. The repeated pattern is identified by the LZ77 algorithm of the sliding window, and the occurrence frequency of repeated substrings in the data is calculated.

The entropy value H is calculated by the formula h= - Σ (pi×log2 (Pi)), where Pi represents the probability of occurrence of different byte values, and lower entropy values represent stronger compressibility of the data. The numerical distribution adopts histogram statistics to identify the concentration and dispersion degree of the data.

The embodiment designs an intelligent compression algorithm selection strategy. And (3) establishing a compression algorithm characteristic matrix which comprises common compression algorithms such as LZ4, snappy, ZSTD and the like. For each algorithm, its compression ratio and compression speed index under different data characteristics are recorded. And comprehensively considering the data characteristics, the algorithm performance and the system resource state through a multi-factor decision model, and selecting an optimal compression algorithm. For example, LZ4 algorithm is preferred for log data with significant repetitive patterns, and compression algorithm specific to the values is preferred for time series data in the value distribution set.

The embodiment realizes an efficient compression processing flow. The compression process adopts multithread parallel processing, and the compression is executed in parallel after the batch message data are fragmented. The compression result comprises a compressed data block and a metadata header, wherein the metadata header records information such as the type of a compression algorithm, the size of original data, the size after compression and the like. And meanwhile, a CRC32 check code is generated, and the check code is used for calculating and covering the compressed data and the metadata header, so that the integrity of the data is ensured.

The present embodiment builds a reliable data writing mechanism. The pre-writing log area is stored by a solid state disk, and the writing performance is improved by an asynchronous writing mode. The log fragments are written in an additional writing mode, and each fragment maintains a writing position pointer to ensure sequential writing. The writing process uses a memory mapping file technology, and the system call overhead is reduced. And meanwhile, write-in buffer management is realized, and when the buffer utilization exceeds a threshold value, asynchronous disk brushing operation is triggered.

The present embodiment designs an innovative master-slave replication mechanism. The replica manager maintains data consistency of the master-slave nodes through a Raft-based consensus protocol. After receiving the writing request, the master node firstly writes the data into the local log fragment and then sends the data to all the slave nodes in parallel. By adopting the pipeline replication technology, the next batch is sent while waiting for the confirmation of the previous batch, thereby improving replication efficiency. After receiving the data from the node, the slave node verifies the integrity of the data, writes the data into the local log partition, and returns a confirmation message to the master node.

The embodiment realizes an intelligent data uploading strategy. The primary storage area adopts an object storage service to support long-term persistence of data. And the data uploading adopts an asynchronous mode, and an uploading task queue is set to manage uploading requests. And dynamically adjusting the uploading frequency according to the use condition of the pre-written log area, and improving the uploading frequency when the log slicing use ratio is higher. The uploading process supports breakpoint continuous uploading, and when the uploading is interrupted due to a network problem, the uploading can be continued from the breakpoint position.

Through the technical innovation, the key problems in message processing, such as low data compression efficiency, storage performance bottleneck, insufficient data reliability and the like, are effectively solved. In practical application, the scheme can adapt to different types of message data, and the data processing efficiency and reliability are remarkably improved through an intelligent compression strategy and a reliable storage mechanism. The method is particularly suitable for large-scale log collection, time sequence data processing and other scenes, and the overall performance of the message processing system is improved through a multi-level storage architecture and an asynchronous processing mechanism. The system and innovation of the scheme can adapt to the application requirements of enterprises with different scales, and the comprehensive improvement of the message processing efficiency is realized through continuous optimization and improvement.

And step 103, creating a message buffer area, wherein the message buffer area is used for storing hot message data, regulating the number of message partitions according to the prediction result of the load predictor, reading the hot message data from the message buffer area by a consumer based on a message position, acquiring cold data from the main storage area and decompressing according to the message position by the consumer when target message data does not exist in the message buffer area, recording the consumption progress of the consumer by the coordinator, and verifying the data integrity based on the check code.

Optionally, the embodiment realizes a multi-level cache management architecture based on the message access feature. The message buffer area adopts the out-of-heap memory design, and directly allocates the memory space through DirectByteBuffer, so that the limit of JVM heap memory management is avoided. The buffer area divides a plurality of memory pages according to a fixed size, each memory page maintains an independent memory mapping table, and records the position information of the message in the physical memory. The data is read quickly by the zero-copy technology, so that the memory copy cost is obviously reduced.

The present embodiment designs an innovative hotspot message identification mechanism. And counting the access frequency of the message by adopting a sliding time window, wherein the size of the window can be dynamically adjusted according to the service characteristics. A hotness score is calculated for each message, and the score calculation formula is as follows:

Score = Frequency×Recency×Weight,

Where Frequency is the access Frequency, recency is the time decay factor, weight is the message priority Weight. The natural decay process of the message heat with time is simulated by the exponential decay function, and the real-time heat change of the message is accurately reflected.

The embodiment realizes an intelligent partition dynamic adjustment strategy. The load predictor analyzes the historical load data through a deep learning model, and establishes a time sequence prediction network based on LSTM. The input features comprise indexes such as partition read-write QPS, message accumulation amount, consumption delay and the like, and the time sequence dependency relationship among the indexes is learned through a multi-layer LSTM network. And outputting a load trend in a predicted future time window by the model, and triggering partition capacity expansion operation when the predicted load exceeds a preset threshold value.

The embodiment constructs an efficient message reading flow. The consumer firstly obtains the consumption site information from the coordinator, and the coordinator maintains a site mapping table of the consumption group and records the consumption progress of each consumer. The consumer calculates the logical offset of the message according to the location information, and rapidly locates the position of the message in the buffer through multi-level searching. If the target message is not in the buffer area, a cold data loading flow is started, and the corresponding compressed data packet is acquired from the main storage area.

The present embodiment designs a reliable data integrity verification mechanism. In the message decompression process, firstly, the metadata header of the data packet is analyzed, and the compression algorithm type and the check code information are obtained. And selecting a corresponding decompression method according to the type of the compression algorithm, calculating a CRC32 check value of the message data after decompression is completed, and comparing the CRC32 check value with a check code in the metadata header. If the verification fails, triggering a retry mechanism to acquire the data from the storage system again, so as to ensure the integrity of the data.

The embodiment realizes an optimized cache elimination strategy. The buffer space is managed by adopting a modified LRU-K algorithm, the time stamp of the latest K accesses of each message is recorded, and the future access probability of the message is more accurately predicted through the access records. When the buffer space is insufficient, messages with lower access probability are preferentially eliminated. Meanwhile, a preloading mechanism is realized, and the message which is possibly accessed is predicted according to the access mode of the message and is loaded into the cache area in advance.

The embodiment builds a complete consumption progress management mechanism. The coordinator periodically persists the consumption site to the storage system, and a two-phase commit protocol is employed to ensure atomicity of the site update. The site information comprises fields such as a consumption group ID, a topic partition, a consumption site, a timestamp and the like, and concurrent update is processed through a version number mechanism. When the consumer restarts or re-equalizes, the previous consumption schedule can be quickly restored.

By the technical innovation, the key problems of high access delay, low cache utilization rate, unbalanced load and the like of the hot spot message in the distributed message system are effectively solved. In practical application, the scheme can intelligently identify hot spot messages and optimize access paths, and the throughput and response speed of message processing are remarkably improved through multi-level caching and dynamic partition adjustment. The method is particularly suitable for scenes with obvious hot spot access characteristics, such as social media hot spot events, e-commerce second killing activities and the like, and the performance optimization of the message processing system is realized through accurate hot spot identification and efficient cache management. The systematic and innovative design of the scheme can adapt to the message processing requirements of different scales, and the comprehensive improvement of a message system is realized through continuous optimization and improvement.

As can be seen from the above description, the kafka separate data processing method provided by the embodiment of the present application can construct a multi-layer data processing architecture, including four layers of network communication, task distribution, processing scheduling and storage. And the cold and hot data separation is realized through the layered design of the pre-written log area and the main storage area, and the intelligent scheduling of system resources is realized by combining a copy manager, a coordinator and a load predictor. And the self-adaptive compression algorithm and the checking mechanism are adopted to ensure the data transmission efficiency and integrity, a message buffer area is created to store hot spot data, and flexible access of cold and hot data is realized based on message sites. The method effectively solves the defects of the traditional technology in terms of memory coupling, storage efficiency, load balancing and the like, and remarkably improves the performance and expandability of the message queue system.

In an embodiment of the kafka deposit separation data processing method of the present application, the following may be specifically included:

Step S201, dividing a data processing architecture into call links, setting a socket monitor in a network communication layer to process a client connection request, creating a message transmission channel and configuring transmission parameters, constructing a routing table in a task distribution layer to map the corresponding relation between a processor and a message type, establishing a task queue, setting a task priority scheduling mechanism, and distributing tasks to the corresponding processor according to the message type;

Step S202, a copy manager is deployed in a processing scheduling layer to realize partition master-slave copy, a configuration coordinator manages partition allocation and consumption groups, a deployment load predictor collects system resource use data, a stream storage library is created in a storage layer, a pre-write log area and a main storage area are divided, the pre-write log area adopts a solid state disk storage medium, and the main storage area adopts object storage service as persistent storage.

Optionally, the embodiment adopts a layered calling link design, so that the full flow tracking from the access to the processing of the request is realized. And constructing an event driven framework by adopting a Reactor mode in a network communication layer, wherein a main Reactor thread is responsible for receiving a client connection request, and a sub-Reactor thread pool processes data transceiving of the established connection. The efficient IO multiplexing is realized through the epoll mechanism, and the connection processing capacity is remarkably improved. Each connection is assigned a unique session ID for tracking the lifecycle of the request.

The present embodiment implements an adaptive flow control mechanism in the message transmission channel. The transmission rate is dynamically adjusted through a sliding window algorithm, and the window size is automatically adjusted according to the network delay and the packet loss rate. The transmission parameters include TCP send buffer size, receive buffer size, heartbeat interval, etc., which may be dynamically optimized according to actual network conditions. Meanwhile, the message slicing transmission is realized, and the oversized message is sliced, so that the reliability of transmission is ensured.

The present embodiment designs an efficient task routing mechanism. The routing table adopts a multi-level hash structure, wherein the first level is based on message subject hash, the second level is based on message type hash, and finally the routing table is mapped to a specific processor. In order to improve the searching efficiency, the routing table supports periodic optimization, and the routing item with higher access frequency is migrated to the quick access area. The processor pool adopts elastic telescopic design, and the number of processors can be dynamically adjusted according to the load condition.

The embodiment builds an intelligent task scheduling system. The task queue adopts a multi-stage feedback queue design, and tasks with different priorities enter different queue levels. Priority calculation takes into account a number of factors including the timeliness requirements of the message, the user service level, queue latency, etc. Through dynamic time slice allocation, high-priority tasks can be ensured to be processed quickly, and starvation phenomenon of low-priority tasks is prevented.

The embodiment realizes a reliable copy management mechanism. The copy manager implements master-slave replication based on the modified Raft protocol, introducing parallel replication optimization, allowing slave nodes to write multiple log entries in parallel. The replication process uses a pipelined design, sending the next batch of data while waiting for the acknowledgement of the previous batch. Meanwhile, an intelligent fault transfer mechanism is realized, and when the master node fails, a new master node can be rapidly elected.

The present embodiment designs an innovative partition management strategy. The coordinator maintains a global partition allocation table, and adopts a consistent hash algorithm to perform partition allocation. The allocation process takes into account a number of constraints, brooker load balancing, rack awareness, data locality, etc. When partition rebalancing occurs, data transfer overhead is minimized by incremental migration. The consumer group management adopts a distributed session mechanism, so that quick perception and processing of consumer faults are ensured.

The embodiment builds an accurate load monitoring system. The load predictor collects system indexes through JMX, cgroups and other ways, including CPU utilization rate, memory occupation, disk IO, network throughput and the like. The collected data is stored through a time sequence database, and multi-dimensional index aggregation analysis is established. The prediction model adopts an LSTM neural network, input characteristics comprise historical load data and a system event log, and the output predicts the resource use trend in a future time window.

The embodiment realizes an efficient storage management mechanism. The pre-write log area utilizes the high IOPS characteristic of the solid state disk, and the parallel write performance is improved by adopting a multi-queue design. The log file is fragmented according to a fixed size, and the read-write operation is accelerated through a memory mapping technology. The main storage area adopts object storage service, so that hot layered storage of data is realized, hot data is stored in a storage layer with higher performance preferentially, and cold data is automatically migrated to a storage layer with lower cost.

The key problems of limited connection processing capacity, unbalanced task scheduling, insufficient data reliability and the like in the distributed message system are effectively solved through the technical innovation. In practical application, the scheme can support large-scale message processing requirements, and the throughput and the reliability of the system are obviously improved through a multi-layer architecture and intelligent scheduling. The method is particularly suitable for the scenes of financial transactions, data processing of the Internet of things and the like, and the overall performance of the message system is improved through a flexible storage strategy and a reliable copying mechanism. The systematicness and innovation of the scheme can adapt to the application requirements of enterprises with different scales, and the comprehensive upgrading of the message processing system is realized through continuous optimization and improvement.

Step 301, calculating the number of log fragments based on the total capacity of a pre-written log area, distributing a unique identifier for each log fragment, organizing the log fragments into a linked list structure according to time sequence, setting a message offset and a time stamp index in the log fragments, configuring a reserved space threshold of the log fragments, establishing a mapping relation between the log fragments and a physical storage device, dividing storage barrels by a main storage area according to a time range, and establishing a metadata index for each storage barrel;

Step S302, configuring master-slave replication parameters of a replication manager, establishing a synchronization mechanism of partition replicas, constructing a partition allocation table and a consumption group management table in a coordinator, recording consumption group member information and consumption progress, acquiring system resource indexes such as processor utilization rate, memory occupancy rate, disk read-write rate and the like by a load predictor, constructing a time sequence prediction model to calculate resource utilization trend, and generating a load balancing strategy based on a prediction result.

Optionally, the embodiment adopts a dynamic allocation policy in log partition management. The initial number of fragments is calculated by the ratio of the total capacity of the pre-written log area to the individual fragment reference size. The size of the fragments adopts a logarithmic growth mode, the size of the new fragments is 1.5 times of that of the last fragment, and the design can adapt to the growth trend of the message flow. Each slice is assigned a 64-bit unique identifier, including a time stamp, a sequence number, and node ID information, ensuring uniqueness in a distributed environment.

The embodiment realizes an efficient slicing index structure. The log fragments are connected through a doubly linked list, and each fragment node contains reference pointers of the front fragments and the rear fragments. And establishing a sparse index in the fragments, and establishing an index item every 4KB of message data, wherein the index item comprises message offset, physical position and timestamp information. And the index items are organized by adopting the jump table structure, so that the quick search of the O (log n) time complexity is realized. Meanwhile, index caches of the hot spot fragments are maintained in the memory, and the frequently accessed searching performance is improved.

The present embodiment designs an innovative spatial management mechanism. The head space threshold is calculated dynamically taking into account the historical write rate and the remaining capacity. When the partition utilization reaches a threshold, an asynchronous migration mechanism is triggered to migrate old data to the primary storage area. The migration process adopts a batch copy strategy, and the migration efficiency is improved through multithreading parallel transmission. Meanwhile, intelligent defragmentation is realized, small fragments are combined regularly, and the utilization rate of storage space is optimized.

The present embodiment builds reliable bucket management. The main memory area divides memory buckets according to a day-level time range, and each memory bucket contains a plurality of data files. The metadata index of the bucket adopts a B+ tree structure, and the index item comprises information such as a time range, a message offset range, a file path and the like. And the index storage space is optimized through a prefix compression technology, an incremental updating mechanism of the index is realized, and the maintenance cost of the index is reduced.

The embodiment realizes an accurate copy synchronization mechanism. The copy manager employs a modified Raft protocol, which introduces parallel copy optimization. The replication parameters include batch size, replication thread count, heartbeat interval, etc., which can be dynamically adjusted according to network conditions. The synchronization process uses a pipeline design, allows data of multiple batches to be sent simultaneously, and significantly improves replication efficiency. Meanwhile, an intelligent catch-up mechanism is realized, and the latest data can be quickly synchronized by the lagged copies.

The present embodiment designs an innovative partition management strategy. The partition allocation table is realized by adopting a consistent hash ring, and the allocation equilibrium is improved by a virtual node technology. The consumption group management table records session information, partition allocation status and consumption sites of consumers. The site management adopts a two-stage commit protocol to ensure that the site management can be restored to a consistent state when a fault occurs. And meanwhile, incremental adjustment of consumer rebalancing is realized, and the influence of the rebalancing process on the service is minimized.

The embodiment constructs an intelligent load prediction system. Through multi-dimension index collection, a complete resource portrait is established. The collected indexes comprise CPU utilization rate, memory occupancy rate, disk IO and the like at the system level, and request queue length, processing delay and the like at the application level. The prediction model adopts an LSTM neural network, and captures the correlation among different indexes through an attention mechanism. The model training adopts a sliding window strategy, and the window size is dynamically adjusted according to the load change period.

The embodiment realizes an adaptive load balancing strategy. Calculating a load score based on the prediction result, wherein the score calculation formula is as follows:

Score = w1CPU + w2Memory + w3IO + w4Network,

Where wi is a weight factor and Network characterizes the Network broadband rate. When the load fraction of a certain node exceeds a threshold value, a load balancing operation is triggered. The balancing process considers the principle of data locality, preferentially migrates the load to the node with the data copy, and reduces the data transmission overhead.

Through the technical innovation, the key problems of low storage space management efficiency, poor data synchronization performance, unbalanced load and the like in the distributed storage system are effectively solved. In practical application, the scheme can support large-scale message storage and processing requirements, and the performance and reliability of the system are remarkably improved through intelligent partition management and predictive load balancing. The method is particularly suitable for scenes requiring high throughput and low delay, such as real-time data analysis, log collection and the like, and the overall performance of the storage system is improved through a multi-level optimization strategy. The system and innovation of the scheme can adapt to the application requirements of enterprises with different scales, and the comprehensive upgrading of the storage system is realized through continuous optimization and improvement.

Step S401, analyzing message heads and message bodies of batch message data sent by a producer, counting data distribution characteristics of the message bodies, calculating data characteristic indexes such as entropy values, repeatability, numerical range and the like of the message bodies, establishing a compression algorithm evaluation matrix, calculating compression rates and processing time delays of different compression algorithms according to the data characteristic indexes, and selecting an optimal compression algorithm in the evaluation matrix;

Step S402, applying the selected compression algorithm to batch message data, generating a compressed byte stream, calculating CRC32 check code of the compressed byte stream, packing the compressed data length, the compression algorithm identification and the check code into metadata header, assembling the metadata header and the compressed byte stream into data blocks, and selecting available log fragments in a pre-write log area to write the data blocks.

Optionally, in this embodiment, a zero copy technique is used in the message parsing stage, and the message data is directly accessed through memory mapping. The message header analysis adopts a fixed-length field design, contains information such as message length, version number, theme, partition and the like, and rapidly locates the positions of all fields through displacement operation. The message body adopts variable Length coding, uses TLV (Type-Length-Value) format storage, supports a nested structure and adapts to a complex message format. The parsing process realizes parallel processing, and a plurality of parsing threads process different message batches simultaneously.

The embodiment realizes accurate data characteristic analysis. The entropy calculation adopts a sliding window method, and the window size can be dynamically adjusted according to the message size.

The entropy calculation formula is h= - Σ (pi×log2 (Pi)), where Pi represents the probability that the byte value i occurs. The repeatability analysis uses a modified LZ77 algorithm to identify repeated data segments by maintaining a sliding window of fixed size. The numerical range analysis adopts histogram statistics, divides the numerical distribution into a plurality of intervals, and calculates the data density of each interval.

The present embodiment designs an innovative compression algorithm evaluation mechanism. The evaluation matrix comprises a plurality of dimensions, such as compression rate, compression speed, decompression speed, memory occupation, and the like. Each dimension is assigned a weight coefficient, and the weight value can be dynamically adjusted according to the application scene. For example, in a scene with high real-time requirements, the compression speed weight increases accordingly. The evaluation process uses a sample testing method to extract representative samples from the message batch for compression testing.

The present embodiment builds an adaptive algorithm selection strategy. And establishing a decision tree model according to the message characteristics and the evaluation result. The branch nodes of the decision tree comprise entropy threshold, repetition threshold and other conditions, and the leaf nodes correspond to specific compression algorithms. By traversing the decision tree, the most appropriate compression algorithm is quickly selected. Meanwhile, an algorithm caching mechanism is realized, and message batches with similar characteristics can be multiplexed with the previous algorithm selection result.

The embodiment realizes an efficient compression processing flow. After the algorithm is selected, a multithreading parallel compression strategy is adopted. Message batches are fragmented according to a fixed size, with each thread being responsible for the compression task of one fragment. The compression result is transferred through the shared memory queue, so that data copying among threads is avoided. Meanwhile, the dynamic adjustment of the compression level is realized, and the proper compression level is selected according to the load condition of the CPU.

The present embodiment designs a reliable data verification mechanism. The CRC32 check is realized by hardware acceleration, and the calculation speed is obviously improved by utilizing the CRC32 instruction of the modern CPU. The verification process covers the compressed data and metadata header, ensuring the integrity of the data. Meanwhile, an incremental verification mechanism is realized, and when the data block is partially modified, only the verification value of the modified part is needed to be recalculated.

This embodiment builds a complete metadata management. The metadata header is in a compact binary format, reducing storage overhead by bit-domain techniques. The compression algorithm identification is represented by an enumeration value, and rapid algorithm identification is supported. The data length field adopts variable length coding, and is suitable for data blocks with different sizes. Meanwhile, the version control of the metadata is realized, and the backward compatibility of the metadata format is supported.

The embodiment realizes an optimized data write strategy. The selection of a log slice is based on a number of factors including the remaining space of the slice, the write load, the data locality, etc. The writing process adopts an asynchronous mode, a plurality of data blocks are accumulated through the writing buffer area and then written in batches, and the writing efficiency is improved. And meanwhile, a pre-writing mechanism is realized, and before the actual writing of the data, the meta information of the data block is written into a pre-writing log, so that the durability of the data is ensured.

The key problems in message compression processing, such as inaccurate compression algorithm selection, low compression efficiency, insufficient data integrity assurance and the like, are effectively solved through the technical innovation. In practical application, the scheme can intelligently select the optimal compression algorithm, and the throughput and the reliability of message processing are obviously improved through parallel processing and an optimized write strategy. The method is particularly suitable for large-scale message processing scenes, such as log collection, monitoring data processing and the like, and the overall performance of the message compression and storage system is improved through a multi-level optimization strategy. The systematic and innovative scheme can adapt to different types of message data, and the comprehensive improvement of the message processing efficiency is realized through continuous optimization and improvement.

Step S501, distributing solid state disk storage space for log slicing, configuring storage parameters of the log slicing, setting storage buffer size and a disk brushing strategy, selecting slave nodes based on configuration information of a copy manager, establishing a copy channel between the master node and the slave node, sending log slicing data to the slave nodes according to batch size, and returning confirmation information to the master node after the slave nodes verify data integrity;

Step S502, establishing a message data uploading task queue, scanning the confirmed copied data area in the log partition according to a preset time interval, grouping and packaging the messages in the data area according to a time range, calculating a storage path of a data packet, asynchronously uploading the data packet to a storage barrel corresponding to a main storage area through an object storage interface, and updating a storage position index of the data packet.

Optionally, in this embodiment, a dynamic management policy is adopted in the solid state disk space allocation. Continuous physical blocks are distributed through a direct IO interface of the file system, and the additional overhead brought by the caching of the file system is avoided. The storage space is pre-allocated according to the fixed size, the initial size of each fragment is 1GB, and dynamic expansion is supported. The distribution process uses an asynchronous mode, a plurality of fragments can be created in parallel, and the resource initialization efficiency is improved.

The embodiment realizes the optimized storage parameter configuration. The storage buffer area adopts direct memory allocation, bypasses the JVM heap memory management, and each partition is provided with an independent writing buffer area and a pre-reading buffer area. The buffer size is dynamically adjusted according to the historical IO mode, the write buffer default is set to 64MB, and the pre-read buffer size is optimized based on the sequential read mode. The swiping strategy combines a time threshold and a data amount threshold, and triggers a swiping operation when the accumulated data exceeds the threshold or reaches a time interval.

This embodiment designs a reliable copy channel mechanism. The copy channel is realized based on the Netty framework and adopts a zero copy transmission technology. The channel establishment process includes identity authentication and capability negotiation, supporting SSL encryption and compression transmission. The data transmission adopts a pipeline design, so that a plurality of batches can be sent simultaneously, and the size of each batch is dynamically adjusted according to network conditions. And the sending rate is controlled through the sliding window, so that network congestion is avoided.

The embodiment constructs a complete data verification flow. After receiving the data from the node, the integrity of the data packet is first verified, including checksum verification and sequence number checking. After verification is passed, the data is written into the local storage, and the data reliability is ensured by adopting a double-writing mechanism. The confirmation message contains the batch ID and the writing position information, and is returned to the master node in an asynchronous mode, so that confirmation delay is reduced.

The embodiment realizes efficient uploading task management. The task queues adopt a priority queue structure, and the priorities are calculated based on time sensitivity and storage pressure of the data. The queue processing adopts a multithreading model, and the processing thread number is dynamically adjusted according to the CPU core number and the IO load. The task execution process realizes a failure retry mechanism, and an exponential backoff strategy is adopted in the retry interval.

The present embodiment designs an innovative data packing strategy. The grouping process considers multiple dimensions, time ranges, data volumes, access patterns, etc. The optimal packing size is determined through aggregation analysis, and the storage efficiency and the retrieval performance are balanced. The packing process supports parallel processing, and data in multiple time ranges can be packed simultaneously. Meanwhile, an incremental packaging mechanism is realized, and only the newly added data area is processed.

The embodiment constructs an intelligent storage path calculation scheme. The path generation takes the time attribute and the service attribute of the data into consideration, and adopts a multi-level directory structure. And the path calculation adopts a consistent hash algorithm, so that the related data are ensured to be stored in similar positions, and the batch reading efficiency is improved. Meanwhile, path conflict processing is realized, and the homonymous conflict is solved by adding the unique identifier.

The embodiment realizes a reliable asynchronous uploading mechanism. The object storage interface encapsulates retry and concurrency control logic supporting fragment upload and breakpoint resume. The uploading process adopts stream processing, and the transmission efficiency is improved through pipelining design. Meanwhile, the uploading progress tracking is realized, and the progress query and the cancel operation with fine granularity are supported. The selection of buckets is based on a lifecycle policy of the data, supporting automatic migration of the data between different storage tiers.

Through the technical innovation, the key problems of storage performance bottleneck, data replication delay, low uploading efficiency and the like in the distributed storage system are effectively solved. In practical application, the scheme can support large-scale data storage and rapid replication requirements, and the throughput and the reliability of the system are obviously improved through a multi-level optimization strategy. The method is particularly suitable for scenes requiring high-performance storage and disaster recovery, such as financial transactions, log archiving and the like, and the overall performance of the storage system is improved through an optimized storage strategy and a reliable replication mechanism. The system and innovation of the scheme can adapt to the application requirements of enterprises with different scales, and the comprehensive upgrading of the storage system is realized through continuous optimization and improvement.

step S601, memory space is allocated for a message cache region, a mapping relation between a message position point and a storage address is constructed by a cache index table, a capacity upper limit and an elimination strategy of the cache region are set, a hot spot message scoring mechanism is established, a message hotness score is calculated based on the access frequency and time attenuation factor of the message, and message data with the hotness score exceeding a preset threshold value is loaded to the cache region;

And step S602, monitoring read-write load data of the message partition, training a prediction model by the load predictor based on the historical load data, calculating a load trend in a future time window, triggering partition capacity expansion when the predicted load exceeds a preset threshold value, redistributing the message data in the existing partition to a new partition according to a consistent hash algorithm, and updating a partition routing table.

Optionally, the present embodiment employs a multi-level cache architecture on memory space allocation. And the message buffer area is managed through the out-of-heap memory pool, so that the GC pressure is avoided. The memory pool adopts a sectional lock design, the whole memory space is divided into a plurality of areas, each area is controlled by an independent lock, and thread competition is reduced. The allocation granularity of the buffer area is dynamically adjusted according to the size of the message, and the memory block management is carried out by adopting a partner algorithm, so that the memory fragments are effectively reduced.

The embodiment realizes an efficient cache index structure. The method adopts a three-level index design, wherein the first level is mapping from a location to a partition, quick searching is realized by using a jump table, the second level is a message index in the partition, a B+ tree structure is adopted, and the third level is memory address mapping of specific messages and uses a direct addressing table. The multi-level index structure can realize the O (log n) searching efficiency in mass information and support the range query operation.

The present embodiment designs an innovative message popularity scoring mechanism. The hotness score calculation formula is:

Score=f×e (- λt) ×w, where F is the access frequency, λ is the time decay coefficient, t is the time interval from the last access, and W is the message weight. The weighting factors take into account the traffic priority and data size of the message. The scoring process adopts sliding window statistics, the window size can be dynamically adjusted according to the business characteristics, and the real-time heat change of the message is accurately reflected.

The embodiment constructs an intelligent cache elimination strategy. And combining the LRU-K algorithm and the ARC algorithm to realize an adaptive elimination mechanism. Two cache queues, a frequent access queue and a general access queue, are maintained. The message enters a general queue when being accessed for the first time, and is upgraded to a frequent queue when the access times exceed a threshold value. The size proportion of the two queues is dynamically adjusted through the access mode, so that the optimal utilization of the cache space is realized.

The embodiment realizes an accurate load monitoring system. And acquiring multi-dimensional load indexes such as message throughput, delay distribution, queue depth and the like. The data acquisition adopts a self-adaptive sampling strategy, and the sampling frequency is increased when the load change is remarkable. And storing historical load data through a time sequence database, establishing a multi-dimensional load portrait, and providing training data for the prediction model.

The present embodiment designs an advanced load prediction model. With LSTM neural network architecture, the input features include historical load metrics, temporal features (hours, weeks, holidays) and system events. The model captures the correlation among different characteristics through a multi-head attention mechanism and accurately predicts the short-term and medium-term load trend. The training process adopts an online learning mode, continuously optimizes model parameters and adapts to the dynamic change of a load mode.

The embodiment constructs a reliable partition capacity expansion mechanism. The capacity expansion triggering condition comprehensively considers a plurality of factors such as predicted load, resource utilization rate, performance index and the like. The expansion process adopts a progressive strategy, a new partition is firstly created and preheated, and then message data is redistributed through a consistent hash algorithm. The hash algorithm uses a virtual node technique to ensure data distribution uniformity while minimizing data migration.

The embodiment realizes an efficient data migration flow. And in the migration process, a batch processing mode is adopted, and a migration task queue is set to manage migration requests. Migration efficiency is improved through multi-thread parallel processing, and each thread is responsible for data migration of one batch. Meanwhile, a migration progress tracking and fault recovery mechanism is realized, and the reliability of a migration process is ensured.

The key problems of low cache hit rate, inaccurate load prediction, poor capacity expansion efficiency and the like in the distributed message system are effectively solved through the technical innovation. In practical application, the scheme can intelligently identify hot spot messages and optimize a caching strategy, and the response speed and expansibility of the system are remarkably improved through accurate load prediction and an efficient capacity expansion mechanism. The method is particularly suitable for scenes with obvious access hotspots, such as social media pushing, real-time recommendation and the like, and the overall performance of the message system is improved through a multi-level optimization strategy. The systematicness and innovation of the scheme can adapt to the application requirements of enterprises with different scales, and the comprehensive upgrading of the message processing system is realized through continuous optimization and improvement.

Step S701, a consumer sends a site query request to a coordinator, the coordinator returns storage position information corresponding to a message site, the consumer searches target message data in a message cache area according to the storage position information, if the target message data does not exist in the cache area, a corresponding data packet is acquired from a main storage area, a metadata header in the data packet is read to analyze a compression algorithm identifier, and a corresponding decompression method is called to restore the message data;

step S702, the check code of the decompressed message data is calculated and compared with the check code recorded in the metadata header, the message data is returned to the consumer after the check is passed, the consumer submits the consumption site to the coordinator after processing the message, the coordinator records the consumption site to the site management table, and the site management table is regularly persisted to the storage system.

Optionally, the present embodiment employs a multi-level caching mechanism on the bit-point query processing. The coordinator maintains the memory cache of the site map, and quick searching is realized by adopting ConcurrentHashMap. The caches are organized by consuming groups and partitions, using a double-layer hash table structure. For the location information of the hot spot query, the access pressure of the coordinator is reduced through the local cache. The consistency of the cache is maintained through a version number mechanism, and when the sites are updated, the local cache is informed to each consumer through a broadcasting mechanism.

The embodiment realizes intelligent storage position analysis. The storage position information adopts a compression coding format and comprises information such as storage hierarchy, file path, offset and the like. The parsing process uses a state machine design that is capable of handling different versions of the encoding format. By means of the locality principle of the location information, the possible accessed adjacent message locations are predicted and prefetched, and subsequent query delays are reduced.

The present embodiment designs an efficient cache lookup strategy. The multi-level index is adopted in the message buffer area, whether the message possibly exists or not is judged rapidly through a bloom filter, and then a specific position is located through a skip list. And starting asynchronous preloading when the cache is not hit, predicting and loading the message which is possibly accessed later according to the access mode of the message, and improving the cache hit rate.

The present embodiment builds a reliable packet read mechanism. When the data packet is acquired from the main storage area, an asynchronous IO mode is adopted, and the concurrent reading quantity is controlled through the thread pool. Reading of the data packet supports a range request, acquiring only the necessary data portion. For large data packets, a slicing reading strategy is realized, and excessive data is prevented from being loaded to a memory at one time.

The embodiment realizes the optimized decompression processing flow. The selection of decompression algorithms supports a variety of compression algorithms such as LZ4, snapy, ZSTD, etc. by fast positioning through the identification in the metadata header. The decompression process adopts thread pool management, and for a large number of small messages, a batch decompression strategy is adopted to improve efficiency. And meanwhile, decompression caching is realized, and the same compressed data block is decompressed only once.

The embodiment designs a strict data checking flow. The check code calculation adopts a hardware-accelerated CRC32 algorithm, and the calculation speed is improved through SIMD instructions. The verification process overlays the complete contents of the message, including metadata and payload data. When the verification fails, a multi-stage retry mechanism is implemented, first attempting to re-read the data, and if the verification fails continuously, acquiring the data from the replica node.

The present embodiment builds a reliable site submission mechanism. Consumers submit sites asynchronously, accumulating a certain number of sites through a local buffer, and then submitting the sites in batches. The commit process achieves idempotent, and repeated commit of the same site does not lead to errors. The site submission adopts a two-stage submission protocol, so that the consistency in a distributed environment is ensured.

The embodiment realizes an efficient site management strategy. The site management table adopts a segmented lock design, so that lock competition of concurrent updating is reduced. The table records detailed location information including consumer group ID, topic, partition, location value, timestamp, etc. The persistence process adopts an incremental mode, and only the changed site information is saved. The reliability of the site data is ensured through a pre-write log mechanism.

By the technical innovation, the key problems in the consumption of the distributed message, such as high delay of site inquiry, low data decompression efficiency, unreliable site management and the like, are effectively solved. In practical application, the scheme can support a large-scale message consumption scene, and the efficiency and the reliability of message consumption are obviously improved through multi-level caching and optimized data processing flow. The method is particularly suitable for scenes requiring high throughput and low delay, such as real data processing, event stream processing and the like, and the overall performance of the message consumption system is improved through a multi-level optimization strategy. The systematicness and innovation of the scheme can adapt to the application requirements of enterprises with different scales, and the comprehensive upgrading of the message consumption system is realized through continuous optimization and improvement.

In order to effectively solve the shortcomings of the conventional technology in terms of memory coupling, memory efficiency, load balancing and the like, and significantly improve the performance and expandability of a message queue system, the present application provides an embodiment of a kafka memory separated data processing apparatus for implementing all or part of the contents of the kafka memory separated data processing method, and referring to fig. 2, the kafka memory separated data processing apparatus specifically includes the following contents:

The architecture construction module 10 is configured to construct a multi-layer data processing architecture, where the multi-layer data processing architecture includes a network communication layer, a task distribution layer, a processing scheduling layer, and a storage layer, the storage layer includes a flow repository, the flow repository includes a pre-write log area and a main storage area, the pre-write log area is divided into a plurality of log slices, the main storage area implements data persistence through an object storage service, the processing scheduling layer includes a copy manager, a coordinator, and a load predictor, the copy manager is responsible for data copy synchronization, the coordinator is responsible for allocation and group management, and the load predictor builds a prediction model based on a system resource utilization rate;

The message storage module 20 is configured to receive batch message data sent by a producer, calculate data characteristics of the batch message data, select a compression algorithm according to the data characteristics, perform compression encoding on the batch message data to generate a check code, write the compressed message data into log slices in the pre-written log area, store the log slices by using a solid state disk, establish a master-slave copy relationship between a plurality of log slices according to configuration of the copy manager, and asynchronously upload the message data in the log slices to the main storage area;

And the message cache module 30 is used for creating a message cache area, the message cache area is used for storing hot message data, the number of message partitions is adjusted according to the prediction result of the load predictor, a consumer reads the hot message data from the message cache area based on a message location, when no target message data exists in the message cache area, the consumer acquires cold data from the main storage area according to the message location and decompresses the cold data, the coordinator records the consumption progress of the consumer, and the data integrity is verified based on the check code.

As can be seen from the above description, the kafka memory and storage separated data processing apparatus provided by the embodiments of the present application is capable of constructing a multi-layer data processing architecture including four layers of network communication, task distribution, process scheduling and storage. And the cold and hot data separation is realized through the layered design of the pre-written log area and the main storage area, and the intelligent scheduling of system resources is realized by combining a copy manager, a coordinator and a load predictor. And the self-adaptive compression algorithm and the checking mechanism are adopted to ensure the data transmission efficiency and integrity, a message buffer area is created to store hot spot data, and flexible access of cold and hot data is realized based on message sites. The method effectively solves the defects of the traditional technology in terms of memory coupling, storage efficiency, load balancing and the like, and remarkably improves the performance and expandability of the message queue system.

In order to effectively solve the shortcomings of the traditional technology in terms of memory coupling, memory efficiency, load balancing and the like and significantly improve the performance and expandability of a message queue system, the application provides an embodiment of an electronic device for realizing all or part of contents in the data processing method for kafka memory separation, wherein the electronic device specifically comprises the following contents:

The device comprises a processor (processor), a memory (memory), a communication interface (Communications Interface) and a bus, wherein the processor, the memory and the communication interface are used for completing communication among the processor, the memory and the communication interface through the bus, the communication interface is used for realizing information transmission between a data processing device with separate kafka memory and related equipment such as a core service system, a user terminal and a related database, and the logic controller can be a desktop computer, a tablet personal computer, a mobile terminal and the like. In this embodiment, the logic controller may be implemented with reference to the embodiment of the data processing method of kafka calculation separation in the embodiment and the embodiment of the data processing apparatus of kafka calculation separation, and the contents thereof are incorporated herein, and the repetition is omitted.

It is understood that the user terminal may include a smart phone, a tablet electronic device, a network set top box, a portable computer, a desktop computer, a Personal Digital Assistant (PDA), a vehicle-mounted device, a smart wearable device, etc. Wherein, intelligent wearing equipment can include intelligent glasses, intelligent wrist-watch, intelligent bracelet etc..

In practical applications, the kafka stores part of the separate data processing method, which may be executed on the electronic device side as described above, or all operations may be performed in the client device. Specifically, the selection may be made according to the processing capability of the client device, and restrictions of the use scenario of the user. The application is not limited in this regard. If all operations are performed in the client device, the client device may further include a processor.

The client device may have a communication module (i.e. a communication unit) and may be connected to a remote server in a communication manner, so as to implement data transmission with the server. The server may include a server on the side of the task scheduling center, and in other implementations may include a server of an intermediate platform, such as a server of a third party server platform having a communication link with the task scheduling center server. The server may include a single computer device, a server cluster formed by a plurality of servers, or a server structure of a distributed device.

Fig. 3 is a schematic block diagram of a system configuration of an electronic device 9600 according to an embodiment of the present application. As shown in fig. 3, the electronic device 9600 can include a central processor 9100 and a memory 9140, the memory 9140 being coupled to the central processor 9100. It is noted that this fig. 3 is exemplary, and that other types of structures may be used in addition to or in place of the structures to implement telecommunications functions or other functions.

In one embodiment, the kafka memory separate data processing method functions may be integrated into the central processor 9100. The central processor 9100 may be configured to perform the following control:

As can be seen from the above description, the electronic device provided by the embodiment of the present application constructs a multi-layer data processing architecture, including four layers of network communication, task distribution, processing scheduling and storage. And the cold and hot data separation is realized through the layered design of the pre-written log area and the main storage area, and the intelligent scheduling of system resources is realized by combining a copy manager, a coordinator and a load predictor. And the self-adaptive compression algorithm and the checking mechanism are adopted to ensure the data transmission efficiency and integrity, a message buffer area is created to store hot spot data, and flexible access of cold and hot data is realized based on message sites. The method effectively solves the defects of the traditional technology in terms of memory coupling, storage efficiency, load balancing and the like, and remarkably improves the performance and expandability of the message queue system.

In another embodiment, the data processing apparatus with the kafka calculation separation may be configured separately from the central processor 9100, for example, the data processing apparatus with the kafka calculation separation may be configured as a chip connected to the central processor 9100, and the data processing method function with the kafka calculation separation is implemented by control of the central processor.

As shown in fig. 3, the electronic device 9600 may further include a communication module 9110, an input unit 9120, an audio processor 9130, a display 9160, and a power supply 9170. It is noted that the electronic device 9600 does not necessarily include all the components shown in fig. 3, and furthermore, the electronic device 9600 may include components not shown in fig. 3, to which reference is made in the prior art.

As shown in fig. 3, the central processor 9100, sometimes referred to as a controller or operational control, may include a microprocessor or other processor device and/or logic device, which central processor 9100 receives inputs and controls the operation of the various components of the electronic device 9600.

The memory 9140 may be, for example, one or more of a buffer, a flash memory, a hard drive, a removable media, a volatile memory, a non-volatile memory, or other suitable device. The information about failure may be stored, and a program for executing the information may be stored. And the central processor 9100 can execute the program stored in the memory 9140 to realize information storage or processing, and the like.

The input unit 9120 provides input to the central processor 9100. The input unit 9120 is, for example, a key or a touch input device. The power supply 9170 is used to provide power to the electronic device 9600. The display 9160 is used for displaying display objects such as images and characters. The display may be, for example, but not limited to, an LCD display.

The memory 9140 may be a solid state memory such as Read Only Memory (ROM), random Access Memory (RAM), SIM card, etc. But also a memory which holds information even when powered down, can be selectively erased and provided with further data, an example of which is sometimes referred to as EPROM or the like. The memory 9140 may also be some other type of device. The memory 9140 includes a buffer memory 9141 (sometimes referred to as a buffer). The memory 9140 may include an application/function storage portion 9142, the application/function storage portion 9142 storing application programs and function programs or a flow for executing operations of the electronic device 9600 by the central processor 9100.

The memory 9140 may also include a data store 9143, the data store 9143 for storing data, such as contacts, digital data, pictures, sounds, and/or any other data used by an electronic device. The driver storage portion 9144 of the memory 9140 may include various drivers of the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging applications, address book applications, etc.).

The communication module 9110 is a transmitter/receiver that transmits and receives signals via the antenna 9111. The communication module 9110 (transmitter/receiver) is coupled to the central processor 9100 to provide input signals and receive output signals, as in the case of conventional mobile communication terminals.

Based on different communication technologies, a plurality of communication modules 9110, such as a cellular network module, a bluetooth module, and/or a wireless local area network module, etc., may be provided in the same electronic device. The communication module 9110 (transmitter/receiver) is also coupled to a speaker 9131 and a microphone 9132 via an audio processor 9130 to provide audio output via the speaker 9131 and to receive audio input from the microphone 9132 to implement usual telecommunications functions. The audio processor 9130 can include any suitable buffers, decoders, amplifiers and so forth. In addition, the audio processor 9130 is also coupled to the central processor 9100 so that sound can be recorded locally through the microphone 9132 and sound stored locally can be played through the speaker 9131.

An embodiment of the present application also provides a computer-readable storage medium capable of realizing all steps in the data processing method in which the execution subject is the kafka calculation separation of the server or the client in the above embodiment, the computer-readable storage medium having stored thereon a computer program which, when executed by a processor, realizes all steps in the data processing method in which the execution subject is the kafka calculation separation of the server or the client in the above embodiment, for example, the processor realizes the following steps when executing the computer program:

From the above description, it can be seen that the computer readable storage medium provided by the embodiments of the present application includes four levels of network communication, task distribution, process scheduling and storage by constructing a multi-layer data processing architecture. And the cold and hot data separation is realized through the layered design of the pre-written log area and the main storage area, and the intelligent scheduling of system resources is realized by combining a copy manager, a coordinator and a load predictor. And the self-adaptive compression algorithm and the checking mechanism are adopted to ensure the data transmission efficiency and integrity, a message buffer area is created to store hot spot data, and flexible access of cold and hot data is realized based on message sites. The method effectively solves the defects of the traditional technology in terms of memory coupling, storage efficiency, load balancing and the like, and remarkably improves the performance and expandability of the message queue system.

The embodiment of the present application also provides a computer program product capable of realizing all the steps in the data processing method of kafka calculation separation in which the execution subject is a server or a client in the above embodiment, the computer program/instructions realizing the steps of the data processing method of kafka calculation separation when executed by a processor, for example, the computer program/instructions realizing the steps of:

From the above description, it can be seen that the computer program product provided by the embodiments of the present application includes four levels of network communication, task distribution, process scheduling and storage by constructing a multi-level data processing architecture. And the cold and hot data separation is realized through the layered design of the pre-written log area and the main storage area, and the intelligent scheduling of system resources is realized by combining a copy manager, a coordinator and a load predictor. And the self-adaptive compression algorithm and the checking mechanism are adopted to ensure the data transmission efficiency and integrity, a message buffer area is created to store hot spot data, and flexible access of cold and hot data is realized based on message sites. The method effectively solves the defects of the traditional technology in terms of memory coupling, storage efficiency, load balancing and the like, and remarkably improves the performance and expandability of the message queue system.

It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While the principles and embodiments of the present invention have been described in detail in the foregoing application of the principles and embodiments of the present invention, the above examples are provided for the purpose of aiding in the understanding of the principles and concepts of the present invention and may be varied in many ways by those of ordinary skill in the art in light of the teachings of the present invention, and the above descriptions should not be construed as limiting the invention.

Claims

1. A data processing method for Kafka storage and computing separation, characterized in that the method comprises:

Construct a multi-layer data processing architecture, the multi-layer data processing architecture includes a network communication layer, a task distribution layer, a processing scheduling layer and a storage layer, the storage layer includes a stream storage library, the stream storage library includes a pre-write log area and a main storage area, a plurality of log shards are divided in the pre-write log area, the main storage area realizes data persistence through an object storage service, the processing scheduling layer includes a replica manager, a coordinator and a load predictor, the replica manager is responsible for data replica synchronization, the coordinator is responsible for partition allocation and group management, and the load predictor establishes a prediction model based on system resource utilization;

Receive batch message data sent by the producer, calculate the data features of the batch message data, select a compression algorithm according to the data features, perform compression encoding on the batch message data to generate a checksum, write the compressed message data into the log shards in the pre-write log area, the log shards are stored in solid-state hard disks, establish a master-slave replication relationship between the multiple log shards according to the configuration of the replica manager, and asynchronously upload the message data in the log shards to the main storage area;

A message buffer is created, where the message buffer is used to store hot message data. The number of message partitions is adjusted according to the prediction result of the load predictor. The consumer reads the hot message data from the message buffer based on the message site. When the target message data does not exist in the message buffer, the consumer obtains and decompresses the cold data from the main storage area according to the message site. The coordinator records the consumer's consumption progress and verifies the data integrity based on the check code.

2. The data processing method for separating storage and computing of Kafka according to claim 1 is characterized in that a multi-layer data processing architecture is constructed, the multi-layer data processing architecture includes a network communication layer, a task distribution layer, a processing scheduling layer and a storage layer, the storage layer includes a stream storage library, the stream storage library includes a pre-write log area and a main storage area, including:

Divide the data processing architecture into call links, set up socket listeners in the network communication layer to handle client connection requests, create message transmission channels and configure transmission parameters, build routing tables in the task distribution layer to map the correspondence between processors and message types, establish task queues, set up task priority scheduling mechanisms, and distribute tasks to corresponding processors according to message types;

Deploy a replica manager at the processing and scheduling layer to implement partition master-slave replication, configure a coordinator to manage partition allocation and consumer groups, deploy a load predictor to collect system resource usage data, create a stream repository at the storage layer and divide the pre-write log area and the main storage area. The pre-write log area uses a solid-state hard disk storage medium, and the main storage area uses an object storage service as persistent storage.

3. The data processing method for Kafka storage and computing separation according to claim 1 is characterized in that the pre-write log area is divided into multiple log shards, the main storage area realizes data persistence through the object storage service, the processing scheduling layer includes a replica manager, a coordinator and a load predictor, the replica manager is responsible for data replica synchronization, the coordinator is responsible for partition allocation and group management, and the load predictor establishes a prediction model based on system resource utilization, including:

The number of log shards is calculated based on the total capacity of the pre-write log area, a unique identifier is assigned to each log shard, the log shards are organized into a linked list structure in chronological order, a message offset and a timestamp index are set in the log shards, a reserved space threshold of the log shards is configured, a mapping relationship between the log shards and the physical storage device is established, the main storage area is divided into buckets according to the time range, and a metadata index is created for each bucket;

Configure the master-slave replication parameters of the replica manager, establish a synchronization mechanism for partition replicas, build a partition allocation table and a consumer group management table in the coordinator, record consumer group member information and consumption progress, and the load predictor collects processor usage, memory occupancy, disk read and write rate system resource indicators, builds a time series prediction model to calculate resource usage trends, and generates a load balancing strategy based on the prediction results.

4. The data processing method for Kafka storage and computing separation according to claim 1 is characterized in that the receiving of batch message data sent by the producer, calculating the data features of the batch message data, selecting a compression algorithm according to the data features, performing compression encoding on the batch message data to generate a checksum, and writing the compressed message data into the log shard in the write-ahead log area comprises:

Parse the message header and message body of the batch message data sent by the producer, count the data distribution characteristics of the message body, calculate the entropy value, repetition, and value range data characteristic indicators of the message body, establish a compression algorithm evaluation matrix, calculate the compression rate and processing delay of different compression algorithms based on the data characteristic indicators, and select the optimal compression algorithm in the evaluation matrix;

Apply the selected compression algorithm to the batch message data to generate a compressed byte stream, calculate the CRC32 check code of the compressed byte stream, package the compressed data length, compression algorithm identifier, and check code into a metadata header, assemble the metadata header and the compressed byte stream into a data block, and select an available log segment in the pre-write log area to write the data block.

5. The data processing method for Kafka storage and computing separation according to claim 1 is characterized in that the log shards are stored in solid-state hard disks, a master-slave replication relationship is established between the multiple log shards according to the configuration of the replica manager, and the message data in the log shards are asynchronously uploaded to the main storage area, including:

Allocate SSD storage space for log shards, configure storage parameters for log shards, set storage buffer size and flushing strategy, select slave nodes based on the configuration information of the replica manager, establish replication channels between master and slave nodes, send log shard data to slave nodes according to batch size, and return confirmation information to the master node after the slave node verifies data integrity;

Establish a message data upload task queue, scan the confirmed replicated data area in the log shard at preset time intervals, group and package the messages in the data area according to the time range, calculate the storage path of the data packet, asynchronously upload the data packet to the storage bucket corresponding to the main storage area through the object storage interface, and update the storage location index of the data packet.

6. The data processing method for Kafka storage and computing separation according to claim 1 is characterized in that the creation of a message buffer area, the message buffer area is used to store hot message data, and the number of message partitions is adjusted according to the prediction result of the load predictor, including:

Allocate memory space for the message cache, build a cache index table to record the mapping relationship between message locations and storage addresses, set the cache capacity limit and elimination strategy, establish a hot message scoring mechanism, calculate the message heat score based on the message access frequency and time decay factor, and load the message data whose heat score exceeds the preset threshold into the cache;

Monitor the read and write load data of the message partition. The load predictor trains a prediction model based on historical load data, calculates the load trend in the future time window, triggers partition expansion when the predicted load exceeds a preset threshold, redistributes the message data in the existing partition to the newly added partition according to the consistent hashing algorithm, and updates the partition routing table.

7. The data processing method for Kafka storage and computing separation according to claim 1 is characterized in that the consumer reads the hot message data from the message buffer area based on the message site. When the target message data does not exist in the message buffer area, the consumer obtains the cold data from the main storage area according to the message site and decompresses it. The coordinator records the consumer's consumption progress and verifies the data integrity based on the check code, including:

The consumer sends a location query request to the coordinator, and the coordinator returns the storage location information corresponding to the message location. The consumer searches for the target message data in the message cache according to the storage location information. If the target message data does not exist in the cache, the consumer obtains the corresponding data packet from the main storage area, reads the metadata header in the data packet to parse the compression algorithm identifier, and calls the corresponding decompression method to restore the message data.

The checksum of the decompressed message data is calculated and compared with the checksum recorded in the metadata header. After the checksum passes, the message data is returned to the consumer. After processing the message, the consumer submits the consumption location to the coordinator. The coordinator records the consumption location in the location management table and regularly persists the location management table to the storage system.

8. A data processing device with Kafka storage and computing separation, characterized in that the device comprises:

An architecture building module is used to build a multi-layer data processing architecture, wherein the multi-layer data processing architecture includes a network communication layer, a task distribution layer, a processing scheduling layer and a storage layer, wherein the storage layer includes a stream storage library, the stream storage library includes a pre-write log area and a main storage area, wherein a plurality of log shards are divided in the pre-write log area, and the main storage area realizes data persistence through an object storage service, wherein the processing scheduling layer includes a replica manager, a coordinator and a load predictor, wherein the replica manager is responsible for data replica synchronization, the coordinator is responsible for partition allocation and group management, and the load predictor establishes a prediction model based on system resource utilization;

A message storage module, used to receive batch message data sent by a producer, calculate data features of the batch message data, select a compression algorithm according to the data features, perform compression encoding on the batch message data to generate a checksum, write the compressed message data into a log shard in the pre-write log area, the log shard is stored in a solid-state hard disk, establish a master-slave replication relationship between the plurality of log shards according to the configuration of the replica manager, and asynchronously upload the message data in the log shard to the main storage area;

A message cache module is used to create a message cache area, wherein the message cache area is used to store hot message data, and the number of message partitions is adjusted according to the prediction result of the load predictor. The consumer reads the hot message data from the message cache area based on the message site. When the target message data does not exist in the message cache area, the consumer obtains and decompresses the cold data from the main storage area according to the message site. The coordinator records the consumer's consumption progress and verifies the data integrity based on the check code.

9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the program, the steps of the Kafka storage-computation separation data processing method described in any one of claims 1 to 7 are implemented.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, the steps of the Kafka storage and computing separation data processing method described in any one of claims 1 to 7 are implemented.