CN120123108A - Kafka storage and computing separation data processing method and device - Google Patents

Kafka storage and computing separation data processing method and device Download PDF

Info

Publication number
CN120123108A
CN120123108A CN202510616765.6A CN202510616765A CN120123108A CN 120123108 A CN120123108 A CN 120123108A CN 202510616765 A CN202510616765 A CN 202510616765A CN 120123108 A CN120123108 A CN 120123108A
Authority
CN
China
Prior art keywords
data
message
storage
log
area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202510616765.6A
Other languages
Chinese (zh)
Other versions
CN120123108B (en
Inventor
王珣
陆阳
袁娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fullsee Technology Co ltd
Original Assignee
Fullsee Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fullsee Technology Co ltd filed Critical Fullsee Technology Co ltd
Priority to CN202510616765.6A priority Critical patent/CN120123108B/en
Publication of CN120123108A publication Critical patent/CN120123108A/en
Application granted granted Critical
Publication of CN120123108B publication Critical patent/CN120123108B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/546Message passing systems or structures, e.g. queues
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本申请实施例提供一种kafka存算分离的数据处理方法及装置,通过构建多层数据处理架构,包括网络通信、任务分发、处理调度和存储四个层次。通过预写日志区与主存储区的分层设计实现冷热数据分离,结合副本管理器、协调器和负载预测器实现系统资源的智能调度。采用自适应压缩算法和校验机制保证数据传输效率和完整性,创建消息缓存区存储热点数据,基于消息位点实现冷热数据的灵活访问。该方法有效解决了传统技术在存算耦合、存储效率和负载均衡等方面的不足,显著提升了消息队列系统的性能和可扩展性。

The embodiment of the present application provides a data processing method and device for Kafka storage and computing separation, by constructing a multi-layer data processing architecture, including four levels: network communication, task distribution, processing scheduling and storage. The separation of hot and cold data is achieved through the hierarchical design of the pre-write log area and the main storage area, and the intelligent scheduling of system resources is achieved by combining the replica manager, coordinator and load predictor. An adaptive compression algorithm and verification mechanism are used to ensure the efficiency and integrity of data transmission, create a message cache area to store hot spot data, and realize flexible access to hot and cold data based on the message site. This method effectively solves the shortcomings of traditional technologies in storage and computing coupling, storage efficiency and load balancing, and significantly improves the performance and scalability of the message queue system.

Description

Data processing method and device for kafka memory calculation separation
Technical Field
The application relates to the field of data processing, in particular to a data processing method and device for kafka calculation separation.
Background
The existing Kafka data processing method has obvious defects. The traditional architecture has high coupling degree between storage and calculation, is difficult to realize the elastic expansion of resources, and has the problem of unbalanced data access hot spots.
Furthermore, the prior art has a bottleneck in data storage efficiency. Most systems fail to effectively distinguish cold and hot data, and lack of an intelligent data compression and caching mechanism results in low storage space utilization and large access delay.
Existing systems have technology shortboards in terms of load balancing and fault tolerance mechanisms. The lack of dynamic partition adjustment and prediction capabilities makes it difficult to cope with bursty traffic and system failures. The resolution of these problems is of great importance to improve the performance and reliability of the message queuing system.
Disclosure of Invention
Aiming at the problems in the prior art, the application provides a data processing method and device for kafka calculation separation, which can effectively solve the defects in the aspects of calculation coupling, storage efficiency, load balancing and the like in the traditional technology and remarkably improve the performance and expandability of a message queue system.
In order to solve at least one of the problems, the application provides the following technical scheme:
in a first aspect, the present application provides a data processing method of kafka memory separation, comprising:
The method comprises the steps of constructing a multi-layer data processing architecture, wherein the multi-layer data processing architecture comprises a network communication layer, a task distribution layer, a processing scheduling layer and a storage layer, the storage layer comprises a stream storage library, the stream storage library comprises a pre-write log area and a main storage area, a plurality of log fragments are divided in the pre-write log area, the main storage area realizes data persistence through object storage service, the processing scheduling layer comprises a copy manager, a coordinator and a load predictor, the copy manager is responsible for data copy synchronization, the coordinator is responsible for distribution and group management, and the load predictor is used for establishing a prediction model based on the utilization rate of system resources;
Receiving batch message data sent by a producer, calculating data characteristics of the batch message data, selecting a compression algorithm according to the data characteristics, executing compression coding on the batch message data to generate a check code, writing the compressed message data into log fragments in the pre-written log area, storing the log fragments by adopting a solid state disk, establishing a master-slave copy relationship among a plurality of log fragments according to configuration of the copy manager, and asynchronously uploading the message data in the log fragments to the main storage area;
Creating a message buffer area, wherein the message buffer area is used for storing hot message data, the number of message partitions is adjusted according to the prediction result of the load predictor, a consumer reads the hot message data from the message buffer area based on a message position point, when target message data does not exist in the message buffer area, the consumer acquires cold data from the main storage area according to the message position point and decompresses the cold data, the coordinator records the consumption progress of the consumer, and the data integrity is verified based on the check code.
Dividing a data processing architecture into call links, setting a socket monitor at a network communication layer to process a client connection request, creating a message transmission channel and configuring transmission parameters, constructing a routing table at a task distribution layer to map the corresponding relation between a processor and a message type, establishing a task queue, setting a task priority scheduling mechanism, and distributing tasks to the corresponding processors according to the message type;
The method comprises the steps of deploying a copy manager in a processing scheduling layer to realize partition master-slave copy, managing partition and consumption groups by a configuration coordinator, collecting system resource use data by a deployment load predictor, creating a stream storage library in a storage layer, dividing a pre-write log area and a main storage area, wherein the pre-write log area adopts a solid state disk storage medium, and the main storage area adopts object storage service as persistent storage.
Calculating the number of log fragments based on the total capacity of a pre-written log area, distributing a unique identifier for each log fragment, organizing the log fragments into a linked list structure according to time sequence, setting a message offset and a time stamp index in the log fragments, configuring a reserved space threshold of the log fragments, establishing a mapping relation between the log fragments and physical storage equipment, dividing storage barrels by a main storage area according to a time range, and establishing a metadata index for each storage barrel;
Configuring master-slave replication parameters of a replication manager, establishing a synchronization mechanism of partition replicas, constructing a partition allocation table and a consumption group management table in a coordinator, recording consumption group member information and consumption progress, acquiring system resource indexes such as processor utilization rate, memory occupancy rate, disk read-write rate and the like by a load predictor, constructing a time sequence prediction model to calculate resource utilization trend, and generating a load balancing strategy based on a prediction result.
Analyzing message heads and message bodies of batch message data sent by a producer, counting data distribution characteristics of the message bodies, calculating data characteristic indexes such as entropy values, repeatability, numerical ranges and the like of the message bodies, establishing a compression algorithm evaluation matrix, calculating compression rates and processing time delays of different compression algorithms according to the data characteristic indexes, and selecting an optimal compression algorithm in the evaluation matrix;
The selected compression algorithm is applied to batch message data, a compressed byte stream is generated, CRC32 check codes of the compressed byte stream are calculated, the compressed data length, the compression algorithm identification and the check codes are packed into metadata heads, the metadata heads and the compressed byte stream are assembled into data blocks, and available log fragments are selected in a pre-write log area to write the data blocks.
Further, the method further comprises the steps of distributing a solid state disk storage space for log fragments, configuring storage parameters of the log fragments, setting a storage buffer area size and a disk brushing strategy, selecting slave nodes based on configuration information of a copy manager, establishing a copy channel between the master node and the slave node, sending log fragment data to the slave nodes according to batch size, and returning confirmation information to the master node after the slave nodes verify data integrity;
Establishing a message data uploading task queue, scanning the confirmed copied data area in the log partition according to a preset time interval, grouping and packaging the messages in the data area according to a time range, calculating a storage path of a data packet, asynchronously uploading the data packet to a storage barrel corresponding to a main storage area through an object storage interface, and updating a storage position index of the data packet.
Further, the method further comprises the steps of distributing a memory space for a message cache region, constructing a cache index table to record the mapping relation between message sites and storage addresses, setting the upper limit of capacity and elimination strategy of the cache region, establishing a hot message scoring mechanism, calculating a message hotness score based on the access frequency and time attenuation factor of the message, and loading message data with the hotness score exceeding a preset threshold value into the cache region;
And monitoring read-write load data of the message partition, wherein the load predictor trains a prediction model based on historical load data, calculates a load trend in a future time window, triggers partition capacity expansion when the predicted load exceeds a preset threshold, redistributes the message data in the existing partition to a newly added partition according to a consistent hash algorithm, and updates a partition routing table.
Further, the method comprises the steps that a consumer sends a site query request to a coordinator, the coordinator returns storage position information corresponding to a message site, the consumer searches target message data in a message cache area according to the storage position information, if the target message data does not exist in the cache area, a corresponding data packet is obtained from a main storage area, a metadata header in the data packet is read to analyze a compression algorithm identifier, and a corresponding decompression method is called to restore the message data;
And after the verification is passed, the message data is returned to the consumer, the consumer submits the consumption site to the coordinator after processing the message, the coordinator records the consumption site to the site management table, and the site management table is regularly persisted to the storage system.
In a second aspect, the present application provides a kafka deposit separation data processing apparatus comprising:
The architecture building module is used for building a multi-layer data processing architecture, the multi-layer data processing architecture comprises a network communication layer, a task distribution layer, a processing scheduling layer and a storage layer, the storage layer comprises a stream storage library, the stream storage library comprises a pre-write log area and a main storage area, the pre-write log area is divided into a plurality of log fragments, the main storage area realizes data persistence through object storage service, the processing scheduling layer comprises a copy manager, a coordinator and a load predictor, the copy manager is responsible for data copy synchronization, the coordinator is responsible for distribution and group management, and the load predictor is used for building a prediction model based on the utilization rate of system resources;
the message storage module is used for receiving batch message data sent by a producer, calculating the data characteristics of the batch message data, selecting a compression algorithm according to the data characteristics, executing compression coding on the batch message data to generate a check code, writing the compressed message data into log fragments in the pre-written log area, storing the log fragments by adopting a solid state disk, establishing a master-slave copy relationship among a plurality of log fragments according to the configuration of the copy manager, and asynchronously uploading the message data in the log fragments to the main storage area;
And the message cache module is used for creating a message cache area, the message cache area is used for storing hot message data, the number of message partitions is adjusted according to the prediction result of the load predictor, a consumer reads the hot message data from the message cache area based on a message position point, when target message data does not exist in the message cache area, the consumer acquires cold data from the main storage area according to the message position point and decompresses the cold data, the coordinator records the consumption progress of the consumer, and the data integrity is verified based on the check code.
In a third aspect, the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the kafka separate data processing method when the program is executed.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the kafka calculation split data processing method.
In a fifth aspect, the application provides a computer program product comprising computer programs/instructions which when executed by a processor implement the steps of the kafka memory separation data processing method.
According to the technical scheme, the application provides a data processing method and device with kafka calculation separation, and the data processing method and device comprises four layers of network communication, task distribution, processing scheduling and storage through construction of a multi-layer data processing architecture. And the cold and hot data separation is realized through the layered design of the pre-written log area and the main storage area, and the intelligent scheduling of system resources is realized by combining a copy manager, a coordinator and a load predictor. And the self-adaptive compression algorithm and the checking mechanism are adopted to ensure the data transmission efficiency and integrity, a message buffer area is created to store hot spot data, and flexible access of cold and hot data is realized based on message sites. The method effectively solves the defects of the traditional technology in terms of memory coupling, storage efficiency, load balancing and the like, and remarkably improves the performance and expandability of the message queue system.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a data processing method of kafka memory separation in an embodiment of the application;
FIG. 2 is a block diagram of a kafka memory separation data processing apparatus in an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Reference numerals:
An electronic device 9600, a central processor 9100, a memory 9140, a communication module 9110, an input unit 9120, an audio processor 9130, a display 9160, a power supply 9170, a buffer memory 9141, an application/function storage portion 9142, a data storage portion 9143, a driver storage portion 9144, an antenna 9111, a speaker 9131, and a microphone 9132.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The technical scheme of the application obtains, stores, uses, processes and the like the data, which all meet the relevant regulations of national laws and regulations.
In view of the problems existing in the prior art, the application provides a data processing method and device for kafka memory separation, which are used for constructing a multi-layer data processing architecture, including four layers of network communication, task distribution, processing scheduling and storage. And the cold and hot data separation is realized through the layered design of the pre-written log area and the main storage area, and the intelligent scheduling of system resources is realized by combining a copy manager, a coordinator and a load predictor. And the self-adaptive compression algorithm and the checking mechanism are adopted to ensure the data transmission efficiency and integrity, a message buffer area is created to store hot spot data, and flexible access of cold and hot data is realized based on message sites. The method effectively solves the defects of the traditional technology in terms of memory coupling, storage efficiency, load balancing and the like, and remarkably improves the performance and expandability of the message queue system.
In order to effectively solve the defects of the traditional technology in terms of memory coupling, memory efficiency, load balancing and the like, the application provides an embodiment of a data processing method for kafka memory separation, which specifically comprises the following steps:
Step S101, constructing a multi-layer data processing architecture, wherein the multi-layer data processing architecture comprises a network communication layer, a task distribution layer, a processing scheduling layer and a storage layer, the storage layer comprises a stream storage library, the stream storage library comprises a pre-write log area and a main storage area, a plurality of log fragments are divided in the pre-write log area, the main storage area realizes data persistence through object storage service, the processing scheduling layer comprises a copy manager, a coordinator and a load predictor, the copy manager is responsible for data copy synchronization, the coordinator is responsible for distribution and group management, and the load predictor is used for establishing a prediction model based on the utilization rate of system resources;
optionally, the embodiment constructs a data processing architecture with explicit hierarchical division based on the architecture features of Kafka. In the network communication layer, high-performance network communication is realized by adopting an asynchronous IO framework based on Netty, and a plurality of processors, including a coder-decoder, an idle detector and a compression processor, are organized through a Pipeline (Pipeline) mechanism. Each processor independently completes specific functions, and the processors work cooperatively through an event transmission mechanism, so that high concurrency processing capacity of a network layer is realized.
The embodiment realizes an intelligent routing mechanism at a task distribution layer. And constructing a message partition routing table through a consistent hash algorithm to ensure that messages are uniformly distributed among partitions. The routing table adopts a double-layer structure, wherein the first layer is the mapping from the theme to the partition, and the second layer is the mapping from the partition to the processor. When partition expansion or contraction is needed, the affected messages are rerouted through the rebalancing operation of the consistency hash ring, and the data migration cost is minimized.
The present embodiment designs an innovative processing scheduling layer structure. The copy manager employs a Raft protocol based consistency replication mechanism to ensure data consistency through a pre-written log (WAL). Each partition maintains an independent replication group, members in the group detect node states through a heartbeat mechanism, and when a main node fails, a new main node is selected through an election algorithm. The coordinator is responsible for the management of the consumption groups, realizes a consumer balancing algorithm, and ensures that the partition load is balanced and distributed to consumers.
The present embodiment implements an LSTM (Long Short-Term Memory) -based deep learning model in the load predictor. The model receives multidimensional time sequence data as input, and comprises indexes such as CPU utilization rate, memory occupancy rate, disk IO rate, network bandwidth and the like. The time dependence of these indicators is captured by a multi-layer LSTM network, predicting the resource usage trend within a future time window. The prediction result is used for guiding the partitioning expansion and contraction capacity decision, and effectively preventing the overload of the system.
The present embodiment creates a dual layer memory structure in the memory layer. The pre-writing log area is stored by a solid state disk, and high-speed writing is realized by a zero-copy technology. The size of the log slices is dynamically adjusted according to expected throughput, each slice containing an index file and a data file. The index file records a mapping of message offsets to physical locations, supporting fast message locating. The data file adopts an additional writing mode to ensure sequential IO performance. The main storage area realizes data persistence through the object storage service, and the hot data is stored in the storage layer with higher performance preferentially by adopting a layered storage strategy.
The embodiment realizes an efficient data synchronization mechanism. And the main storage area adopts an asynchronous uploading strategy, sets a data uploading task queue, and uploads the data in the pre-written log area to the object storage in batches according to a predefined time interval. And in the uploading process, a multi-level caching strategy is used, frequently accessed data blocks are cached in a memory, and the access frequency of object storage is reduced. Meanwhile, an intelligent data preheating mechanism is realized, and the data blocks which are possibly accessed are predicted according to the access mode and are loaded into the cache in advance.
The embodiment constructs a complete monitoring and alarming system. Standardized measurement index acquisition points are established among layers, and the running state of the system is monitored in real time. When an abnormal condition is detected, such as network delay exceeding a threshold value, insufficient storage space and the like, operation and maintenance personnel are notified through a multi-level alarm mechanism. Meanwhile, an automatic fault recovery mechanism is realized, and recovery operation can be automatically executed for common fault scenes.
Through the technical innovation, the key problems of poor expansibility, insufficient data reliability, low resource utilization rate and the like caused by storage computing coupling in the traditional Kafka architecture are effectively solved. In practical application, the scheme can support a large-scale distributed message processing scene, and the expandability and the resource utilization rate of the system are obviously improved through a memory and calculation separation framework. The method is particularly suitable for scenes with higher requirements on data reliability and processing performance, such as finance, electronic commerce and the like, and the overall performance of the message processing system is improved through multi-layer architecture design and intelligent scheduling strategies.
Step S102, receiving batch message data sent by a producer, calculating data characteristics of the batch message data, selecting a compression algorithm according to the data characteristics, executing compression coding on the batch message data to generate a check code, writing the compressed message data into log fragments in the pre-written log area, storing the log fragments by adopting a solid state disk, establishing a master-slave copy relationship among a plurality of log fragments according to configuration of a copy manager, and asynchronously uploading the message data in the log fragments to the main storage area;
Optionally, the embodiment adopts a batch processing mechanism on the message receiving process, and the producer organizes the messages by adopting ProducerBatch objects, and controls the data volume of single batch messages through a configurable batch size parameter. The message receiving buffer zone adopts a ring buffer zone design, the writing and the extraction of data are controlled through a read-write pointer, and when the data volume of the buffer zone reaches a batch threshold or the waiting time exceeds a configured delay threshold, the processing flow of batch messages is triggered.
The embodiment realizes a self-adaptive data characteristic analysis mechanism. And carrying out multidimensional feature extraction on the batch message data, wherein the multidimensional feature extraction comprises repeated pattern analysis, entropy calculation and numerical distribution statistics. The repeated pattern is identified by the LZ77 algorithm of the sliding window, and the occurrence frequency of repeated substrings in the data is calculated.
The entropy value H is calculated by the formula h= - Σ (pi×log2 (Pi)), where Pi represents the probability of occurrence of different byte values, and lower entropy values represent stronger compressibility of the data. The numerical distribution adopts histogram statistics to identify the concentration and dispersion degree of the data.
The embodiment designs an intelligent compression algorithm selection strategy. And (3) establishing a compression algorithm characteristic matrix which comprises common compression algorithms such as LZ4, snappy, ZSTD and the like. For each algorithm, its compression ratio and compression speed index under different data characteristics are recorded. And comprehensively considering the data characteristics, the algorithm performance and the system resource state through a multi-factor decision model, and selecting an optimal compression algorithm. For example, LZ4 algorithm is preferred for log data with significant repetitive patterns, and compression algorithm specific to the values is preferred for time series data in the value distribution set.
The embodiment realizes an efficient compression processing flow. The compression process adopts multithread parallel processing, and the compression is executed in parallel after the batch message data are fragmented. The compression result comprises a compressed data block and a metadata header, wherein the metadata header records information such as the type of a compression algorithm, the size of original data, the size after compression and the like. And meanwhile, a CRC32 check code is generated, and the check code is used for calculating and covering the compressed data and the metadata header, so that the integrity of the data is ensured.
The present embodiment builds a reliable data writing mechanism. The pre-writing log area is stored by a solid state disk, and the writing performance is improved by an asynchronous writing mode. The log fragments are written in an additional writing mode, and each fragment maintains a writing position pointer to ensure sequential writing. The writing process uses a memory mapping file technology, and the system call overhead is reduced. And meanwhile, write-in buffer management is realized, and when the buffer utilization exceeds a threshold value, asynchronous disk brushing operation is triggered.
The present embodiment designs an innovative master-slave replication mechanism. The replica manager maintains data consistency of the master-slave nodes through a Raft-based consensus protocol. After receiving the writing request, the master node firstly writes the data into the local log fragment and then sends the data to all the slave nodes in parallel. By adopting the pipeline replication technology, the next batch is sent while waiting for the confirmation of the previous batch, thereby improving replication efficiency. After receiving the data from the node, the slave node verifies the integrity of the data, writes the data into the local log partition, and returns a confirmation message to the master node.
The embodiment realizes an intelligent data uploading strategy. The primary storage area adopts an object storage service to support long-term persistence of data. And the data uploading adopts an asynchronous mode, and an uploading task queue is set to manage uploading requests. And dynamically adjusting the uploading frequency according to the use condition of the pre-written log area, and improving the uploading frequency when the log slicing use ratio is higher. The uploading process supports breakpoint continuous uploading, and when the uploading is interrupted due to a network problem, the uploading can be continued from the breakpoint position.
Through the technical innovation, the key problems in message processing, such as low data compression efficiency, storage performance bottleneck, insufficient data reliability and the like, are effectively solved. In practical application, the scheme can adapt to different types of message data, and the data processing efficiency and reliability are remarkably improved through an intelligent compression strategy and a reliable storage mechanism. The method is particularly suitable for large-scale log collection, time sequence data processing and other scenes, and the overall performance of the message processing system is improved through a multi-level storage architecture and an asynchronous processing mechanism. The system and innovation of the scheme can adapt to the application requirements of enterprises with different scales, and the comprehensive improvement of the message processing efficiency is realized through continuous optimization and improvement.
And step 103, creating a message buffer area, wherein the message buffer area is used for storing hot message data, regulating the number of message partitions according to the prediction result of the load predictor, reading the hot message data from the message buffer area by a consumer based on a message position, acquiring cold data from the main storage area and decompressing according to the message position by the consumer when target message data does not exist in the message buffer area, recording the consumption progress of the consumer by the coordinator, and verifying the data integrity based on the check code.
Optionally, the embodiment realizes a multi-level cache management architecture based on the message access feature. The message buffer area adopts the out-of-heap memory design, and directly allocates the memory space through DirectByteBuffer, so that the limit of JVM heap memory management is avoided. The buffer area divides a plurality of memory pages according to a fixed size, each memory page maintains an independent memory mapping table, and records the position information of the message in the physical memory. The data is read quickly by the zero-copy technology, so that the memory copy cost is obviously reduced.
The present embodiment designs an innovative hotspot message identification mechanism. And counting the access frequency of the message by adopting a sliding time window, wherein the size of the window can be dynamically adjusted according to the service characteristics. A hotness score is calculated for each message, and the score calculation formula is as follows:
Score = Frequency×Recency×Weight,
Where Frequency is the access Frequency, recency is the time decay factor, weight is the message priority Weight. The natural decay process of the message heat with time is simulated by the exponential decay function, and the real-time heat change of the message is accurately reflected.
The embodiment realizes an intelligent partition dynamic adjustment strategy. The load predictor analyzes the historical load data through a deep learning model, and establishes a time sequence prediction network based on LSTM. The input features comprise indexes such as partition read-write QPS, message accumulation amount, consumption delay and the like, and the time sequence dependency relationship among the indexes is learned through a multi-layer LSTM network. And outputting a load trend in a predicted future time window by the model, and triggering partition capacity expansion operation when the predicted load exceeds a preset threshold value.
The embodiment constructs an efficient message reading flow. The consumer firstly obtains the consumption site information from the coordinator, and the coordinator maintains a site mapping table of the consumption group and records the consumption progress of each consumer. The consumer calculates the logical offset of the message according to the location information, and rapidly locates the position of the message in the buffer through multi-level searching. If the target message is not in the buffer area, a cold data loading flow is started, and the corresponding compressed data packet is acquired from the main storage area.
The present embodiment designs a reliable data integrity verification mechanism. In the message decompression process, firstly, the metadata header of the data packet is analyzed, and the compression algorithm type and the check code information are obtained. And selecting a corresponding decompression method according to the type of the compression algorithm, calculating a CRC32 check value of the message data after decompression is completed, and comparing the CRC32 check value with a check code in the metadata header. If the verification fails, triggering a retry mechanism to acquire the data from the storage system again, so as to ensure the integrity of the data.
The embodiment realizes an optimized cache elimination strategy. The buffer space is managed by adopting a modified LRU-K algorithm, the time stamp of the latest K accesses of each message is recorded, and the future access probability of the message is more accurately predicted through the access records. When the buffer space is insufficient, messages with lower access probability are preferentially eliminated. Meanwhile, a preloading mechanism is realized, and the message which is possibly accessed is predicted according to the access mode of the message and is loaded into the cache area in advance.
The embodiment builds a complete consumption progress management mechanism. The coordinator periodically persists the consumption site to the storage system, and a two-phase commit protocol is employed to ensure atomicity of the site update. The site information comprises fields such as a consumption group ID, a topic partition, a consumption site, a timestamp and the like, and concurrent update is processed through a version number mechanism. When the consumer restarts or re-equalizes, the previous consumption schedule can be quickly restored.
By the technical innovation, the key problems of high access delay, low cache utilization rate, unbalanced load and the like of the hot spot message in the distributed message system are effectively solved. In practical application, the scheme can intelligently identify hot spot messages and optimize access paths, and the throughput and response speed of message processing are remarkably improved through multi-level caching and dynamic partition adjustment. The method is particularly suitable for scenes with obvious hot spot access characteristics, such as social media hot spot events, e-commerce second killing activities and the like, and the performance optimization of the message processing system is realized through accurate hot spot identification and efficient cache management. The systematic and innovative design of the scheme can adapt to the message processing requirements of different scales, and the comprehensive improvement of a message system is realized through continuous optimization and improvement.
As can be seen from the above description, the kafka separate data processing method provided by the embodiment of the present application can construct a multi-layer data processing architecture, including four layers of network communication, task distribution, processing scheduling and storage. And the cold and hot data separation is realized through the layered design of the pre-written log area and the main storage area, and the intelligent scheduling of system resources is realized by combining a copy manager, a coordinator and a load predictor. And the self-adaptive compression algorithm and the checking mechanism are adopted to ensure the data transmission efficiency and integrity, a message buffer area is created to store hot spot data, and flexible access of cold and hot data is realized based on message sites. The method effectively solves the defects of the traditional technology in terms of memory coupling, storage efficiency, load balancing and the like, and remarkably improves the performance and expandability of the message queue system.
In an embodiment of the kafka deposit separation data processing method of the present application, the following may be specifically included:
Step S201, dividing a data processing architecture into call links, setting a socket monitor in a network communication layer to process a client connection request, creating a message transmission channel and configuring transmission parameters, constructing a routing table in a task distribution layer to map the corresponding relation between a processor and a message type, establishing a task queue, setting a task priority scheduling mechanism, and distributing tasks to the corresponding processor according to the message type;
Step S202, a copy manager is deployed in a processing scheduling layer to realize partition master-slave copy, a configuration coordinator manages partition allocation and consumption groups, a deployment load predictor collects system resource use data, a stream storage library is created in a storage layer, a pre-write log area and a main storage area are divided, the pre-write log area adopts a solid state disk storage medium, and the main storage area adopts object storage service as persistent storage.
Optionally, the embodiment adopts a layered calling link design, so that the full flow tracking from the access to the processing of the request is realized. And constructing an event driven framework by adopting a Reactor mode in a network communication layer, wherein a main Reactor thread is responsible for receiving a client connection request, and a sub-Reactor thread pool processes data transceiving of the established connection. The efficient IO multiplexing is realized through the epoll mechanism, and the connection processing capacity is remarkably improved. Each connection is assigned a unique session ID for tracking the lifecycle of the request.
The present embodiment implements an adaptive flow control mechanism in the message transmission channel. The transmission rate is dynamically adjusted through a sliding window algorithm, and the window size is automatically adjusted according to the network delay and the packet loss rate. The transmission parameters include TCP send buffer size, receive buffer size, heartbeat interval, etc., which may be dynamically optimized according to actual network conditions. Meanwhile, the message slicing transmission is realized, and the oversized message is sliced, so that the reliability of transmission is ensured.
The present embodiment designs an efficient task routing mechanism. The routing table adopts a multi-level hash structure, wherein the first level is based on message subject hash, the second level is based on message type hash, and finally the routing table is mapped to a specific processor. In order to improve the searching efficiency, the routing table supports periodic optimization, and the routing item with higher access frequency is migrated to the quick access area. The processor pool adopts elastic telescopic design, and the number of processors can be dynamically adjusted according to the load condition.
The embodiment builds an intelligent task scheduling system. The task queue adopts a multi-stage feedback queue design, and tasks with different priorities enter different queue levels. Priority calculation takes into account a number of factors including the timeliness requirements of the message, the user service level, queue latency, etc. Through dynamic time slice allocation, high-priority tasks can be ensured to be processed quickly, and starvation phenomenon of low-priority tasks is prevented.
The embodiment realizes a reliable copy management mechanism. The copy manager implements master-slave replication based on the modified Raft protocol, introducing parallel replication optimization, allowing slave nodes to write multiple log entries in parallel. The replication process uses a pipelined design, sending the next batch of data while waiting for the acknowledgement of the previous batch. Meanwhile, an intelligent fault transfer mechanism is realized, and when the master node fails, a new master node can be rapidly elected.
The present embodiment designs an innovative partition management strategy. The coordinator maintains a global partition allocation table, and adopts a consistent hash algorithm to perform partition allocation. The allocation process takes into account a number of constraints, brooker load balancing, rack awareness, data locality, etc. When partition rebalancing occurs, data transfer overhead is minimized by incremental migration. The consumer group management adopts a distributed session mechanism, so that quick perception and processing of consumer faults are ensured.
The embodiment builds an accurate load monitoring system. The load predictor collects system indexes through JMX, cgroups and other ways, including CPU utilization rate, memory occupation, disk IO, network throughput and the like. The collected data is stored through a time sequence database, and multi-dimensional index aggregation analysis is established. The prediction model adopts an LSTM neural network, input characteristics comprise historical load data and a system event log, and the output predicts the resource use trend in a future time window.
The embodiment realizes an efficient storage management mechanism. The pre-write log area utilizes the high IOPS characteristic of the solid state disk, and the parallel write performance is improved by adopting a multi-queue design. The log file is fragmented according to a fixed size, and the read-write operation is accelerated through a memory mapping technology. The main storage area adopts object storage service, so that hot layered storage of data is realized, hot data is stored in a storage layer with higher performance preferentially, and cold data is automatically migrated to a storage layer with lower cost.
The key problems of limited connection processing capacity, unbalanced task scheduling, insufficient data reliability and the like in the distributed message system are effectively solved through the technical innovation. In practical application, the scheme can support large-scale message processing requirements, and the throughput and the reliability of the system are obviously improved through a multi-layer architecture and intelligent scheduling. The method is particularly suitable for the scenes of financial transactions, data processing of the Internet of things and the like, and the overall performance of the message system is improved through a flexible storage strategy and a reliable copying mechanism. The systematicness and innovation of the scheme can adapt to the application requirements of enterprises with different scales, and the comprehensive upgrading of the message processing system is realized through continuous optimization and improvement.
In an embodiment of the kafka deposit separation data processing method of the present application, the following may be specifically included:
Step 301, calculating the number of log fragments based on the total capacity of a pre-written log area, distributing a unique identifier for each log fragment, organizing the log fragments into a linked list structure according to time sequence, setting a message offset and a time stamp index in the log fragments, configuring a reserved space threshold of the log fragments, establishing a mapping relation between the log fragments and a physical storage device, dividing storage barrels by a main storage area according to a time range, and establishing a metadata index for each storage barrel;
Step S302, configuring master-slave replication parameters of a replication manager, establishing a synchronization mechanism of partition replicas, constructing a partition allocation table and a consumption group management table in a coordinator, recording consumption group member information and consumption progress, acquiring system resource indexes such as processor utilization rate, memory occupancy rate, disk read-write rate and the like by a load predictor, constructing a time sequence prediction model to calculate resource utilization trend, and generating a load balancing strategy based on a prediction result.
Optionally, the embodiment adopts a dynamic allocation policy in log partition management. The initial number of fragments is calculated by the ratio of the total capacity of the pre-written log area to the individual fragment reference size. The size of the fragments adopts a logarithmic growth mode, the size of the new fragments is 1.5 times of that of the last fragment, and the design can adapt to the growth trend of the message flow. Each slice is assigned a 64-bit unique identifier, including a time stamp, a sequence number, and node ID information, ensuring uniqueness in a distributed environment.
The embodiment realizes an efficient slicing index structure. The log fragments are connected through a doubly linked list, and each fragment node contains reference pointers of the front fragments and the rear fragments. And establishing a sparse index in the fragments, and establishing an index item every 4KB of message data, wherein the index item comprises message offset, physical position and timestamp information. And the index items are organized by adopting the jump table structure, so that the quick search of the O (log n) time complexity is realized. Meanwhile, index caches of the hot spot fragments are maintained in the memory, and the frequently accessed searching performance is improved.
The present embodiment designs an innovative spatial management mechanism. The head space threshold is calculated dynamically taking into account the historical write rate and the remaining capacity. When the partition utilization reaches a threshold, an asynchronous migration mechanism is triggered to migrate old data to the primary storage area. The migration process adopts a batch copy strategy, and the migration efficiency is improved through multithreading parallel transmission. Meanwhile, intelligent defragmentation is realized, small fragments are combined regularly, and the utilization rate of storage space is optimized.
The present embodiment builds reliable bucket management. The main memory area divides memory buckets according to a day-level time range, and each memory bucket contains a plurality of data files. The metadata index of the bucket adopts a B+ tree structure, and the index item comprises information such as a time range, a message offset range, a file path and the like. And the index storage space is optimized through a prefix compression technology, an incremental updating mechanism of the index is realized, and the maintenance cost of the index is reduced.
The embodiment realizes an accurate copy synchronization mechanism. The copy manager employs a modified Raft protocol, which introduces parallel copy optimization. The replication parameters include batch size, replication thread count, heartbeat interval, etc., which can be dynamically adjusted according to network conditions. The synchronization process uses a pipeline design, allows data of multiple batches to be sent simultaneously, and significantly improves replication efficiency. Meanwhile, an intelligent catch-up mechanism is realized, and the latest data can be quickly synchronized by the lagged copies.
The present embodiment designs an innovative partition management strategy. The partition allocation table is realized by adopting a consistent hash ring, and the allocation equilibrium is improved by a virtual node technology. The consumption group management table records session information, partition allocation status and consumption sites of consumers. The site management adopts a two-stage commit protocol to ensure that the site management can be restored to a consistent state when a fault occurs. And meanwhile, incremental adjustment of consumer rebalancing is realized, and the influence of the rebalancing process on the service is minimized.
The embodiment constructs an intelligent load prediction system. Through multi-dimension index collection, a complete resource portrait is established. The collected indexes comprise CPU utilization rate, memory occupancy rate, disk IO and the like at the system level, and request queue length, processing delay and the like at the application level. The prediction model adopts an LSTM neural network, and captures the correlation among different indexes through an attention mechanism. The model training adopts a sliding window strategy, and the window size is dynamically adjusted according to the load change period.
The embodiment realizes an adaptive load balancing strategy. Calculating a load score based on the prediction result, wherein the score calculation formula is as follows:
Score = w1CPU + w2Memory + w3IO + w4Network,
Where wi is a weight factor and Network characterizes the Network broadband rate. When the load fraction of a certain node exceeds a threshold value, a load balancing operation is triggered. The balancing process considers the principle of data locality, preferentially migrates the load to the node with the data copy, and reduces the data transmission overhead.
Through the technical innovation, the key problems of low storage space management efficiency, poor data synchronization performance, unbalanced load and the like in the distributed storage system are effectively solved. In practical application, the scheme can support large-scale message storage and processing requirements, and the performance and reliability of the system are remarkably improved through intelligent partition management and predictive load balancing. The method is particularly suitable for scenes requiring high throughput and low delay, such as real-time data analysis, log collection and the like, and the overall performance of the storage system is improved through a multi-level optimization strategy. The system and innovation of the scheme can adapt to the application requirements of enterprises with different scales, and the comprehensive upgrading of the storage system is realized through continuous optimization and improvement.
In an embodiment of the kafka deposit separation data processing method of the present application, the following may be specifically included:
Step S401, analyzing message heads and message bodies of batch message data sent by a producer, counting data distribution characteristics of the message bodies, calculating data characteristic indexes such as entropy values, repeatability, numerical range and the like of the message bodies, establishing a compression algorithm evaluation matrix, calculating compression rates and processing time delays of different compression algorithms according to the data characteristic indexes, and selecting an optimal compression algorithm in the evaluation matrix;
Step S402, applying the selected compression algorithm to batch message data, generating a compressed byte stream, calculating CRC32 check code of the compressed byte stream, packing the compressed data length, the compression algorithm identification and the check code into metadata header, assembling the metadata header and the compressed byte stream into data blocks, and selecting available log fragments in a pre-write log area to write the data blocks.
Optionally, in this embodiment, a zero copy technique is used in the message parsing stage, and the message data is directly accessed through memory mapping. The message header analysis adopts a fixed-length field design, contains information such as message length, version number, theme, partition and the like, and rapidly locates the positions of all fields through displacement operation. The message body adopts variable Length coding, uses TLV (Type-Length-Value) format storage, supports a nested structure and adapts to a complex message format. The parsing process realizes parallel processing, and a plurality of parsing threads process different message batches simultaneously.
The embodiment realizes accurate data characteristic analysis. The entropy calculation adopts a sliding window method, and the window size can be dynamically adjusted according to the message size.
The entropy calculation formula is h= - Σ (pi×log2 (Pi)), where Pi represents the probability that the byte value i occurs. The repeatability analysis uses a modified LZ77 algorithm to identify repeated data segments by maintaining a sliding window of fixed size. The numerical range analysis adopts histogram statistics, divides the numerical distribution into a plurality of intervals, and calculates the data density of each interval.
The present embodiment designs an innovative compression algorithm evaluation mechanism. The evaluation matrix comprises a plurality of dimensions, such as compression rate, compression speed, decompression speed, memory occupation, and the like. Each dimension is assigned a weight coefficient, and the weight value can be dynamically adjusted according to the application scene. For example, in a scene with high real-time requirements, the compression speed weight increases accordingly. The evaluation process uses a sample testing method to extract representative samples from the message batch for compression testing.
The present embodiment builds an adaptive algorithm selection strategy. And establishing a decision tree model according to the message characteristics and the evaluation result. The branch nodes of the decision tree comprise entropy threshold, repetition threshold and other conditions, and the leaf nodes correspond to specific compression algorithms. By traversing the decision tree, the most appropriate compression algorithm is quickly selected. Meanwhile, an algorithm caching mechanism is realized, and message batches with similar characteristics can be multiplexed with the previous algorithm selection result.
The embodiment realizes an efficient compression processing flow. After the algorithm is selected, a multithreading parallel compression strategy is adopted. Message batches are fragmented according to a fixed size, with each thread being responsible for the compression task of one fragment. The compression result is transferred through the shared memory queue, so that data copying among threads is avoided. Meanwhile, the dynamic adjustment of the compression level is realized, and the proper compression level is selected according to the load condition of the CPU.
The present embodiment designs a reliable data verification mechanism. The CRC32 check is realized by hardware acceleration, and the calculation speed is obviously improved by utilizing the CRC32 instruction of the modern CPU. The verification process covers the compressed data and metadata header, ensuring the integrity of the data. Meanwhile, an incremental verification mechanism is realized, and when the data block is partially modified, only the verification value of the modified part is needed to be recalculated.
This embodiment builds a complete metadata management. The metadata header is in a compact binary format, reducing storage overhead by bit-domain techniques. The compression algorithm identification is represented by an enumeration value, and rapid algorithm identification is supported. The data length field adopts variable length coding, and is suitable for data blocks with different sizes. Meanwhile, the version control of the metadata is realized, and the backward compatibility of the metadata format is supported.
The embodiment realizes an optimized data write strategy. The selection of a log slice is based on a number of factors including the remaining space of the slice, the write load, the data locality, etc. The writing process adopts an asynchronous mode, a plurality of data blocks are accumulated through the writing buffer area and then written in batches, and the writing efficiency is improved. And meanwhile, a pre-writing mechanism is realized, and before the actual writing of the data, the meta information of the data block is written into a pre-writing log, so that the durability of the data is ensured.
The key problems in message compression processing, such as inaccurate compression algorithm selection, low compression efficiency, insufficient data integrity assurance and the like, are effectively solved through the technical innovation. In practical application, the scheme can intelligently select the optimal compression algorithm, and the throughput and the reliability of message processing are obviously improved through parallel processing and an optimized write strategy. The method is particularly suitable for large-scale message processing scenes, such as log collection, monitoring data processing and the like, and the overall performance of the message compression and storage system is improved through a multi-level optimization strategy. The systematic and innovative scheme can adapt to different types of message data, and the comprehensive improvement of the message processing efficiency is realized through continuous optimization and improvement.
In an embodiment of the kafka deposit separation data processing method of the present application, the following may be specifically included:
Step S501, distributing solid state disk storage space for log slicing, configuring storage parameters of the log slicing, setting storage buffer size and a disk brushing strategy, selecting slave nodes based on configuration information of a copy manager, establishing a copy channel between the master node and the slave node, sending log slicing data to the slave nodes according to batch size, and returning confirmation information to the master node after the slave nodes verify data integrity;
Step S502, establishing a message data uploading task queue, scanning the confirmed copied data area in the log partition according to a preset time interval, grouping and packaging the messages in the data area according to a time range, calculating a storage path of a data packet, asynchronously uploading the data packet to a storage barrel corresponding to a main storage area through an object storage interface, and updating a storage position index of the data packet.
Optionally, in this embodiment, a dynamic management policy is adopted in the solid state disk space allocation. Continuous physical blocks are distributed through a direct IO interface of the file system, and the additional overhead brought by the caching of the file system is avoided. The storage space is pre-allocated according to the fixed size, the initial size of each fragment is 1GB, and dynamic expansion is supported. The distribution process uses an asynchronous mode, a plurality of fragments can be created in parallel, and the resource initialization efficiency is improved.
The embodiment realizes the optimized storage parameter configuration. The storage buffer area adopts direct memory allocation, bypasses the JVM heap memory management, and each partition is provided with an independent writing buffer area and a pre-reading buffer area. The buffer size is dynamically adjusted according to the historical IO mode, the write buffer default is set to 64MB, and the pre-read buffer size is optimized based on the sequential read mode. The swiping strategy combines a time threshold and a data amount threshold, and triggers a swiping operation when the accumulated data exceeds the threshold or reaches a time interval.
This embodiment designs a reliable copy channel mechanism. The copy channel is realized based on the Netty framework and adopts a zero copy transmission technology. The channel establishment process includes identity authentication and capability negotiation, supporting SSL encryption and compression transmission. The data transmission adopts a pipeline design, so that a plurality of batches can be sent simultaneously, and the size of each batch is dynamically adjusted according to network conditions. And the sending rate is controlled through the sliding window, so that network congestion is avoided.
The embodiment constructs a complete data verification flow. After receiving the data from the node, the integrity of the data packet is first verified, including checksum verification and sequence number checking. After verification is passed, the data is written into the local storage, and the data reliability is ensured by adopting a double-writing mechanism. The confirmation message contains the batch ID and the writing position information, and is returned to the master node in an asynchronous mode, so that confirmation delay is reduced.
The embodiment realizes efficient uploading task management. The task queues adopt a priority queue structure, and the priorities are calculated based on time sensitivity and storage pressure of the data. The queue processing adopts a multithreading model, and the processing thread number is dynamically adjusted according to the CPU core number and the IO load. The task execution process realizes a failure retry mechanism, and an exponential backoff strategy is adopted in the retry interval.
The present embodiment designs an innovative data packing strategy. The grouping process considers multiple dimensions, time ranges, data volumes, access patterns, etc. The optimal packing size is determined through aggregation analysis, and the storage efficiency and the retrieval performance are balanced. The packing process supports parallel processing, and data in multiple time ranges can be packed simultaneously. Meanwhile, an incremental packaging mechanism is realized, and only the newly added data area is processed.
The embodiment constructs an intelligent storage path calculation scheme. The path generation takes the time attribute and the service attribute of the data into consideration, and adopts a multi-level directory structure. And the path calculation adopts a consistent hash algorithm, so that the related data are ensured to be stored in similar positions, and the batch reading efficiency is improved. Meanwhile, path conflict processing is realized, and the homonymous conflict is solved by adding the unique identifier.
The embodiment realizes a reliable asynchronous uploading mechanism. The object storage interface encapsulates retry and concurrency control logic supporting fragment upload and breakpoint resume. The uploading process adopts stream processing, and the transmission efficiency is improved through pipelining design. Meanwhile, the uploading progress tracking is realized, and the progress query and the cancel operation with fine granularity are supported. The selection of buckets is based on a lifecycle policy of the data, supporting automatic migration of the data between different storage tiers.
Through the technical innovation, the key problems of storage performance bottleneck, data replication delay, low uploading efficiency and the like in the distributed storage system are effectively solved. In practical application, the scheme can support large-scale data storage and rapid replication requirements, and the throughput and the reliability of the system are obviously improved through a multi-level optimization strategy. The method is particularly suitable for scenes requiring high-performance storage and disaster recovery, such as financial transactions, log archiving and the like, and the overall performance of the storage system is improved through an optimized storage strategy and a reliable replication mechanism. The system and innovation of the scheme can adapt to the application requirements of enterprises with different scales, and the comprehensive upgrading of the storage system is realized through continuous optimization and improvement.
In an embodiment of the kafka deposit separation data processing method of the present application, the following may be specifically included:
step S601, memory space is allocated for a message cache region, a mapping relation between a message position point and a storage address is constructed by a cache index table, a capacity upper limit and an elimination strategy of the cache region are set, a hot spot message scoring mechanism is established, a message hotness score is calculated based on the access frequency and time attenuation factor of the message, and message data with the hotness score exceeding a preset threshold value is loaded to the cache region;
And step S602, monitoring read-write load data of the message partition, training a prediction model by the load predictor based on the historical load data, calculating a load trend in a future time window, triggering partition capacity expansion when the predicted load exceeds a preset threshold value, redistributing the message data in the existing partition to a new partition according to a consistent hash algorithm, and updating a partition routing table.
Optionally, the present embodiment employs a multi-level cache architecture on memory space allocation. And the message buffer area is managed through the out-of-heap memory pool, so that the GC pressure is avoided. The memory pool adopts a sectional lock design, the whole memory space is divided into a plurality of areas, each area is controlled by an independent lock, and thread competition is reduced. The allocation granularity of the buffer area is dynamically adjusted according to the size of the message, and the memory block management is carried out by adopting a partner algorithm, so that the memory fragments are effectively reduced.
The embodiment realizes an efficient cache index structure. The method adopts a three-level index design, wherein the first level is mapping from a location to a partition, quick searching is realized by using a jump table, the second level is a message index in the partition, a B+ tree structure is adopted, and the third level is memory address mapping of specific messages and uses a direct addressing table. The multi-level index structure can realize the O (log n) searching efficiency in mass information and support the range query operation.
The present embodiment designs an innovative message popularity scoring mechanism. The hotness score calculation formula is:
Score=f×e (- λt) ×w, where F is the access frequency, λ is the time decay coefficient, t is the time interval from the last access, and W is the message weight. The weighting factors take into account the traffic priority and data size of the message. The scoring process adopts sliding window statistics, the window size can be dynamically adjusted according to the business characteristics, and the real-time heat change of the message is accurately reflected.
The embodiment constructs an intelligent cache elimination strategy. And combining the LRU-K algorithm and the ARC algorithm to realize an adaptive elimination mechanism. Two cache queues, a frequent access queue and a general access queue, are maintained. The message enters a general queue when being accessed for the first time, and is upgraded to a frequent queue when the access times exceed a threshold value. The size proportion of the two queues is dynamically adjusted through the access mode, so that the optimal utilization of the cache space is realized.
The embodiment realizes an accurate load monitoring system. And acquiring multi-dimensional load indexes such as message throughput, delay distribution, queue depth and the like. The data acquisition adopts a self-adaptive sampling strategy, and the sampling frequency is increased when the load change is remarkable. And storing historical load data through a time sequence database, establishing a multi-dimensional load portrait, and providing training data for the prediction model.
The present embodiment designs an advanced load prediction model. With LSTM neural network architecture, the input features include historical load metrics, temporal features (hours, weeks, holidays) and system events. The model captures the correlation among different characteristics through a multi-head attention mechanism and accurately predicts the short-term and medium-term load trend. The training process adopts an online learning mode, continuously optimizes model parameters and adapts to the dynamic change of a load mode.
The embodiment constructs a reliable partition capacity expansion mechanism. The capacity expansion triggering condition comprehensively considers a plurality of factors such as predicted load, resource utilization rate, performance index and the like. The expansion process adopts a progressive strategy, a new partition is firstly created and preheated, and then message data is redistributed through a consistent hash algorithm. The hash algorithm uses a virtual node technique to ensure data distribution uniformity while minimizing data migration.
The embodiment realizes an efficient data migration flow. And in the migration process, a batch processing mode is adopted, and a migration task queue is set to manage migration requests. Migration efficiency is improved through multi-thread parallel processing, and each thread is responsible for data migration of one batch. Meanwhile, a migration progress tracking and fault recovery mechanism is realized, and the reliability of a migration process is ensured.
The key problems of low cache hit rate, inaccurate load prediction, poor capacity expansion efficiency and the like in the distributed message system are effectively solved through the technical innovation. In practical application, the scheme can intelligently identify hot spot messages and optimize a caching strategy, and the response speed and expansibility of the system are remarkably improved through accurate load prediction and an efficient capacity expansion mechanism. The method is particularly suitable for scenes with obvious access hotspots, such as social media pushing, real-time recommendation and the like, and the overall performance of the message system is improved through a multi-level optimization strategy. The systematicness and innovation of the scheme can adapt to the application requirements of enterprises with different scales, and the comprehensive upgrading of the message processing system is realized through continuous optimization and improvement.
In an embodiment of the kafka deposit separation data processing method of the present application, the following may be specifically included:
Step S701, a consumer sends a site query request to a coordinator, the coordinator returns storage position information corresponding to a message site, the consumer searches target message data in a message cache area according to the storage position information, if the target message data does not exist in the cache area, a corresponding data packet is acquired from a main storage area, a metadata header in the data packet is read to analyze a compression algorithm identifier, and a corresponding decompression method is called to restore the message data;
step S702, the check code of the decompressed message data is calculated and compared with the check code recorded in the metadata header, the message data is returned to the consumer after the check is passed, the consumer submits the consumption site to the coordinator after processing the message, the coordinator records the consumption site to the site management table, and the site management table is regularly persisted to the storage system.
Optionally, the present embodiment employs a multi-level caching mechanism on the bit-point query processing. The coordinator maintains the memory cache of the site map, and quick searching is realized by adopting ConcurrentHashMap. The caches are organized by consuming groups and partitions, using a double-layer hash table structure. For the location information of the hot spot query, the access pressure of the coordinator is reduced through the local cache. The consistency of the cache is maintained through a version number mechanism, and when the sites are updated, the local cache is informed to each consumer through a broadcasting mechanism.
The embodiment realizes intelligent storage position analysis. The storage position information adopts a compression coding format and comprises information such as storage hierarchy, file path, offset and the like. The parsing process uses a state machine design that is capable of handling different versions of the encoding format. By means of the locality principle of the location information, the possible accessed adjacent message locations are predicted and prefetched, and subsequent query delays are reduced.
The present embodiment designs an efficient cache lookup strategy. The multi-level index is adopted in the message buffer area, whether the message possibly exists or not is judged rapidly through a bloom filter, and then a specific position is located through a skip list. And starting asynchronous preloading when the cache is not hit, predicting and loading the message which is possibly accessed later according to the access mode of the message, and improving the cache hit rate.
The present embodiment builds a reliable packet read mechanism. When the data packet is acquired from the main storage area, an asynchronous IO mode is adopted, and the concurrent reading quantity is controlled through the thread pool. Reading of the data packet supports a range request, acquiring only the necessary data portion. For large data packets, a slicing reading strategy is realized, and excessive data is prevented from being loaded to a memory at one time.
The embodiment realizes the optimized decompression processing flow. The selection of decompression algorithms supports a variety of compression algorithms such as LZ4, snapy, ZSTD, etc. by fast positioning through the identification in the metadata header. The decompression process adopts thread pool management, and for a large number of small messages, a batch decompression strategy is adopted to improve efficiency. And meanwhile, decompression caching is realized, and the same compressed data block is decompressed only once.
The embodiment designs a strict data checking flow. The check code calculation adopts a hardware-accelerated CRC32 algorithm, and the calculation speed is improved through SIMD instructions. The verification process overlays the complete contents of the message, including metadata and payload data. When the verification fails, a multi-stage retry mechanism is implemented, first attempting to re-read the data, and if the verification fails continuously, acquiring the data from the replica node.
The present embodiment builds a reliable site submission mechanism. Consumers submit sites asynchronously, accumulating a certain number of sites through a local buffer, and then submitting the sites in batches. The commit process achieves idempotent, and repeated commit of the same site does not lead to errors. The site submission adopts a two-stage submission protocol, so that the consistency in a distributed environment is ensured.
The embodiment realizes an efficient site management strategy. The site management table adopts a segmented lock design, so that lock competition of concurrent updating is reduced. The table records detailed location information including consumer group ID, topic, partition, location value, timestamp, etc. The persistence process adopts an incremental mode, and only the changed site information is saved. The reliability of the site data is ensured through a pre-write log mechanism.
By the technical innovation, the key problems in the consumption of the distributed message, such as high delay of site inquiry, low data decompression efficiency, unreliable site management and the like, are effectively solved. In practical application, the scheme can support a large-scale message consumption scene, and the efficiency and the reliability of message consumption are obviously improved through multi-level caching and optimized data processing flow. The method is particularly suitable for scenes requiring high throughput and low delay, such as real data processing, event stream processing and the like, and the overall performance of the message consumption system is improved through a multi-level optimization strategy. The systematicness and innovation of the scheme can adapt to the application requirements of enterprises with different scales, and the comprehensive upgrading of the message consumption system is realized through continuous optimization and improvement.
In order to effectively solve the shortcomings of the conventional technology in terms of memory coupling, memory efficiency, load balancing and the like, and significantly improve the performance and expandability of a message queue system, the present application provides an embodiment of a kafka memory separated data processing apparatus for implementing all or part of the contents of the kafka memory separated data processing method, and referring to fig. 2, the kafka memory separated data processing apparatus specifically includes the following contents:
The architecture construction module 10 is configured to construct a multi-layer data processing architecture, where the multi-layer data processing architecture includes a network communication layer, a task distribution layer, a processing scheduling layer, and a storage layer, the storage layer includes a flow repository, the flow repository includes a pre-write log area and a main storage area, the pre-write log area is divided into a plurality of log slices, the main storage area implements data persistence through an object storage service, the processing scheduling layer includes a copy manager, a coordinator, and a load predictor, the copy manager is responsible for data copy synchronization, the coordinator is responsible for allocation and group management, and the load predictor builds a prediction model based on a system resource utilization rate;
The message storage module 20 is configured to receive batch message data sent by a producer, calculate data characteristics of the batch message data, select a compression algorithm according to the data characteristics, perform compression encoding on the batch message data to generate a check code, write the compressed message data into log slices in the pre-written log area, store the log slices by using a solid state disk, establish a master-slave copy relationship between a plurality of log slices according to configuration of the copy manager, and asynchronously upload the message data in the log slices to the main storage area;
And the message cache module 30 is used for creating a message cache area, the message cache area is used for storing hot message data, the number of message partitions is adjusted according to the prediction result of the load predictor, a consumer reads the hot message data from the message cache area based on a message location, when no target message data exists in the message cache area, the consumer acquires cold data from the main storage area according to the message location and decompresses the cold data, the coordinator records the consumption progress of the consumer, and the data integrity is verified based on the check code.
As can be seen from the above description, the kafka memory and storage separated data processing apparatus provided by the embodiments of the present application is capable of constructing a multi-layer data processing architecture including four layers of network communication, task distribution, process scheduling and storage. And the cold and hot data separation is realized through the layered design of the pre-written log area and the main storage area, and the intelligent scheduling of system resources is realized by combining a copy manager, a coordinator and a load predictor. And the self-adaptive compression algorithm and the checking mechanism are adopted to ensure the data transmission efficiency and integrity, a message buffer area is created to store hot spot data, and flexible access of cold and hot data is realized based on message sites. The method effectively solves the defects of the traditional technology in terms of memory coupling, storage efficiency, load balancing and the like, and remarkably improves the performance and expandability of the message queue system.
In order to effectively solve the shortcomings of the traditional technology in terms of memory coupling, memory efficiency, load balancing and the like and significantly improve the performance and expandability of a message queue system, the application provides an embodiment of an electronic device for realizing all or part of contents in the data processing method for kafka memory separation, wherein the electronic device specifically comprises the following contents:
The device comprises a processor (processor), a memory (memory), a communication interface (Communications Interface) and a bus, wherein the processor, the memory and the communication interface are used for completing communication among the processor, the memory and the communication interface through the bus, the communication interface is used for realizing information transmission between a data processing device with separate kafka memory and related equipment such as a core service system, a user terminal and a related database, and the logic controller can be a desktop computer, a tablet personal computer, a mobile terminal and the like. In this embodiment, the logic controller may be implemented with reference to the embodiment of the data processing method of kafka calculation separation in the embodiment and the embodiment of the data processing apparatus of kafka calculation separation, and the contents thereof are incorporated herein, and the repetition is omitted.
It is understood that the user terminal may include a smart phone, a tablet electronic device, a network set top box, a portable computer, a desktop computer, a Personal Digital Assistant (PDA), a vehicle-mounted device, a smart wearable device, etc. Wherein, intelligent wearing equipment can include intelligent glasses, intelligent wrist-watch, intelligent bracelet etc..
In practical applications, the kafka stores part of the separate data processing method, which may be executed on the electronic device side as described above, or all operations may be performed in the client device. Specifically, the selection may be made according to the processing capability of the client device, and restrictions of the use scenario of the user. The application is not limited in this regard. If all operations are performed in the client device, the client device may further include a processor.
The client device may have a communication module (i.e. a communication unit) and may be connected to a remote server in a communication manner, so as to implement data transmission with the server. The server may include a server on the side of the task scheduling center, and in other implementations may include a server of an intermediate platform, such as a server of a third party server platform having a communication link with the task scheduling center server. The server may include a single computer device, a server cluster formed by a plurality of servers, or a server structure of a distributed device.
Fig. 3 is a schematic block diagram of a system configuration of an electronic device 9600 according to an embodiment of the present application. As shown in fig. 3, the electronic device 9600 can include a central processor 9100 and a memory 9140, the memory 9140 being coupled to the central processor 9100. It is noted that this fig. 3 is exemplary, and that other types of structures may be used in addition to or in place of the structures to implement telecommunications functions or other functions.
In one embodiment, the kafka memory separate data processing method functions may be integrated into the central processor 9100. The central processor 9100 may be configured to perform the following control:
Step S101, constructing a multi-layer data processing architecture, wherein the multi-layer data processing architecture comprises a network communication layer, a task distribution layer, a processing scheduling layer and a storage layer, the storage layer comprises a stream storage library, the stream storage library comprises a pre-write log area and a main storage area, a plurality of log fragments are divided in the pre-write log area, the main storage area realizes data persistence through object storage service, the processing scheduling layer comprises a copy manager, a coordinator and a load predictor, the copy manager is responsible for data copy synchronization, the coordinator is responsible for distribution and group management, and the load predictor is used for establishing a prediction model based on the utilization rate of system resources;
Step S102, receiving batch message data sent by a producer, calculating data characteristics of the batch message data, selecting a compression algorithm according to the data characteristics, executing compression coding on the batch message data to generate a check code, writing the compressed message data into log fragments in the pre-written log area, storing the log fragments by adopting a solid state disk, establishing a master-slave copy relationship among a plurality of log fragments according to configuration of a copy manager, and asynchronously uploading the message data in the log fragments to the main storage area;
And step 103, creating a message buffer area, wherein the message buffer area is used for storing hot message data, regulating the number of message partitions according to the prediction result of the load predictor, reading the hot message data from the message buffer area by a consumer based on a message position, acquiring cold data from the main storage area and decompressing according to the message position by the consumer when target message data does not exist in the message buffer area, recording the consumption progress of the consumer by the coordinator, and verifying the data integrity based on the check code.
As can be seen from the above description, the electronic device provided by the embodiment of the present application constructs a multi-layer data processing architecture, including four layers of network communication, task distribution, processing scheduling and storage. And the cold and hot data separation is realized through the layered design of the pre-written log area and the main storage area, and the intelligent scheduling of system resources is realized by combining a copy manager, a coordinator and a load predictor. And the self-adaptive compression algorithm and the checking mechanism are adopted to ensure the data transmission efficiency and integrity, a message buffer area is created to store hot spot data, and flexible access of cold and hot data is realized based on message sites. The method effectively solves the defects of the traditional technology in terms of memory coupling, storage efficiency, load balancing and the like, and remarkably improves the performance and expandability of the message queue system.
In another embodiment, the data processing apparatus with the kafka calculation separation may be configured separately from the central processor 9100, for example, the data processing apparatus with the kafka calculation separation may be configured as a chip connected to the central processor 9100, and the data processing method function with the kafka calculation separation is implemented by control of the central processor.
As shown in fig. 3, the electronic device 9600 may further include a communication module 9110, an input unit 9120, an audio processor 9130, a display 9160, and a power supply 9170. It is noted that the electronic device 9600 does not necessarily include all the components shown in fig. 3, and furthermore, the electronic device 9600 may include components not shown in fig. 3, to which reference is made in the prior art.
As shown in fig. 3, the central processor 9100, sometimes referred to as a controller or operational control, may include a microprocessor or other processor device and/or logic device, which central processor 9100 receives inputs and controls the operation of the various components of the electronic device 9600.
The memory 9140 may be, for example, one or more of a buffer, a flash memory, a hard drive, a removable media, a volatile memory, a non-volatile memory, or other suitable device. The information about failure may be stored, and a program for executing the information may be stored. And the central processor 9100 can execute the program stored in the memory 9140 to realize information storage or processing, and the like.
The input unit 9120 provides input to the central processor 9100. The input unit 9120 is, for example, a key or a touch input device. The power supply 9170 is used to provide power to the electronic device 9600. The display 9160 is used for displaying display objects such as images and characters. The display may be, for example, but not limited to, an LCD display.
The memory 9140 may be a solid state memory such as Read Only Memory (ROM), random Access Memory (RAM), SIM card, etc. But also a memory which holds information even when powered down, can be selectively erased and provided with further data, an example of which is sometimes referred to as EPROM or the like. The memory 9140 may also be some other type of device. The memory 9140 includes a buffer memory 9141 (sometimes referred to as a buffer). The memory 9140 may include an application/function storage portion 9142, the application/function storage portion 9142 storing application programs and function programs or a flow for executing operations of the electronic device 9600 by the central processor 9100.
The memory 9140 may also include a data store 9143, the data store 9143 for storing data, such as contacts, digital data, pictures, sounds, and/or any other data used by an electronic device. The driver storage portion 9144 of the memory 9140 may include various drivers of the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging applications, address book applications, etc.).
The communication module 9110 is a transmitter/receiver that transmits and receives signals via the antenna 9111. The communication module 9110 (transmitter/receiver) is coupled to the central processor 9100 to provide input signals and receive output signals, as in the case of conventional mobile communication terminals.
Based on different communication technologies, a plurality of communication modules 9110, such as a cellular network module, a bluetooth module, and/or a wireless local area network module, etc., may be provided in the same electronic device. The communication module 9110 (transmitter/receiver) is also coupled to a speaker 9131 and a microphone 9132 via an audio processor 9130 to provide audio output via the speaker 9131 and to receive audio input from the microphone 9132 to implement usual telecommunications functions. The audio processor 9130 can include any suitable buffers, decoders, amplifiers and so forth. In addition, the audio processor 9130 is also coupled to the central processor 9100 so that sound can be recorded locally through the microphone 9132 and sound stored locally can be played through the speaker 9131.
An embodiment of the present application also provides a computer-readable storage medium capable of realizing all steps in the data processing method in which the execution subject is the kafka calculation separation of the server or the client in the above embodiment, the computer-readable storage medium having stored thereon a computer program which, when executed by a processor, realizes all steps in the data processing method in which the execution subject is the kafka calculation separation of the server or the client in the above embodiment, for example, the processor realizes the following steps when executing the computer program:
Step S101, constructing a multi-layer data processing architecture, wherein the multi-layer data processing architecture comprises a network communication layer, a task distribution layer, a processing scheduling layer and a storage layer, the storage layer comprises a stream storage library, the stream storage library comprises a pre-write log area and a main storage area, a plurality of log fragments are divided in the pre-write log area, the main storage area realizes data persistence through object storage service, the processing scheduling layer comprises a copy manager, a coordinator and a load predictor, the copy manager is responsible for data copy synchronization, the coordinator is responsible for distribution and group management, and the load predictor is used for establishing a prediction model based on the utilization rate of system resources;
Step S102, receiving batch message data sent by a producer, calculating data characteristics of the batch message data, selecting a compression algorithm according to the data characteristics, executing compression coding on the batch message data to generate a check code, writing the compressed message data into log fragments in the pre-written log area, storing the log fragments by adopting a solid state disk, establishing a master-slave copy relationship among a plurality of log fragments according to configuration of a copy manager, and asynchronously uploading the message data in the log fragments to the main storage area;
And step 103, creating a message buffer area, wherein the message buffer area is used for storing hot message data, regulating the number of message partitions according to the prediction result of the load predictor, reading the hot message data from the message buffer area by a consumer based on a message position, acquiring cold data from the main storage area and decompressing according to the message position by the consumer when target message data does not exist in the message buffer area, recording the consumption progress of the consumer by the coordinator, and verifying the data integrity based on the check code.
From the above description, it can be seen that the computer readable storage medium provided by the embodiments of the present application includes four levels of network communication, task distribution, process scheduling and storage by constructing a multi-layer data processing architecture. And the cold and hot data separation is realized through the layered design of the pre-written log area and the main storage area, and the intelligent scheduling of system resources is realized by combining a copy manager, a coordinator and a load predictor. And the self-adaptive compression algorithm and the checking mechanism are adopted to ensure the data transmission efficiency and integrity, a message buffer area is created to store hot spot data, and flexible access of cold and hot data is realized based on message sites. The method effectively solves the defects of the traditional technology in terms of memory coupling, storage efficiency, load balancing and the like, and remarkably improves the performance and expandability of the message queue system.
The embodiment of the present application also provides a computer program product capable of realizing all the steps in the data processing method of kafka calculation separation in which the execution subject is a server or a client in the above embodiment, the computer program/instructions realizing the steps of the data processing method of kafka calculation separation when executed by a processor, for example, the computer program/instructions realizing the steps of:
Step S101, constructing a multi-layer data processing architecture, wherein the multi-layer data processing architecture comprises a network communication layer, a task distribution layer, a processing scheduling layer and a storage layer, the storage layer comprises a stream storage library, the stream storage library comprises a pre-write log area and a main storage area, a plurality of log fragments are divided in the pre-write log area, the main storage area realizes data persistence through object storage service, the processing scheduling layer comprises a copy manager, a coordinator and a load predictor, the copy manager is responsible for data copy synchronization, the coordinator is responsible for distribution and group management, and the load predictor is used for establishing a prediction model based on the utilization rate of system resources;
Step S102, receiving batch message data sent by a producer, calculating data characteristics of the batch message data, selecting a compression algorithm according to the data characteristics, executing compression coding on the batch message data to generate a check code, writing the compressed message data into log fragments in the pre-written log area, storing the log fragments by adopting a solid state disk, establishing a master-slave copy relationship among a plurality of log fragments according to configuration of a copy manager, and asynchronously uploading the message data in the log fragments to the main storage area;
And step 103, creating a message buffer area, wherein the message buffer area is used for storing hot message data, regulating the number of message partitions according to the prediction result of the load predictor, reading the hot message data from the message buffer area by a consumer based on a message position, acquiring cold data from the main storage area and decompressing according to the message position by the consumer when target message data does not exist in the message buffer area, recording the consumption progress of the consumer by the coordinator, and verifying the data integrity based on the check code.
From the above description, it can be seen that the computer program product provided by the embodiments of the present application includes four levels of network communication, task distribution, process scheduling and storage by constructing a multi-level data processing architecture. And the cold and hot data separation is realized through the layered design of the pre-written log area and the main storage area, and the intelligent scheduling of system resources is realized by combining a copy manager, a coordinator and a load predictor. And the self-adaptive compression algorithm and the checking mechanism are adopted to ensure the data transmission efficiency and integrity, a message buffer area is created to store hot spot data, and flexible access of cold and hot data is realized based on message sites. The method effectively solves the defects of the traditional technology in terms of memory coupling, storage efficiency, load balancing and the like, and remarkably improves the performance and expandability of the message queue system.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the principles and embodiments of the present invention have been described in detail in the foregoing application of the principles and embodiments of the present invention, the above examples are provided for the purpose of aiding in the understanding of the principles and concepts of the present invention and may be varied in many ways by those of ordinary skill in the art in light of the teachings of the present invention, and the above descriptions should not be construed as limiting the invention.

Claims (10)

1.一种kafka存算分离的数据处理方法,其特征在于,所述方法包括:1. A data processing method for Kafka storage and computing separation, characterized in that the method comprises: 构建多层数据处理架构,所述多层数据处理架构包括网络通信层、任务分发层、处理调度层及存储层,所述存储层包括流存储库,所述流存储库包括预写日志区与主存储区,在所述预写日志区中划分多个日志分片,所述主存储区通过对象存储服务实现数据持久化,所述处理调度层包括副本管理器、协调器及负载预测器,所述副本管理器负责数据副本同步,所述协调器负责分区分配与组管理,所述负载预测器基于系统资源使用率建立预测模型;Construct a multi-layer data processing architecture, the multi-layer data processing architecture includes a network communication layer, a task distribution layer, a processing scheduling layer and a storage layer, the storage layer includes a stream storage library, the stream storage library includes a pre-write log area and a main storage area, a plurality of log shards are divided in the pre-write log area, the main storage area realizes data persistence through an object storage service, the processing scheduling layer includes a replica manager, a coordinator and a load predictor, the replica manager is responsible for data replica synchronization, the coordinator is responsible for partition allocation and group management, and the load predictor establishes a prediction model based on system resource utilization; 接收生产者发送的批量消息数据,计算所述批量消息数据的数据特征,根据所述数据特征选择压缩算法,对所述批量消息数据执行压缩编码生成校验码,将压缩后的消息数据写入所述预写日志区中的日志分片,所述日志分片采用固态硬盘存储,根据所述副本管理器的配置在多个所述日志分片之间建立主从复制关系,将所述日志分片中的消息数据异步上传至所述主存储区;Receive batch message data sent by the producer, calculate the data features of the batch message data, select a compression algorithm according to the data features, perform compression encoding on the batch message data to generate a checksum, write the compressed message data into the log shards in the pre-write log area, the log shards are stored in solid-state hard disks, establish a master-slave replication relationship between the multiple log shards according to the configuration of the replica manager, and asynchronously upload the message data in the log shards to the main storage area; 创建消息缓存区,所述消息缓存区用于存储热点消息数据,根据所述负载预测器的预测结果调整消息分区数量,消费者基于消息位点从所述消息缓存区读取所述热点消息数据,当所述消息缓存区中不存在目标消息数据时,所述消费者根据消息位点从所述主存储区获取冷数据并解压,所述协调器记录消费者的消费进度,基于所述校验码验证数据完整性。A message buffer is created, where the message buffer is used to store hot message data. The number of message partitions is adjusted according to the prediction result of the load predictor. The consumer reads the hot message data from the message buffer based on the message site. When the target message data does not exist in the message buffer, the consumer obtains and decompresses the cold data from the main storage area according to the message site. The coordinator records the consumer's consumption progress and verifies the data integrity based on the check code. 2.根据权利要求1所述的kafka存算分离的数据处理方法,其特征在于,所述构建多层数据处理架构,所述多层数据处理架构包括网络通信层、任务分发层、处理调度层及存储层,所述存储层包括流存储库,所述流存储库包括预写日志区与主存储区,包括:2. The data processing method for separating storage and computing of Kafka according to claim 1 is characterized in that a multi-layer data processing architecture is constructed, the multi-layer data processing architecture includes a network communication layer, a task distribution layer, a processing scheduling layer and a storage layer, the storage layer includes a stream storage library, the stream storage library includes a pre-write log area and a main storage area, including: 将数据处理架构划分为调用链路,在网络通信层设置套接字监听器以处理客户端连接请求,创建消息传输通道并配置传输参数,在任务分发层构建路由表以映射处理器与消息类型的对应关系,建立任务队列,设置任务优先级调度机制,根据消息类型将任务分发至对应的处理器;Divide the data processing architecture into call links, set up socket listeners in the network communication layer to handle client connection requests, create message transmission channels and configure transmission parameters, build routing tables in the task distribution layer to map the correspondence between processors and message types, establish task queues, set up task priority scheduling mechanisms, and distribute tasks to corresponding processors according to message types; 在处理调度层部署副本管理器实现分区主从复制,配置协调器管理分区分配与消费组,部署负载预测器采集系统资源使用数据,在存储层创建流存储库并划分预写日志区与主存储区,所述预写日志区采用固态硬盘存储介质,所述主存储区采用对象存储服务作为持久化存储。Deploy a replica manager at the processing and scheduling layer to implement partition master-slave replication, configure a coordinator to manage partition allocation and consumer groups, deploy a load predictor to collect system resource usage data, create a stream repository at the storage layer and divide the pre-write log area and the main storage area. The pre-write log area uses a solid-state hard disk storage medium, and the main storage area uses an object storage service as persistent storage. 3.根据权利要求1所述的kafka存算分离的数据处理方法,其特征在于,所述在所述预写日志区中划分多个日志分片,所述主存储区通过对象存储服务实现数据持久化,所述处理调度层包括副本管理器、协调器及负载预测器,所述副本管理器负责数据副本同步,所述协调器负责分区分配与组管理,所述负载预测器基于系统资源使用率建立预测模型,包括:3. The data processing method for Kafka storage and computing separation according to claim 1 is characterized in that the pre-write log area is divided into multiple log shards, the main storage area realizes data persistence through the object storage service, the processing scheduling layer includes a replica manager, a coordinator and a load predictor, the replica manager is responsible for data replica synchronization, the coordinator is responsible for partition allocation and group management, and the load predictor establishes a prediction model based on system resource utilization, including: 基于预写日志区的总容量计算日志分片数量,为每个日志分片分配唯一标识符,将日志分片按照时间顺序组织成链表结构,在日志分片中设置消息偏移量与时间戳索引,配置日志分片的预留空间阈值,建立日志分片与物理存储设备的映射关系,所述主存储区按照时间范围划分存储桶,为每个存储桶创建元数据索引;The number of log shards is calculated based on the total capacity of the pre-write log area, a unique identifier is assigned to each log shard, the log shards are organized into a linked list structure in chronological order, a message offset and a timestamp index are set in the log shards, a reserved space threshold of the log shards is configured, a mapping relationship between the log shards and the physical storage device is established, the main storage area is divided into buckets according to the time range, and a metadata index is created for each bucket; 配置副本管理器的主从复制参数,建立分区副本的同步机制,在协调器中构建分区分配表与消费组管理表,记录消费组成员信息与消费进度,所述负载预测器采集处理器使用率、内存占用率、磁盘读写速率系统资源指标,构建时序预测模型计算资源使用趋势,基于预测结果生成负载均衡策略。Configure the master-slave replication parameters of the replica manager, establish a synchronization mechanism for partition replicas, build a partition allocation table and a consumer group management table in the coordinator, record consumer group member information and consumption progress, and the load predictor collects processor usage, memory occupancy, disk read and write rate system resource indicators, builds a time series prediction model to calculate resource usage trends, and generates a load balancing strategy based on the prediction results. 4.根据权利要求1所述的kafka存算分离的数据处理方法,其特征在于,所述接收生产者发送的批量消息数据,计算所述批量消息数据的数据特征,根据所述数据特征选择压缩算法,对所述批量消息数据执行压缩编码生成校验码,将压缩后的消息数据写入所述预写日志区中的日志分片,包括:4. The data processing method for Kafka storage and computing separation according to claim 1 is characterized in that the receiving of batch message data sent by the producer, calculating the data features of the batch message data, selecting a compression algorithm according to the data features, performing compression encoding on the batch message data to generate a checksum, and writing the compressed message data into the log shard in the write-ahead log area comprises: 解析生产者发送的批量消息数据的消息头与消息体,统计消息体的数据分布特征,计算消息体的熵值、重复度、数值范围数据特征指标,建立压缩算法评估矩阵,根据数据特征指标计算不同压缩算法的压缩率与处理时延,在评估矩阵中选择最优压缩算法;Parse the message header and message body of the batch message data sent by the producer, count the data distribution characteristics of the message body, calculate the entropy value, repetition, and value range data characteristic indicators of the message body, establish a compression algorithm evaluation matrix, calculate the compression rate and processing delay of different compression algorithms based on the data characteristic indicators, and select the optimal compression algorithm in the evaluation matrix; 将选定的压缩算法应用于批量消息数据,生成压缩字节流,计算压缩字节流的CRC32校验码,将压缩数据长度、压缩算法标识、校验码打包为元数据头,将元数据头与压缩字节流组装为数据块,在预写日志区中选择可用日志分片写入所述数据块。Apply the selected compression algorithm to the batch message data to generate a compressed byte stream, calculate the CRC32 check code of the compressed byte stream, package the compressed data length, compression algorithm identifier, and check code into a metadata header, assemble the metadata header and the compressed byte stream into a data block, and select an available log segment in the pre-write log area to write the data block. 5.根据权利要求1所述的kafka存算分离的数据处理方法,其特征在于,所述日志分片采用固态硬盘存储,根据所述副本管理器的配置在多个所述日志分片之间建立主从复制关系,将所述日志分片中的消息数据异步上传至所述主存储区,包括:5. The data processing method for Kafka storage and computing separation according to claim 1 is characterized in that the log shards are stored in solid-state hard disks, a master-slave replication relationship is established between the multiple log shards according to the configuration of the replica manager, and the message data in the log shards are asynchronously uploaded to the main storage area, including: 为日志分片分配固态硬盘存储空间,配置日志分片的存储参数,设置存储缓冲区大小与刷盘策略,基于副本管理器的配置信息选择从节点,在主从节点间建立复制通道,将日志分片数据按照批次大小发送至从节点,在从节点校验数据完整性后向主节点返回确认信息;Allocate SSD storage space for log shards, configure storage parameters for log shards, set storage buffer size and flushing strategy, select slave nodes based on the configuration information of the replica manager, establish replication channels between master and slave nodes, send log shard data to slave nodes according to batch size, and return confirmation information to the master node after the slave node verifies data integrity; 建立消息数据上传任务队列,按照预设的时间间隔扫描日志分片中已确认复制的数据区域,将数据区域中的消息按照时间范围分组打包,计算数据包的存储路径,通过对象存储接口将数据包异步上传至主存储区对应的存储桶,更新数据包的存储位置索引。Establish a message data upload task queue, scan the confirmed replicated data area in the log shard at preset time intervals, group and package the messages in the data area according to the time range, calculate the storage path of the data packet, asynchronously upload the data packet to the storage bucket corresponding to the main storage area through the object storage interface, and update the storage location index of the data packet. 6.根据权利要求1所述的kafka存算分离的数据处理方法,其特征在于,所述创建消息缓存区,所述消息缓存区用于存储热点消息数据,根据所述负载预测器的预测结果调整消息分区数量,包括:6. The data processing method for Kafka storage and computing separation according to claim 1 is characterized in that the creation of a message buffer area, the message buffer area is used to store hot message data, and the number of message partitions is adjusted according to the prediction result of the load predictor, including: 为消息缓存区分配内存空间,构建缓存索引表记录消息位点与存储地址的映射关系,设置缓存区的容量上限与淘汰策略,建立热点消息评分机制,基于消息的访问频率与时间衰减因子计算消息热度得分,将热度得分超过预设阈值的消息数据加载至缓存区;Allocate memory space for the message cache, build a cache index table to record the mapping relationship between message locations and storage addresses, set the cache capacity limit and elimination strategy, establish a hot message scoring mechanism, calculate the message heat score based on the message access frequency and time decay factor, and load the message data whose heat score exceeds the preset threshold into the cache; 监控消息分区的读写负载数据,所述负载预测器基于历史负载数据训练预测模型,计算未来时间窗口内的负载趋势,当预测负载超过预设阈值时触发分区扩容,将现有分区中的消息数据按照一致性哈希算法重新分配至新增分区,更新分区路由表。Monitor the read and write load data of the message partition. The load predictor trains a prediction model based on historical load data, calculates the load trend in the future time window, triggers partition expansion when the predicted load exceeds a preset threshold, redistributes the message data in the existing partition to the newly added partition according to the consistent hashing algorithm, and updates the partition routing table. 7.根据权利要求1所述的kafka存算分离的数据处理方法,其特征在于,所述消费者基于消息位点从所述消息缓存区读取所述热点消息数据,当所述消息缓存区中不存在目标消息数据时,所述消费者根据消息位点从所述主存储区获取冷数据并解压,所述协调器记录消费者的消费进度,基于所述校验码验证数据完整性,包括:7. The data processing method for Kafka storage and computing separation according to claim 1 is characterized in that the consumer reads the hot message data from the message buffer area based on the message site. When the target message data does not exist in the message buffer area, the consumer obtains the cold data from the main storage area according to the message site and decompresses it. The coordinator records the consumer's consumption progress and verifies the data integrity based on the check code, including: 消费者向协调器发送位点查询请求,协调器返回消息位点对应的存储位置信息,消费者根据存储位置信息在消息缓存区中查找目标消息数据,若目标消息数据不存在于缓存区,则从主存储区获取对应的数据包,读取数据包中的元数据头解析压缩算法标识,调用对应的解压缩方法还原消息数据;The consumer sends a location query request to the coordinator, and the coordinator returns the storage location information corresponding to the message location. The consumer searches for the target message data in the message cache according to the storage location information. If the target message data does not exist in the cache, the consumer obtains the corresponding data packet from the main storage area, reads the metadata header in the data packet to parse the compression algorithm identifier, and calls the corresponding decompression method to restore the message data. 计算解压后消息数据的校验码与元数据头中记录的校验码进行比对,校验通过后将消息数据返回给消费者,消费者处理完消息后向协调器提交消费位点,协调器将消费位点记录到位点管理表,定期将位点管理表持久化到存储系统。The checksum of the decompressed message data is calculated and compared with the checksum recorded in the metadata header. After the checksum passes, the message data is returned to the consumer. After processing the message, the consumer submits the consumption location to the coordinator. The coordinator records the consumption location in the location management table and regularly persists the location management table to the storage system. 8.一种kafka存算分离的数据处理装置,其特征在于,所述装置包括:8. A data processing device with Kafka storage and computing separation, characterized in that the device comprises: 架构搭建模块,用于构建多层数据处理架构,所述多层数据处理架构包括网络通信层、任务分发层、处理调度层及存储层,所述存储层包括流存储库,所述流存储库包括预写日志区与主存储区,在所述预写日志区中划分多个日志分片,所述主存储区通过对象存储服务实现数据持久化,所述处理调度层包括副本管理器、协调器及负载预测器,所述副本管理器负责数据副本同步,所述协调器负责分区分配与组管理,所述负载预测器基于系统资源使用率建立预测模型;An architecture building module is used to build a multi-layer data processing architecture, wherein the multi-layer data processing architecture includes a network communication layer, a task distribution layer, a processing scheduling layer and a storage layer, wherein the storage layer includes a stream storage library, the stream storage library includes a pre-write log area and a main storage area, wherein a plurality of log shards are divided in the pre-write log area, and the main storage area realizes data persistence through an object storage service, wherein the processing scheduling layer includes a replica manager, a coordinator and a load predictor, wherein the replica manager is responsible for data replica synchronization, the coordinator is responsible for partition allocation and group management, and the load predictor establishes a prediction model based on system resource utilization; 消息存储模块,用于接收生产者发送的批量消息数据,计算所述批量消息数据的数据特征,根据所述数据特征选择压缩算法,对所述批量消息数据执行压缩编码生成校验码,将压缩后的消息数据写入所述预写日志区中的日志分片,所述日志分片采用固态硬盘存储,根据所述副本管理器的配置在多个所述日志分片之间建立主从复制关系,将所述日志分片中的消息数据异步上传至所述主存储区;A message storage module, used to receive batch message data sent by a producer, calculate data features of the batch message data, select a compression algorithm according to the data features, perform compression encoding on the batch message data to generate a checksum, write the compressed message data into a log shard in the pre-write log area, the log shard is stored in a solid-state hard disk, establish a master-slave replication relationship between the plurality of log shards according to the configuration of the replica manager, and asynchronously upload the message data in the log shard to the main storage area; 消息缓存模块,用于创建消息缓存区,所述消息缓存区用于存储热点消息数据,根据所述负载预测器的预测结果调整消息分区数量,消费者基于消息位点从所述消息缓存区读取所述热点消息数据,当所述消息缓存区中不存在目标消息数据时,所述消费者根据消息位点从所述主存储区获取冷数据并解压,所述协调器记录消费者的消费进度,基于所述校验码验证数据完整性。A message cache module is used to create a message cache area, wherein the message cache area is used to store hot message data, and the number of message partitions is adjusted according to the prediction result of the load predictor. The consumer reads the hot message data from the message cache area based on the message site. When the target message data does not exist in the message cache area, the consumer obtains and decompresses the cold data from the main storage area according to the message site. The coordinator records the consumer's consumption progress and verifies the data integrity based on the check code. 9.一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,其特征在于,所述处理器执行所述程序时实现权利要求1至7任一项所述的kafka存算分离的数据处理方法的步骤。9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the program, the steps of the Kafka storage-computation separation data processing method described in any one of claims 1 to 7 are implemented. 10.一种计算机可读存储介质,其上存储有计算机程序,其特征在于,该计算机程序被处理器执行时实现权利要求1至7任一项所述的kafka存算分离的数据处理方法的步骤。10. A computer-readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, the steps of the Kafka storage and computing separation data processing method described in any one of claims 1 to 7 are implemented.
CN202510616765.6A 2025-05-14 2025-05-14 Data processing method and device for kafka memory calculation separation Active CN120123108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510616765.6A CN120123108B (en) 2025-05-14 2025-05-14 Data processing method and device for kafka memory calculation separation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510616765.6A CN120123108B (en) 2025-05-14 2025-05-14 Data processing method and device for kafka memory calculation separation

Publications (2)

Publication Number Publication Date
CN120123108A true CN120123108A (en) 2025-06-10
CN120123108B CN120123108B (en) 2025-07-29

Family

ID=95921298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510616765.6A Active CN120123108B (en) 2025-05-14 2025-05-14 Data processing method and device for kafka memory calculation separation

Country Status (1)

Country Link
CN (1) CN120123108B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120301945A (en) * 2025-06-12 2025-07-11 青岛港国际股份有限公司 A cache information processing method, device and medium for port platform
CN120353614A (en) * 2025-06-26 2025-07-22 山东海量信息技术研究院 Heterogeneous computing system and cache consistency maintenance method, device, equipment and medium
CN121092467A (en) * 2025-09-01 2025-12-09 翼华科技(北京)有限公司 Resource management method and device based on combination of Slab distributor and Buddy system
CN121255499A (en) * 2025-12-03 2026-01-02 北京布洛克快链科技有限公司 Data decoupling centralized query method and system based on MQ message triggering
CN121502821A (en) * 2026-01-13 2026-02-10 江苏零浩网络科技有限公司 A Distributed Partition Storage Security Method and System for Carbon Data of New Energy Freight Vehicles

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949633A (en) * 2020-08-03 2020-11-17 杭州电子科技大学 A method for analyzing operation log of ICT system based on parallel stream processing
CN114035972A (en) * 2021-09-26 2022-02-11 阿里巴巴(中国)有限公司 Message queue processing method, device and message queue management system
CN114328133A (en) * 2022-03-16 2022-04-12 北京微芯感知科技有限公司 Single-mechanism distributed conflict detection method and system and deposit separation framework
CN115168061A (en) * 2022-09-09 2022-10-11 北京镜舟科技有限公司 Calculation storage separation method and system, electronic equipment and storage medium
CN115361386A (en) * 2022-08-17 2022-11-18 度小满科技(北京)有限公司 A method of cold and hot separation of kafka data storage and cluster server system
CN115577053A (en) * 2022-10-13 2023-01-06 深圳市坤同智能仓储科技有限公司 Multi-message consumption mode multi-storage processing method and device
CN115729442A (en) * 2021-08-31 2023-03-03 华为技术有限公司 Data processing method, device and storage medium based on big data platform
WO2024169158A1 (en) * 2023-02-14 2024-08-22 华为技术有限公司 Storage system, data access method, apparatus, and device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949633A (en) * 2020-08-03 2020-11-17 杭州电子科技大学 A method for analyzing operation log of ICT system based on parallel stream processing
CN115729442A (en) * 2021-08-31 2023-03-03 华为技术有限公司 Data processing method, device and storage medium based on big data platform
CN114035972A (en) * 2021-09-26 2022-02-11 阿里巴巴(中国)有限公司 Message queue processing method, device and message queue management system
CN114328133A (en) * 2022-03-16 2022-04-12 北京微芯感知科技有限公司 Single-mechanism distributed conflict detection method and system and deposit separation framework
CN115361386A (en) * 2022-08-17 2022-11-18 度小满科技(北京)有限公司 A method of cold and hot separation of kafka data storage and cluster server system
CN115168061A (en) * 2022-09-09 2022-10-11 北京镜舟科技有限公司 Calculation storage separation method and system, electronic equipment and storage medium
CN115577053A (en) * 2022-10-13 2023-01-06 深圳市坤同智能仓储科技有限公司 Multi-message consumption mode multi-storage processing method and device
WO2024169158A1 (en) * 2023-02-14 2024-08-22 华为技术有限公司 Storage system, data access method, apparatus, and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
一个天蝎座 白勺 程序猿: "大数据(7.4)Kafka存算分离架构深度实践:解锁对象存储的无限潜能", CSDN, 10 April 2025 (2025-04-10), pages 1 - 6 *
高子妍;王勇;: "面向云服务的分布式消息系统负载均衡策略", 计算机科学, no. 1, 15 June 2020 (2020-06-15) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120301945A (en) * 2025-06-12 2025-07-11 青岛港国际股份有限公司 A cache information processing method, device and medium for port platform
CN120301945B (en) * 2025-06-12 2025-09-26 青岛港国际股份有限公司 Cache information processing method, equipment and medium for port platform
CN120353614A (en) * 2025-06-26 2025-07-22 山东海量信息技术研究院 Heterogeneous computing system and cache consistency maintenance method, device, equipment and medium
CN121092467A (en) * 2025-09-01 2025-12-09 翼华科技(北京)有限公司 Resource management method and device based on combination of Slab distributor and Buddy system
CN121255499A (en) * 2025-12-03 2026-01-02 北京布洛克快链科技有限公司 Data decoupling centralized query method and system based on MQ message triggering
CN121502821A (en) * 2026-01-13 2026-02-10 江苏零浩网络科技有限公司 A Distributed Partition Storage Security Method and System for Carbon Data of New Energy Freight Vehicles

Also Published As

Publication number Publication date
CN120123108B (en) 2025-07-29

Similar Documents

Publication Publication Date Title
CN120123108B (en) Data processing method and device for kafka memory calculation separation
US20250147848A1 (en) Scalable log-based continuous data protection for distributed databases
US11422721B2 (en) Data storage scheme switching in a distributed data storage system
US10642840B1 (en) Filtered hash table generation for performing hash joins
US9330108B2 (en) Multi-site heat map management
US10764045B2 (en) Encrypting object index in a distributed storage environment
US10853182B1 (en) Scalable log-based secondary indexes for non-relational databases
US10659225B2 (en) Encrypting existing live unencrypted data using age-based garbage collection
US8103628B2 (en) Directed placement of data in a redundant data storage system
US9639589B1 (en) Chained replication techniques for large-scale data streams
CN119376645B (en) Data distributed storage method and device
US20220114064A1 (en) Online restore for database engines
US20150106578A1 (en) Systems, methods and devices for implementing data management in a distributed data storage system
CN106066896B (en) An application-aware big data deduplication storage system and method
US9547706B2 (en) Using colocation hints to facilitate accessing a distributed data storage system
KR20170054299A (en) Reference block aggregating into a reference set for deduplication in memory management
US10909143B1 (en) Shared pages for database copies
US11860869B1 (en) Performing queries to a consistent view of a data set across query engine types
CN120763170A (en) A distributed storage and indexing method
US10922012B1 (en) Fair data scrubbing in a data storage system
CN119200982B (en) Alluxio-based data storage and cache optimization method and system
US20180107404A1 (en) Garbage collection system and process
CN119537341A (en) Data migration method and device in large data volume scenario
US11210212B2 (en) Conflict resolution and garbage collection in distributed databases
Merli et al. Ursa: A Lakehouse-Native Data Streaming Engine for Kafka

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant