CN118819841A

CN118819841A - Load balancing method, device and electronic equipment

Info

Publication number: CN118819841A
Application number: CN202410881450.XA
Authority: CN
Inventors: Name withheld on request
Original assignee: Moore Threads Technology Co Ltd
Current assignee: Moore Threads Technology Co Ltd
Priority date: 2024-07-02
Filing date: 2024-07-02
Publication date: 2024-10-22

Abstract

The present disclosure relates to a load balancing method, a load balancing device, and an electronic device. The method comprises: obtaining a target service request; obtaining state information of each GPU node in a first graphics processing unit (GPU) node set, wherein the first GPU node set comprises GPU nodes that are currently available for processing the target service request, and the state information of any GPU node comprises the longest duration among the inference tasks currently executed by the GPU node and the number of times the GPU node has processed historical service requests in the session to which the target service request belongs; screening out at least one target GPU node from the first GPU node set based on the state information of each GPU node in the first GPU node set; and processing the target service request based on a target large language model using the at least one target GPU node. The present disclosure increases the probability that cached data of the large language model is hit, saves computing power, improves the efficiency of the GPU nodes, and greatly improves the quality of the inference service provided by the large language model.

Description

Load balancing method and device and electronic equipment

Technical Field

The disclosure relates to the technical field of computers, and in particular relates to a load balancing method, a load balancing device and electronic equipment.

Background

A large language model (LLM) is a prediction model in the field of natural language processing within artificial intelligence. It can handle dialogue tasks and give logical, comprehensive answers to questions raised by humans, and it can be applied in scenarios such as conversational robots, document rewriting, and intelligent search. The process by which a large language model produces its predicted output may be called an inference process. Although the overall prediction quality of large language models is good, their parameter scale is large and their inference time is long, and a certain amount of hardware resources, such as graphics processing units (GPUs), is needed to provide an inference service, so the demand on hardware resources is high. Scenarios such as conversational robots, document rewriting, and intelligent search require low latency and high availability, so large language models currently face certain difficulties in practical deployment. To address the long inference time and the difficulty of deployment, existing acceleration techniques fall into two categories: model computation-side optimization and input/output (I/O) optimization. Computation-side optimization reduces the waste of storage space during large language model computation by means such as KV Cache (key-value cache), distributed inference, operator fusion, and quantization compression. I/O optimization reduces the memory-access speed bottleneck by exploiting the differences in speed and capacity among the multiple storage levels of a GPU (cache, memory, and video memory); the main techniques include FlashAttention and continuous batching.

In practical deployment scenarios, for example when handling high-traffic concurrent service requests, a large language model inference service needs to receive tens or hundreds of service requests within a matter of seconds. When hardware resources (such as GPU resources) are limited, the model acceleration methods described above cannot meet the low-latency requirement, which greatly affects the quality of the inference service provided by the large language model.

Disclosure of Invention

In view of this, the present disclosure proposes a load balancing method, apparatus, electronic device, storage medium and computer program product.

According to an aspect of the present disclosure, there is provided a load balancing method, including:

Acquiring a target service request;

Acquiring state information of each GPU node in a first graphics processing unit (GPU) node set; the first GPU node set includes: GPU nodes currently available for processing the target service request; the state information of any GPU node comprises: the longest duration among the inference tasks currently executed by the GPU node and the number of times the GPU node has processed historical service requests in the session to which the target service request belongs;

screening at least one target GPU node from the first GPU node set based on the state information of each GPU node in the first GPU node set;

Processing the target service request based on a target large language model with the at least one target GPU node.

In one possible implementation manner, the screening at least one target GPU node from the first GPU node set based on the state information of each GPU node in the first GPU node set includes:

Calculating priority scores of all GPU nodes based on state information of all GPU nodes in the first GPU node set, wherein the priority scores represent priority degrees of processing the target service requests;

and screening at least one GPU node from the first GPU node set to serve as the at least one target GPU node according to the order of the priority scores of the GPU nodes from high to low.
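
The ranking step above can be sketched as follows. This is a minimal illustration, assuming the priority scores have already been computed and are held in a mapping from a node identifier to its score; the function name `select_target_nodes`, the node identifiers, and the sample scores are hypothetical and are not part of the disclosure.

```python
from typing import Dict, List


def select_target_nodes(scores: Dict[str, float], count: int = 1) -> List[str]:
    """Return the `count` node ids with the highest priority scores."""
    if count < 1:
        raise ValueError("count must be at least 1")
    # Sort from high score to low score; ties are broken by node id
    # (an assumption made here only to keep the result deterministic).
    ranked = sorted(scores, key=lambda node_id: (-scores[node_id], node_id))
    return ranked[:count]


# Hypothetical scores for three candidate GPU nodes.
example_scores = {"gpu-1": 0.4, "gpu-2": 1.7, "gpu-3": 0.9}
top_two = select_target_nodes(example_scores, count=2)
```

If fewer nodes are available than requested, the sketch simply returns all of them in ranked order.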

In one possible implementation, the priority score of a first GPU node is positively correlated with the longest duration of each inference task currently executed by the first GPU node, and positively correlated with the number of times the first GPU node processes a historical service request in a session to which the target service request belongs; wherein the first GPU node is any GPU node in the first GPU node set.

In one possible implementation manner, the calculating the priority score of each GPU node based on the state information of each GPU node in the first GPU node set includes:

calculating the absolute value of the difference between the longest duration among the inference tasks currently executed by the first GPU node and the average duration of processing historical inference tasks; the average duration of processing historical inference tasks represents the average time the GPU node takes from starting a historical inference task to completing it;

Calculating the product of the number of times the first GPU node has processed historical service requests in the session to which the target service request belongs and the time saving corresponding to a historical service request, wherein the time saving corresponding to the historical service request represents the time saved when the GPU node, in the process of executing an inference task, directly uses the cached information corresponding to the historical service request rather than computing that cached information;

Calculating a priority score of the first GPU node based on the absolute value of the difference and the product; wherein the priority score of the first GPU node is positively correlated with the absolute value of the difference value and positively correlated with the product.

In one possible implementation, the calculating the priority score of the first GPU node based on the absolute value of the difference and the product includes:

And carrying out weighted summation on the absolute value of the difference value and the product to obtain the priority score of the first GPU node.
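
One way to picture the weighted summation is the sketch below. It assumes fixed, equal weights purely for illustration; the parameter names (`w_diff`, `w_cache`, `time_saved`) and the sample numbers are hypothetical, and the disclosure does not prescribe specific weight values here.

```python
def priority_score(
    longest_duration: float,
    avg_duration: float,
    history_count: int,
    time_saved: float,
    w_diff: float = 0.5,
    w_cache: float = 0.5,
) -> float:
    """Weighted sum of |longest - average| and history_count * time_saved."""
    # Absolute difference between the longest running task and the
    # node's average historical task duration.
    abs_diff = abs(longest_duration - avg_duration)
    # Estimated time saved by reusing cached data from earlier requests
    # of the same session.
    cache_benefit = history_count * time_saved
    return w_diff * abs_diff + w_cache * cache_benefit


# Hypothetical node: longest task 8 s, average 6 s, 3 earlier requests
# from this session, 1.5 s saved per cached request.
score = priority_score(8.0, 6.0, 3, 1.5)
```

With these sample values the two terms are 2.0 and 4.5, so the equally weighted score is 3.25. The score grows with both terms, which matches the positive correlations stated above.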

In one possible implementation, the method further includes:

determining a resource type available to process the target service request;

obtaining a second set of GPU nodes, the second set of GPU nodes comprising: GPU nodes that currently have service request processing capacity;

And determining GPU nodes matched with the resource types in the second GPU node set as the GPU nodes currently available for processing the target service request.
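
The matching step can be sketched as a simple filter. The `GpuNode` record, its fields, and the type labels ("type-a", "model-x", and so on) are hypothetical names introduced only for illustration; the disclosure does not define a concrete data structure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GpuNode:
    node_id: str
    gpu_type: str    # hypothetical label for the type of GPU node
    model_type: str  # hypothetical label for the type of large language model


def match_resource_type(
    nodes: List[GpuNode],
    gpu_type: Optional[str] = None,
    model_type: Optional[str] = None,
) -> List[GpuNode]:
    """Keep nodes whose types match; a criterion left as None is not checked."""
    return [
        node
        for node in nodes
        if (gpu_type is None or node.gpu_type == gpu_type)
        and (model_type is None or node.model_type == model_type)
    ]


second_set = [
    GpuNode("gpu-1", "type-a", "model-x"),
    GpuNode("gpu-2", "type-b", "model-x"),
    GpuNode("gpu-3", "type-a", "model-y"),
]
first_set = match_resource_type(second_set, gpu_type="type-a", model_type="model-x")
```

Leaving either criterion unset corresponds to the "and/or" wording: the filter can match on the GPU node type, the model type, or both.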

In one possible implementation, the resource types include: the type of GPU node and/or the type of large language model.

In one possible implementation, the method further includes:

Determining the minimum GPU node quantity corresponding to the target service request;

The screening at least one target GPU node from the first GPU node set based on the state information of each GPU node in the first GPU node set includes:

In a case where the minimum number of GPU nodes does not exceed the number of GPU nodes in the first GPU node set, screening out, based on the state information of each GPU node in the first GPU node set, the minimum number of GPU nodes from the first GPU node set as the at least one target GPU node;

and/or,

Updating the first GPU node set in a case where the minimum number of GPU nodes exceeds the number of GPU nodes in the first GPU node set; and, when the number of GPU nodes in the updated first GPU node set is not less than the minimum number of GPU nodes, screening out, based on the state information of each GPU node in the updated first GPU node set, the minimum number of GPU nodes from the updated first GPU node set as the at least one target GPU node.
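
The two branches above can be sketched as follows. The sketch assumes priority scores are already available per node and that a caller-supplied `refresh` callback returns the updated first GPU node set with its scores; both names are hypothetical. Returning `None` when the updated set is still too small is an assumption of this sketch, since that case is not specified here.

```python
from typing import Callable, Dict, List, Optional


def select_with_minimum(
    scores: Dict[str, float],
    min_nodes: int,
    refresh: Callable[[], Dict[str, float]],
) -> Optional[List[str]]:
    """Pick `min_nodes` nodes, refreshing the candidate set once if it is too small."""
    if min_nodes > len(scores):
        # Not enough candidates: update the first GPU node set.
        scores = refresh()
    if min_nodes > len(scores):
        # Assumption: signal "no selection possible yet" to the caller.
        return None
    ranked = sorted(scores, key=lambda node_id: (-scores[node_id], node_id))
    return ranked[:min_nodes]


# Hypothetical usage: one candidate now, two after the refresh.
picked = select_with_minimum(
    {"gpu-1": 1.0},
    min_nodes=2,
    refresh=lambda: {"gpu-1": 1.0, "gpu-2": 2.0},
)
```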

In one possible implementation, in a case where the longest duration among the inference tasks currently executed by the first GPU node exceeds the average duration of processing historical inference tasks, the weight of the absolute value of the difference is positively correlated with the longest duration among the inference tasks currently executed by the first GPU node, and the weight of the product is negatively correlated with the longest duration among the inference tasks currently executed by the first GPU node;

and/or,

In a case where the longest duration among the inference tasks currently executed by the first GPU node does not exceed the average duration of processing historical inference tasks, the weight of the absolute value of the difference is negatively correlated with the longest duration among the inference tasks currently executed by the first GPU node, and the weight of the product is positively correlated with the longest duration among the inference tasks currently executed by the first GPU node.
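
The text above fixes only the direction of each correlation, not a formula. The sketch below uses one assumed functional form, the ratio longest / (longest + average), chosen solely because it satisfies the stated correlations; the function name `dynamic_weights` and this ratio are illustrative assumptions, not the method of the disclosure.

```python
from typing import Tuple


def dynamic_weights(longest: float, average: float) -> Tuple[float, float]:
    """Return (weight of |difference|, weight of product) for one GPU node."""
    total = longest + average
    if total <= 0:
        # Assumption: fall back to equal weights when no timing data exists.
        return 0.5, 0.5
    ratio = longest / total  # increases monotonically with `longest`
    if longest > average:
        # Difference weight rises with `longest`; product weight falls.
        return ratio, 1.0 - ratio
    # Difference weight falls with `longest`; product weight rises.
    return 1.0 - ratio, ratio


w_diff, w_cache = dynamic_weights(9.0, 3.0)
```

Under this assumed form the two weights always sum to 1, which keeps scores from different nodes on a comparable scale.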

In one possible implementation manner, the obtaining the second GPU node set includes:

According to the distributed semaphore corresponding to each candidate GPU node in a candidate GPU node set, screening out, from the candidate GPU node set, the GPU nodes that currently have service request processing capacity;

the method further comprises the steps of:

and updating the distributed semaphore corresponding to the at least one target GPU node.
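
As a rough illustration of the semaphore bookkeeping, the sketch below keeps one permit counter per node in a single process, guarded by a lock. This is only a local stand-in: a distributed semaphore would normally live in a store shared by all load balancer instances, which is not modeled here. The class name `NodeSemaphores` and its methods are hypothetical.

```python
import threading
from typing import Dict, List


class NodeSemaphores:
    """In-process stand-in for per-node distributed semaphores."""

    def __init__(self, capacity: Dict[str, int]) -> None:
        self._capacity = dict(capacity)  # maximum concurrent requests per node
        self._permits = dict(capacity)   # permits currently free per node
        self._lock = threading.Lock()

    def available_nodes(self) -> List[str]:
        """Nodes that can currently accept at least one more request."""
        with self._lock:
            return sorted(n for n, p in self._permits.items() if p > 0)

    def acquire(self, node_id: str) -> bool:
        """Take one permit when a node is chosen as a target; False if none left."""
        with self._lock:
            if self._permits.get(node_id, 0) <= 0:
                return False
            self._permits[node_id] -= 1
            return True

    def release(self, node_id: str) -> None:
        """Return one permit when the node finishes its inference task."""
        with self._lock:
            if self._permits[node_id] < self._capacity[node_id]:
                self._permits[node_id] += 1


semaphores = NodeSemaphores({"gpu-1": 1, "gpu-2": 0})
```

Here `available_nodes` plays the role of screening the candidate set, and `acquire` and `release` play the role of updating the semaphore of a target node.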

In one possible implementation, the method further includes: synchronizing messages between the at least one target GPU node and the node that issued the target service request, using a key-value database as message middleware.

According to another aspect of the present disclosure, there is provided a load balancing apparatus, the apparatus comprising:

The acquisition module is used for acquiring the target service request;

The acquisition module is further used for acquiring state information of each GPU node in a first graphics processing unit (GPU) node set; the first GPU node set includes: GPU nodes currently available for processing the target service request; the state information of any GPU node comprises: the longest duration among the inference tasks currently executed by the GPU node and the number of times the GPU node has processed historical service requests in the session to which the target service request belongs;

The screening module is used for screening at least one target GPU node from the first GPU node set based on the state information of each GPU node in the first GPU node set;

And the processing module is used for processing the target service request based on a target large language model by utilizing the at least one target GPU node.

According to another aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing the instructions stored by the memory.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the above-described method.

According to another aspect of the present disclosure, there is provided a computer program product comprising a computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.

According to the embodiments of the present disclosure, based on the cache characteristics of a large language model during service, at least one target GPU node is screened out from a first GPU node set according to the state information of each GPU node in the set, namely the longest duration among the inference tasks currently executed by the GPU node and the number of times the GPU node has processed historical service requests in the session to which the target service request belongs. The more times a GPU node has processed historical service requests in that session, the higher the probability that cached data is hit when the target service request is processed; and the longer the longest duration among the currently executed inference tasks, the sooner the GPU node can release more computing resources to execute other inference tasks. In this way, when the screened target GPU node processes the target service request based on the target large language model, its cached data is more likely to be hit, which saves computing power and improves the efficiency of the GPU node. The computation bottleneck under limited hardware resources is reduced, hardware resources can be used more fully when high-traffic concurrent service requests are processed, and the throughput of the large language model is used effectively, so that frequent queuing of user service requests is avoided, service latency is low, and the quality of the inference service provided by the large language model is greatly improved.

Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.

Fig. 1 illustrates a schematic diagram of a large language model service system according to an embodiment of the present disclosure.

Fig. 2 shows a flow chart of a load balancing method according to an embodiment of the present disclosure.

Fig. 3 illustrates a flowchart of a target GPU node screening method according to an embodiment of the present disclosure.

Fig. 4 shows a flow chart of a load balancing method according to an embodiment of the present disclosure.

Fig. 5 shows a flow chart of a load balancing method according to an embodiment of the present disclosure.

Fig. 6 shows a block diagram of a load balancing apparatus according to an embodiment of the present disclosure.

Fig. 7 illustrates a block diagram of an electronic device 1900 according to an embodiment of the disclosure.

Detailed Description

Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "exemplary," "in one embodiment," "in some embodiments," "in other embodiments," and the like in various places throughout this specification are not necessarily all referring to the same embodiment, but mean "one or more, but not all, embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.

In the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: A alone, both A and B, or B alone, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of" or similar expressions means any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or plural.

In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.

Fig. 1 illustrates a schematic structure of a large language model service system according to an embodiment of the present disclosure. As shown in fig. 1, the large language model service system may include: GPU node set 101 and load balancing device 102.

The GPU node set 101 may include one or more GPU nodes (which may also be referred to as service nodes or nodes). The GPU node set 101 shown in fig. 1 includes N GPU nodes, namely GPU node 1, GPU node 2, …, GPU node N, where N is a positive integer; these GPU nodes are used to perform the inference tasks of the large language model.

Illustratively, a GPU node may be a physical device configured with a GPU. The physical device may be a terminal, a server, another computer device, or a part of a computer device. The server may be a stand-alone physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, and big data and artificial intelligence platforms; one server can be used as one GPU node. The terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, an intelligent voice interaction device, a smart home appliance, a vehicle-mounted terminal, or the like; one terminal can be used as one GPU node. As an example, one or more processors may be configured in a computer device such as a terminal or a server, and when the computer device is configured with a plurality of processors, all or some of them may be used as one GPU node; for example, if a computer device is configured with multiple GPUs, each GPU may be regarded as one GPU node, or several GPUs together may be regarded as one GPU node.

The load balancing device 102 is configured to receive a service request from a user, determine, by means of a load balancing method, some or all of the GPU nodes in the GPU node set 101 as the GPU nodes that process the service request, and then process the service request based on a large language model using those GPU nodes, each of which executes its corresponding inference task to complete the processing of the service request.

The load balancing device 102 may be, for example, software, hardware, or a combination of software and hardware, which is not limited herein.

In order to improve the quality of the inference service provided by a large language model, in the related art, for example in the vLLM framework, the load balancing device 102 may be configured with a traditional load balancing method: after receiving a service request from a user, it directly determines the GPU node that processes the service request according to the traffic of each GPU node in the GPU node set 101. While the GPU node processes the service request, non-contiguous storage of the large language model's cached data is realized through paged attention (PagedAttention), which improves hardware resource utilization to a certain extent and reduces latency. However, the vLLM framework adopts a traditional load balancing method and distributes service requests only according to the traffic of the GPU nodes. Under limited hardware resources, computation becomes the bottleneck, and frequent queuing of user service requests still occurs when high-traffic concurrent service requests are processed. In addition, although the large language model's cached data is stored for acceleration while a GPU node processes a service request, the probability that this cached data is hit in subsequent processing is extremely low.

In order to solve the above technical problems, the embodiments of the present disclosure provide a dynamic load balancing method for large language model inference service scenarios (described in detail below). The method takes into account both the actual load capacity of the GPU nodes and the cache characteristics of the large language model during service when scheduling GPU nodes to process service requests, which effectively increases the probability that cached data of the large language model is hit, saves computing power, and improves the efficiency of the GPU nodes. The computation bottleneck under limited hardware resources is reduced, hardware resources can be used more fully when high-traffic concurrent service requests are processed, and the throughput of the large language model is used effectively, so that frequent queuing of user service requests is avoided, service latency is low, and the quality of the inference service provided by the large language model is greatly improved.

It should be noted that, the large language model service system described in the embodiments of the present disclosure is for more clearly describing the technical solution of the embodiments of the present disclosure, and does not constitute a limitation to the application scenario of the technical solution provided in the embodiments of the present disclosure, and those skilled in the art can know that, for other similar or new scenarios, the technical solution provided in the embodiments of the present disclosure is applicable to similar technical problems.

The load balancing method provided by the embodiment of the present disclosure is described in detail below.

Fig. 2 shows a flow chart of a load balancing method according to an embodiment of the present disclosure. Illustratively, the method may be performed by the large language model service system 10 or the load balancing device 102 of fig. 1 described above, and as shown in fig. 2, the method may include the steps of:

Step 201, obtaining a target service request.

The service request may also be referred to as an inference request, and may include, for example, a request for a natural language processing task such as text generation, translation, question-answering, and the like.

Illustratively, the large language model service system may provide multiple access interfaces and support multiple access protocols, and thus receive different types of service requests, for example HTTP or gRPC service requests. As an example, a user or an application program may send a service request to the load balancing device 102 through an interactive interface or another interface. For example, in a conversational robot scenario, the user may enter a question through the interactive interface by voice, text input, or the like, and thereby trigger a service request for answering the question; this service request is the target service request. After the load balancing device 102 receives the service request triggered by the user, it executes the following steps to invoke suitable GPU nodes to process the service request. It can be understood that, in the conversational robot scenario, a user may ask multiple rounds of questions within one session with the machine, and each question triggers one service request; the user may thus trigger multiple service requests, and the target service request is any one of them.

Illustratively, after the load balancing device 102 receives the target service request, the target service request may be parsed and preprocessed. For example, the data format of the target service request may be converted. As one example, the load balancing device 102 may implement authentication and parsing of the target service request using the Go language and its ecosystem components.

Illustratively, after the load balancing device 102 parses the target service request, relevant parameters of the target service request may be checked, e.g., the relevant parameters may include keywords, phrases, etc., indicated by the target service request.

For example, the target service request may be assigned an identifier that is globally unique within the large language model service system, so that it can be distinguished from other service requests.
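
A minimal sketch of assigning such an identifier is shown below. It uses a random UUID from the Python standard library, which is one common way to obtain identifiers that are unique in practice; the disclosure does not state how the identifier is generated, and the `ServiceRequest` record and its fields are hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ServiceRequest:
    session_id: str  # session to which the request belongs
    prompt: str      # user question carried by the request
    # Assumption: a random UUID (hex form) serves as the unique identifier.
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)


first = ServiceRequest(session_id="session-1", prompt="What is load balancing?")
second = ServiceRequest(session_id="session-1", prompt="How does caching help?")
```

Both requests share a session identifier but receive different request identifiers, so later steps can track them separately.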

Step 202, obtaining state information of each GPU node in the first GPU node set.

The first GPU node set comprises: GPU nodes currently available for processing the target service request. As an example, when all GPU nodes configured in the large language model service system are homogeneous, the GPU nodes that currently have available load capacity may be used as the GPU nodes currently available for processing the target service request; as another example, when heterogeneous GPU nodes are configured in the large language model service system, the GPU nodes currently available for processing the target service request may be screened out from the GPU nodes that currently have available load capacity, as described in the related description below.

In one possible implementation, the stability of the orchestration links of the large language model service system and related metrics (such as signal strength, bit error rate, and delay) can be monitored through a monitoring and reporting mechanism to measure the overall pressure on the system; according to changes in the overall pressure, new GPU nodes can be added to the system or existing GPU nodes can be removed, thereby dynamically scaling the large language model service system out or in.

In the first GPU node set, the state information of any GPU node includes: the longest duration among the inference tasks currently executed by the GPU node and the number of times the GPU node has processed historical service requests in the session to which the target service request belongs.

For example, in a clustered GPU node deployment, multiple model instances (also referred to as service instances) may be configured in the large language model service system. Each model instance can receive service requests up to a certain concurrency, and multiple model instances may run at the same time so as to handle high-traffic concurrent service requests. A model instance represents a concrete large language model base and comprises the model structure file and the model weights of the large language model, where the model structure file contains the hierarchical structure information of the model and the model weights are the parameters learned during model pre-training. Each model instance may be used to generate text, answer questions, or perform general natural language processing tasks. The model instances are consistent with one another: they use the same large language model architecture and parameter set, which ensures consistent performance and behavior across all model instances when they process the same task. Each model instance may be implemented by one or more GPU nodes, that is, one service request may be processed by one or more GPU nodes. Likewise, each GPU node may be capable of implementing one or more model instances, that is, a GPU node may implement one or more model instances at the same time and thus execute multiple large language model inference tasks simultaneously. It can be appreciated that each inference task executed by a GPU node lasts for a period of time; the longest of these periods is the longest duration among the inference tasks currently executed by the GPU node, which may also be referred to as the longest currently executed inference task time of the GPU node.
As an example, a GPU node may be configured to perform at most four inference tasks at the same time. For example, suppose the GPU node is currently performing inference task A and inference task B, where the duration of inference task A is 5 s, that is, as of the current time the GPU node has been executing inference task A continuously for 5 s and is still executing it, and the duration of inference task B is 8 s, that is, as of the current time the GPU node has been executing inference task B continuously for 8 s; the longest duration among the inference tasks currently executed by the GPU node is then 8 s. As another example, if the GPU node is currently executing inference task C, whose duration is 10 s, that is, as of the current time the GPU node has been executing inference task C continuously for 10 s and is still executing it, then the longest duration among the inference tasks currently executed by the GPU node is 10 s.
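
The example with inference tasks A and B can be reproduced with a short sketch. It assumes each running task is tracked by its start time in seconds; the function name `longest_running_duration` and the timestamps are hypothetical, and returning 0.0 for an idle node is an assumption of this sketch.

```python
from typing import Dict


def longest_running_duration(start_times: Dict[str, float], now: float) -> float:
    """Longest elapsed time among the tasks a node is currently executing."""
    if not start_times:
        # Assumption: an idle node reports a longest duration of 0.
        return 0.0
    return max(now - started for started in start_times.values())


# Task A has been running for 5 s and task B for 8 s at time now = 100.
running = {"A": 95.0, "B": 92.0}
longest = longest_running_duration(running, now=100.0)
```

With these values the result is 8.0, matching the 8 s stated in the example.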

In a dialogue robot scenario, a user can hold a session with the large language model service system. During the session, the user may exchange "one question, one answer" with the robot multiple times; each question from the user is a service request, and after receiving a question, the large language model service system generates reply information through inference and presents it to the user. Within the same session, the user's successive questions may be correlated; for example, the user may ask about one matter from different angles or in progressive depth. Therefore, each time a GPU node finishes executing an inference task, it caches the corresponding data (such as the result or intermediate data of that inference) in the GPU node, so that the GPU node can directly use the cached data when processing later service requests of the same session, which improves the inference efficiency of the GPU node. Illustratively, after a session ends, the large language model service system may release the cached data of each GPU node. It can be understood that, for each service request, an appropriate GPU node is screened out of the first GPU node set as the target GPU node for processing that service request, so different GPU nodes may be screened out for successive service requests; that is, within the same session, the GPU nodes that process different service requests may be different or the same.
For any GPU node, the number of service requests of the session to which the target service request belongs that the GPU node has already processed is the number of times the GPU node has processed historical service requests in that session. If the GPU node has not processed any historical service request in the session to which the target service request belongs, the number of times is 0; if the GPU node has processed one historical service request in that session, the number of times is 1; and so on.
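As an illustration of the per-session count described above, the sketch below keeps one counter per (session, node) pair. The class name `SessionHistory`, its methods, and the identifiers `s1` and `gpu-0` are hypothetical and not taken from the disclosure.

```python
from collections import defaultdict


class SessionHistory:
    """Counts how many requests of a session each node has processed."""

    def __init__(self):
        self._counts = defaultdict(int)

    def record(self, session_id, node_id):
        # Called after a node has processed one request of the session.
        self._counts[(session_id, node_id)] += 1

    def count(self, session_id, node_id):
        # 0 when the node has not processed any request of the session.
        return self._counts.get((session_id, node_id), 0)

    def end_session(self, session_id):
        # Drop the counters of a finished session (assumed to accompany
        # the release of the session's cached data).
        for key in [k for k in self._counts if k[0] == session_id]:
            del self._counts[key]


history = SessionHistory()
history.record("s1", "gpu-0")
history.record("s1", "gpu-0")
history.record("s1", "gpu-1")
print(history.count("s1", "gpu-0"), history.count("s1", "gpu-2"))  # 2 0
```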

Because the volume of service requests in the large language model service system is usually large, each GPU node executes a large number of inference tasks and the data cached by each GPU node is complex, so the state information of each GPU node changes frequently. Illustratively, the state information of each GPU node in the large language model service system may be kept dynamically updated, so that the latest state information of each GPU node can be obtained at any moment.

Step 203, screening at least one target GPU node from the first GPU node set based on the state information of each GPU node in the first GPU node set.

In this step, the load balancing device 102 may screen at least one target GPU node from the first GPU node set based on the longest duration of each inference task currently executed by each GPU node in the first GPU node set and the number of times of processing the historical service request in the session to which the target service request belongs. Considering that when a service request is processed based on a large language model, the duration of executing each reasoning task for a GPU node is generally limited, therefore, for any GPU node, the longer the longest duration of each reasoning task currently executed by the GPU node, the more likely the reasoning task corresponding to the longest duration approaches the end, so that the GPU node can release more computing resources to execute other reasoning tasks more quickly. Meanwhile, for a certain service request, if the number of times that a GPU node processes a history service request in a session to which the service request belongs is greater, the data cached in the GPU node has greater correlation with the data involved in processing the service request, so that for any GPU node, the greater the number of times that the GPU node processes a history service request in a session to which a target service request belongs, the higher the efficiency of the GPU node in processing the target service request will generally be. Therefore, the appropriate target GPU node can be screened based on the longest duration of each reasoning task currently executed by each GPU node in the first GPU node set and the times of processing the historical service request in the session to which the target service request belongs, and the efficiency of processing the target service request by the subsequent target GPU node can be effectively improved.

In one possible implementation, before performing this step 203, the minimum GPU node number corresponding to the target service request may also be determined; and further, different strategies can be flexibly executed according to the relative sizes of the minimum GPU node number corresponding to the target service request and the GPU node number in the first GPU node set, so that at least one target GPU node is screened out from the first GPU node set. The minimum number of GPU nodes may be configured according to the requirement, for example, may be 1.

Scenario one: when the minimum GPU node number does not exceed the number of GPU nodes in the first GPU node set, GPU nodes equal in number to the minimum GPU node number are screened out of the first GPU node set, based on the state information of each GPU node in the first GPU node set, as the at least one target GPU node. In this scenario, the minimum GPU node number corresponding to the target service request does not exceed the number of GPU nodes in the first GPU node set, that is, the large language model service system currently has enough idle GPU node computing resources to process the target service request, so the strategy of directly screening out the target GPU nodes is executed.

Scenario two: when the minimum GPU node number exceeds the number of GPU nodes in the first GPU node set, the first GPU node set is updated; when the number of GPU nodes in the updated first GPU node set is not less than the minimum GPU node number, GPU nodes equal in number to the minimum GPU node number are screened out of the updated first GPU node set, based on the state information of each GPU node in the updated first GPU node set, as the at least one target GPU node. In this scenario, the minimum GPU node number corresponding to the target service request exceeds the number of GPU nodes in the first GPU node set, that is, the currently idle GPU node computing resources in the large language model service system are insufficient to process the target service request, so a queuing strategy is executed: the target service request enters a queue and waits while the first GPU node set is dynamically updated. The number of GPU nodes in the first GPU node set changes in real time with each update, and once the minimum GPU node number corresponding to the target service request no longer exceeds the number of GPU nodes in the first GPU node set, the target GPU nodes are screened out.

The load condition of each GPU node in the large language model service system can be updated in real time, and accordingly, the number of GPU nodes currently available for processing the target service request contained in the first GPU node set is dynamically changed, so that when the first GPU node set has enough GPU nodes available for processing the target service request, the target GPU node for processing the target service request can be timely screened out for the target service request.
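The two scenarios can be sketched as a dispatch function plus a retry that runs whenever the first GPU node set is refreshed. This is a simplified illustration under stated assumptions: `try_dispatch`, `retry_head`, and the `select` callback (a stand-in for the screening of step 203) are hypothetical names, the queue is a plain in-process FIFO, and only the request at the head of the queue is retried.

```python
from collections import deque


def try_dispatch(request_id, min_nodes, available_nodes, wait_queue, select):
    """Scenario one: enough nodes, screen now. Scenario two: enqueue and wait."""
    if min_nodes <= len(available_nodes):
        return select(available_nodes, min_nodes)
    wait_queue.append((request_id, min_nodes))
    return None


def retry_head(wait_queue, available_nodes, select):
    """Call after the first GPU node set has been updated."""
    if wait_queue and wait_queue[0][1] <= len(available_nodes):
        request_id, min_nodes = wait_queue.popleft()
        return request_id, select(available_nodes, min_nodes)
    return None


def first_k(nodes, k):
    # Placeholder for the real screening by state information.
    return nodes[:k]


queue = deque()
print(try_dispatch("r1", 2, ["gpu-0"], queue, first_k))  # None (queued)
print(retry_head(queue, ["gpu-0", "gpu-1", "gpu-2"], first_k))
# ('r1', ['gpu-0', 'gpu-1'])
```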

Step 204, processing the target service request based on a target large language model by using the at least one target GPU node.

In the step, each target GPU node in the at least one target GPU node determined in the step 203 is utilized to process a target service request, so as to realize scheduling of the at least one target GPU node to execute an inference task corresponding to a target large language model, and the target large language model can be utilized to provide language generating capability in the inference process; because each target GPU node is determined based on the longest duration of each reasoning task currently executed by each GPU node in the first GPU node set and the times of processing the historical service requests in the session to which the target service requests belong, the probability that the target GPU node hits cache data when executing the reasoning tasks is improved, the computing resources of the target GPU node are saved, and the target service request processing efficiency is improved.

The target large language model may be preconfigured or may be determined according to a relevant parameter of the target service request, for example, the relevant parameter of the target service request includes a type of large language model, and the large language model of the type may be regarded as the target large language model.

In one possible implementation, the method further includes: synchronizing messages between the at least one target GPU node and the node issuing the target service request by using a Key-Value (KV) database as message middleware. The node that issues the target service request is the load balancing device 102; illustratively, the KV database may be Redis. In this way, the load balancing device 102 can schedule the target GPU node through the message middleware, and the target GPU node pulls the messages of the load balancing device 102 through the message middleware and executes the corresponding inference task. While the target GPU node is executing the target service request, the large language model service system can subscribe to the target service request and still continue to acquire new service requests, performing the above steps for each new service request to determine the GPU nodes that process it, thereby realizing asynchronous scheduling.

As an example, the target service request may be distributed to multiple target GPU nodes, so that the multiple target GPU nodes may execute corresponding reasoning tasks in parallel and obtain corresponding reasoning results; finally, after each target GPU node completes the reasoning task, reporting a corresponding reasoning result; after all the target GPU nodes complete the reasoning task, summarizing the reasoning results reported by all the target GPU nodes, so as to generate a processing result corresponding to the target service request; and the processing result can be fed back to the user through an interface accessed by the user. Therefore, the user can access the large language model system by using the preset access interface, and obtain the processing result of the service request, so that the method is convenient and simple and has stronger universality.
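A minimal sketch of the distribute-and-summarize flow described above, using a thread pool to stand in for parallel execution on several GPU nodes. `run_on_node` is a hypothetical placeholder for the actual inference call, and the `aggregate` callback is an assumption; the disclosure does not specify how results are summarized. The list of target nodes is assumed to be non-empty.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_node(node_id, request):
    # Hypothetical stand-in for running the inference task on one GPU node.
    return f"{node_id}:{request}"


def process_request(request, target_nodes, aggregate):
    """Run the request on every target node in parallel, then summarize."""
    with ThreadPoolExecutor(max_workers=len(target_nodes)) as pool:
        futures = [pool.submit(run_on_node, node, request) for node in target_nodes]
        # result() blocks, so aggregation starts only after all nodes finish.
        results = [future.result() for future in futures]
    return aggregate(results)


print(process_request("hello", ["gpu-0", "gpu-1"], " | ".join))
# gpu-0:hello | gpu-1:hello
```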

In the embodiments of the present disclosure, based on the cache characteristics of a large language model during service, at least one target GPU node is screened out of the first GPU node set according to the state information of each GPU node in the first GPU node set, namely the longest duration of the inference tasks currently executed by the GPU node and the number of times the GPU node has processed historical service requests in the session to which the target service request belongs. The more times a GPU node has processed historical service requests in that session, the higher the probability of hitting cached data when it processes the target service request; and the longer the longest duration of the inference tasks it is currently executing, the sooner it can release computing resources to execute other inference tasks. In this way, the screened-out target GPU node has a higher probability of hitting its cached data when processing the target service request based on the target large language model, which saves computing power and improves the efficiency of the GPU node. The computing bottleneck under limited hardware resources is reduced, hardware resources can be used more fully when processing high-volume concurrent service requests, and the throughput of the large language model is effectively exploited, so that frequent queuing of user service requests is avoided, service latency is low, and the quality of the inference service provided by the large language model is greatly improved.

In the above step 203, possible implementations of the screening of the target GPU node are exemplarily described below.

Fig. 3 illustrates a flowchart of a target GPU node screening method according to an embodiment of the present disclosure. Illustratively, the method may be performed by the large language model service system 10 or the load balancing device 102 of fig. 1 described above, and as shown in fig. 3, the method may include the steps of:

Step 301, calculating a priority score of each GPU node based on state information of each GPU node in the first GPU node set, where the priority score represents a priority degree of processing the target service request.

Based on the foregoing, for any GPU node, the longer the longest duration of each inference task currently executed by the GPU node, the more likely the inference task corresponding to the longest duration approaches the end, so that the GPU node can release the computing resource faster to better process the target service request. Meanwhile, for any GPU node, the more times the GPU node processes the historical service request in the session to which the target service request belongs, the higher the efficiency of the GPU node in processing the target service request. Thus, the priority score for each GPU node in the first set of GPU nodes may be calculated based on the longest duration of each inference task currently performed by each GPU node and the number of times a historical service request in the session to which the target service request belongs is processed.

Illustratively, the priority score of the first GPU node is positively correlated with the longest duration of each inference task currently executed by the first GPU node, and positively correlated with the number of times the first GPU node processes a history service request in a session to which the target service request belongs; wherein the first GPU node is any GPU node in the first GPU node set. The longer the longest duration of each reasoning task currently executed by the GPU node is, the more times the GPU node processes the historical service requests in the session to which the target service requests belong, the higher the corresponding priority score is, namely the higher the priority degree of processing the target service requests is; conversely, the lower the corresponding priority score, the lower the priority of processing the target service request.

In one possible implementation, this step 301 may include: calculating the absolute value of the difference between the longest duration of the inference tasks currently executed by the first GPU node and the average duration of processing historical inference tasks; calculating the product of the number of times the first GPU node has processed historical service requests in the session to which the target service request belongs and the time saved per historical service request; and calculating the priority score of the first GPU node based on the absolute value of the difference and the product; wherein the priority score of the first GPU node is positively correlated with the absolute value of the difference and positively correlated with the product.

The average duration of processing historical inference tasks represents the average time from when a GPU node obtains a historical inference task to when it finishes executing that task. The average duration may be a quantitative statistic. For example, information on a plurality of historical inference tasks may be collected, where the information collected for any historical inference task may include the time from when the GPU node obtained the task to when it finished executing the task; the mean of these times over the plurality of historical inference tasks is taken as the average duration of processing historical inference tasks. As one example, the average duration of the GPU node processing historical inference tasks may be 20s.

The time saved per historical service request represents the time that a GPU node saves, while executing an inference task, by directly using the cached information corresponding to one historical service request instead of computing that information itself. The time saved may be a quantitative statistic. Illustratively, processing information for a plurality of historical service requests may be collected, where the processing information collected for each historical service request may include the time saved by directly using the cached information corresponding to that historical service request instead of having the GPU node compute it.

Illustratively, calculating the priority score of the first GPU node based on the absolute value of the difference and the product may include: performing a weighted summation of (i) the absolute value of the difference between the longest duration of the inference tasks currently executed by the first GPU node and the average duration of processing historical inference tasks and (ii) the product of the number of times the first GPU node has processed historical service requests in the session to which the target service request belongs and the time saved per historical service request, to obtain the priority score of the first GPU node.

For example, the priority Score of the first GPU node may be calculated by the following formula (1):

$$\mathrm{Score} = a \times T \times n + b \times \lvert t - M \rvert \tag{1}$$

Where T represents the time saved per historical service request; n represents the number of times the first GPU node has processed historical service requests in the session to which the target service request belongs; T×n represents the product of the number of times the first GPU node has processed historical service requests in that session and the time saved per historical service request; a is the weight of the product T×n; t represents the longest duration of the inference tasks currently executed by the first GPU node; M represents the average duration of processing historical inference tasks; |t-M| represents the absolute value of the difference between the longest duration of the inference tasks currently executed by the first GPU node and the average duration of processing historical inference tasks; and b is the weight of the absolute value |t-M|.
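As a worked illustration of formula (1), the sketch below evaluates the score for one node, with n the number of historical requests of the session the node has processed, T the time saved per historical request, t the longest duration of the node's running tasks, and M the average duration of historical inference tasks. The values t = 8 s and M = 20 s follow the examples in this description; T = 1.5 s, n = 2, and the fixed weights a = b = 1.0 are assumptions chosen only for the example.

```python
def priority_score(n, t, T, M, a=1.0, b=1.0):
    """Formula (1): Score = a * T * n + b * |t - M|.

    The default weights a = b = 1.0 are illustrative assumptions.
    """
    return a * T * n + b * abs(t - M)


# 1.0 * 1.5 * 2 + 1.0 * |8 - 20| = 3 + 12
print(priority_score(n=2, t=8.0, T=1.5, M=20.0))  # 15.0
```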

In one possible implementation, in a case where the longest duration of each of the inference tasks currently performed by the first GPU node exceeds the average duration of the processing history inference tasks, the weight of the absolute value of the difference value is positively correlated with the longest duration of each of the inference tasks currently performed by the first GPU node, and the weight of the product is negatively correlated with the longest duration of each of the inference tasks currently performed by the first GPU node; and/or, in the case that the longest duration of each inference task currently executed by the first GPU node does not exceed the average duration of the processing history inference tasks, the weight of the absolute value of the difference value is inversely related to the longest duration of each inference task currently executed by the first GPU node, and the weight of the product is positively related to the longest duration of each inference task currently executed by the first GPU node.

Because the duration of any GPU node processing historical inference tasks in the large language model service system basically follows a normal distribution, a certain long-tail effect exists. For the first GPU node, if the longest duration of the inference tasks currently executed by the first GPU node exceeds the average duration of processing historical inference tasks, then the larger the value of the longest duration, the more likely it is that the corresponding inference task is close to finishing, and therefore the more likely it is that the computing resources of the first GPU node can soon be scheduled to execute other inference tasks. In this case, compared with the product of the number of times the first GPU node has processed historical service requests in the session to which the target service request belongs and the time saved per historical service request, the absolute value of the difference between the longest duration of the inference tasks currently executed by the first GPU node and the average duration of processing historical inference tasks has a greater influence on the priority score of the first GPU node, so the weight of the absolute value of the difference is larger and, correspondingly, the weight of the product is smaller. If the longest duration of the inference tasks currently executed by the first GPU node does not exceed the average duration of processing historical inference tasks, then the smaller the value of the longest duration, the less likely it is that the corresponding inference task is close to finishing. In this case, compared with the absolute value of the difference between the longest duration of the inference tasks currently executed by the first GPU node and the average duration of processing historical inference tasks, the product of the number of times the first GPU node has processed historical service requests in the session to which the target service request belongs and the time saved per historical service request has a greater influence on the priority score of the first GPU node, the larger the weight of the absolute value of the difference, and correspondingly the larger the weight of the product.

For example, taking the above formula (1) as an example, the values of the weights a and b may be adjusted according to the relative magnitudes of the longest duration t of the inference tasks currently executed by the first GPU node and the average duration M of processing historical inference tasks: when t exceeds M, the larger t is, the smaller a is and the larger b is; when t does not exceed M, the larger t is, the larger a is and the smaller b is.

Step 302, screening at least one GPU node from the first GPU node set according to the order of the priority scores of the GPU nodes from high to low, and using the at least one GPU node as the at least one target GPU node.
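Step 302 amounts to a top-k selection over the priority scores. The sketch below is one way to express it; the function name `select_targets`, the node identifiers, and the score values are hypothetical.

```python
import heapq


def select_targets(scores, k):
    """Return the k node ids with the highest priority scores, best first."""
    return heapq.nlargest(k, scores, key=scores.get)


scores = {"gpu-0": 15.0, "gpu-1": 22.5, "gpu-2": 9.0}
print(select_targets(scores, 2))  # ['gpu-1', 'gpu-0']
```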

In this way, by executing steps 301-302, the priority score of each GPU node is computed quantitatively, and suitable GPU nodes are screened out as target GPU nodes according to these priority scores. For example, in scenario one above, where the first GPU node set has enough GPU node computing resources to process the target service request, step 301 may be executed to calculate the priority score of each GPU node, and step 302 may then be executed to screen out the target GPU nodes in descending order of priority score. In scenario two, because the first GPU node set does not contain enough GPU nodes, the target service request waits in a queue maintained by the system while the large language model service system checks the load condition of the GPU nodes at a certain frequency and updates the first GPU node set; once the first GPU node set contains enough GPU nodes, step 301 may be executed to calculate the priority score of each GPU node, and step 302 may then be executed to screen out the target GPU nodes in descending order of priority score, so as to serve the queued target service request.

Further, heterogeneous GPU nodes may exist in the large language model service system. For example, in a scenario where heterogeneous GPU nodes are deployed as a cluster, taking one GPU per GPU node as an example, different GPU nodes correspond to different GPU types. Related technologies such as the vLLM framework do not consider the differences in service capability between different types of GPUs when distributing service requests to GPU nodes. In the embodiments of the present disclosure, on the basis of considering the actual load capacity of the GPU nodes and the cache characteristics of the large language model during service, the differences in service capability among heterogeneous GPU nodes are also taken into account, so that computing power is used in a more fine-grained manner.

Fig. 4 shows a flow chart of a load balancing method according to an embodiment of the present disclosure. Illustratively, the method may be performed by the large language model service system 10 or the load balancing device 102 of fig. 1 described above, as shown in fig. 4, and the method may include the steps of:

Step 401, obtaining a target service request.

This step is the same as step 201 in fig. 2 described above.

As an example, the user may also specify a large language model that handles the target service request, i.e., a target large language model, for example, multiple large language models may be preconfigured, and the user may select a large language model that provides the present reasoning service among the multiple large language models according to his/her own needs. As another example, a default large language model, i.e., a target large language model, that provides reasoning services for the user may be preconfigured.

Illustratively, the relevant parameters of the target service request may further include: the type of the large language model and the type of the GPU node.

As an example, when different GPU nodes are configured with different types of GPUs, the type of a GPU node may be represented by the model of its GPU, e.g., GPU nodes configured with a-type GPUs, GPU nodes configured with b-type GPUs, etc.

Illustratively, the types of GPU nodes correspond to different large language models. For example, for multiple large language models (such as large language models A, B and C), there are four types of GPU nodes, namely a-type GPU nodes, b-type GPU nodes, c-type GPU nodes, and d-type GPU nodes, where the number of GPU nodes of each type may be one or more. The a-type GPU nodes may be configured to correspond to large language model A, i.e., the a-type GPU nodes may execute the inference tasks of large language model A; the b-type GPU nodes may correspond to large language model B, i.e., the b-type GPU nodes may execute the inference tasks of large language model B; and the c-type and d-type GPU nodes may correspond to large language model C, i.e., the c-type and d-type GPU nodes may execute the inference tasks of large language model C.

Step 402, determining the resource type available for processing the target service request.

Illustratively, the type of resources available to process the target service request may be determined by a related parameter of the target service request; as one example, the resource types may include a type of GPU node and/or a type of large language model; the type of the large language model which can be used for processing the target service request is the type of the large language model indicated by the relevant parameters of the target service request, and the type of the GPU node which can be used for processing the target service request is the type of the GPU node indicated by the relevant parameters of the target service request.

Step 403, obtaining a second GPU node set.

Wherein the second GPU node set comprises: GPU nodes currently having service request processing capability. A GPU node currently having service request processing capability is a GPU node that is currently accessible and whose load condition is idle or not fully loaded.

Illustratively, a list of accessible GPU nodes in the large language model service system may be obtained, in which the load condition of each GPU node may be recorded; GPU nodes with service request processing capability can then be screened out according to the load condition of each GPU node to obtain the second GPU node set. The load condition of each GPU node may be determined by the available load capacity of the GPU node; for example, the load condition may include idle, working but not fully loaded, and fully loaded. Idle indicates that the GPU node is not currently executing any inference task, so its currently available load capacity equals its maximum load capacity; working but not fully loaded indicates that the GPU node is executing inference tasks and part of its load capacity is occupied, but some available load capacity remains; fully loaded indicates that the GPU node is executing inference tasks and the occupied load has reached the maximum load capacity of the GPU node, i.e., the available load capacity is zero.

As an example, the load capacity may be identified by a distributed semaphore. For example, a value of N for the distributed semaphore S corresponding to a GPU node indicates that the GPU node can currently accept N inference tasks, where N is an integer in the range 0 to M, and M is the maximum number of inference tasks that the GPU node can process simultaneously. For any GPU node, if the value of its distributed semaphore S is 0, the GPU node currently has no available load capacity, i.e., its load condition is fully loaded; if the value of S is M, the currently available load capacity of the GPU node is its maximum load capacity, i.e., its load condition is idle; if the value of S is strictly between 0 and M, the GPU node still has some available load capacity, i.e., its load condition is working but not fully loaded.
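The mapping from the semaphore value S to the load condition can be written down directly. In this sketch the parameter `max_tasks` plays the role of the maximum number of simultaneous inference tasks; the function name and the returned labels are illustrative assumptions.

```python
def load_state(s, max_tasks):
    """Classify a node's load condition from its semaphore value s."""
    if not 0 <= s <= max_tasks:
        raise ValueError("semaphore value out of range")
    if s == 0:
        return "fully loaded"
    if s == max_tasks:
        return "idle"
    return "working, not fully loaded"


# A node that can run at most four inference tasks at once.
for s in (0, 2, 4):
    print(s, load_state(s, 4))
# 0 fully loaded
# 2 working, not fully loaded
# 4 idle
```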

Because the volume of service requests in the large language model service system is usually large, the load condition of each GPU node changes frequently. Illustratively, the load condition of each GPU node in the large language model service system may be kept dynamically updated; for example, the load balancing device 102 may synchronize the load of each GPU node in real time through a distributed semaphore mechanism, so that the latest load condition of each GPU node can be obtained at any moment. For any GPU node, when the GPU node starts processing an inference task, the task occupies part of the load capacity of the GPU node, and the value of the distributed semaphore S of that GPU node is decreased by one; when the GPU node completes an inference task, it releases the load capacity occupied by that task, and the value of the distributed semaphore S of that GPU node is increased by one. In this way, the distributed semaphore serves as the identifier of the load capacity of each GPU node, and the current load condition of each GPU node can be tracked by updating its distributed semaphore, which improves the maintainability of the load of each GPU node in the large language model service system.
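The decrement-on-start and increment-on-completion bookkeeping can be sketched with an in-process counter. This is only a local stand-in: a real deployment would keep the counter in shared storage so that it is visible across machines, and the class name `NodeSemaphore` and its methods are hypothetical.

```python
import threading


class NodeSemaphore:
    """In-process stand-in for one node's distributed semaphore."""

    def __init__(self, max_tasks):
        self._max = max_tasks
        self._value = max_tasks  # an idle node starts at its maximum
        self._lock = threading.Lock()

    @property
    def value(self):
        with self._lock:
            return self._value

    def acquire(self):
        """Take one slot when a task starts; False if the node is full."""
        with self._lock:
            if self._value == 0:
                return False
            self._value -= 1
            return True

    def release(self):
        """Give one slot back when a task completes."""
        with self._lock:
            if self._value < self._max:
                self._value += 1


sem = NodeSemaphore(4)
sem.acquire()
sem.acquire()
print(sem.value)  # 2
sem.release()
print(sem.value)  # 3
```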

In one possible implementation manner, in this step, the obtaining a second GPU node set includes: and screening out the GPU node with the current service request processing capability from the candidate GPU node set according to the distributed semaphore corresponding to each candidate GPU node in the candidate GPU node set. Illustratively, the candidate set of GPU nodes may include GPU nodes configured by a large language model service system; through the distributed semaphores corresponding to the candidate GPU nodes, the GPU nodes with the service request processing capability at present can be screened out rapidly and accurately.

Illustratively, the resource type of each GPU node in the second GPU node set may also be obtained, for example, the type of each GPU node and/or the type of large language model to which each GPU node is adapted. As an example, the list of accessible GPU nodes may also record the resource type of each GPU node, so that the resource types of the GPU nodes with service request processing capability can be obtained from the GPU node list.

In this way, for heterogeneous GPU nodes, the service capabilities of the heterogeneous GPU nodes can be distinguished by the resource types of the GPU nodes.

Step 404, determining a GPU node matched with the resource type in the second GPU node set as the GPU node currently available for processing the target service request.

Illustratively, the matching may be performed according to the resource type of each GPU node in the second GPU node set. As an example, the resource types may include the type of GPU node and/or the type of large language model; the GPU nodes in the second GPU node set whose GPU node type is a type available for processing the target service request, and whose adapted large language model is the target large language model, are taken as all the GPU nodes currently available for processing the target service request, thereby obtaining the first GPU node set.
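The two filtering stages, semaphore-based screening followed by resource type matching in step 404, might look like the sketch below. The `GpuNode` record, its field names, and the choice to require both the GPU type and the model type to match (the text allows "and/or") are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuNode:
    """Hypothetical record for one candidate GPU node."""
    node_id: str
    gpu_type: str     # type of the GPU node
    model_type: str   # large language model the node is adapted to
    semaphore: int    # current value of the node's distributed semaphore S


def second_gpu_node_set(candidates: list[GpuNode]) -> list[GpuNode]:
    """Keep the candidates whose semaphore shows remaining load capacity."""
    return [n for n in candidates if n.semaphore > 0]


def first_gpu_node_set(
    nodes: list[GpuNode], allowed_gpu_types: set[str], target_model: str
) -> list[GpuNode]:
    """Keep the nodes whose resource type matches the target service request."""
    return [
        n for n in nodes
        if n.gpu_type in allowed_gpu_types and n.model_type == target_model
    ]


candidates = [
    GpuNode("gpu-0", "type-a", "model-x", semaphore=2),
    GpuNode("gpu-1", "type-a", "model-x", semaphore=0),  # no capacity left
    GpuNode("gpu-2", "type-b", "model-y", semaphore=3),  # adapted to another model
    GpuNode("gpu-3", "type-b", "model-x", semaphore=1),
]
usable = first_gpu_node_set(
    second_gpu_node_set(candidates), {"type-a", "type-b"}, "model-x"
)
usable_ids = [n.node_id for n in usable]
```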

Step 405, obtaining state information of each GPU node in the first GPU node set.

This step is the same as step 202 described above in fig. 2.

Step 406, screening at least one target GPU node from the first GPU node set based on the state information of each GPU node in the first GPU node set.

This step is the same as step 203 in fig. 2 described above.

In one possible implementation, in this step, after at least one target GPU node is screened out of the first GPU node set, the distributed semaphore of the at least one target GPU node is updated. Illustratively, when a plurality of target GPU nodes are screened out, the distributed semaphore of each target GPU node is updated; for example, the value of the distributed semaphore of each target GPU node may be decreased by 1.

Step 407, processing the target service request based on a target large language model by using the at least one target GPU node.

This step is the same as step 204 in fig. 2 described above.

In one possible implementation, in this step, the distributed semaphore of the at least one target GPU node is updated after the at least one target GPU node has completed processing the target service request. Illustratively, when a plurality of target GPU nodes process the target service request, the distributed semaphore of each target GPU node is updated when that node completes its corresponding reasoning task; for example, the value of the distributed semaphore of the target GPU node may be increased by 1.
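One way to keep the decrement at selection time and the increment at completion time paired is a context manager, sketched below. The helper name `reserved` and the plain dictionary of counters are hypothetical; restoring the count when a task fails is also an assumption, since the text only describes the update on completion.

```python
from contextlib import contextmanager


@contextmanager
def reserved(semaphores: dict[str, int], node_id: str):
    """Hold one unit of a node's load capacity for the duration of a task."""
    if semaphores[node_id] <= 0:
        raise RuntimeError(f"node {node_id} has no free capacity")
    semaphores[node_id] -= 1          # node selected as a target: S decreases by 1
    try:
        yield
    finally:
        semaphores[node_id] += 1      # task finished (assumed: or failed): S increases by 1


semaphores = {"gpu-0": 2, "gpu-1": 1}
with reserved(semaphores, "gpu-0"), reserved(semaphores, "gpu-1"):
    during = dict(semaphores)         # snapshot while both tasks are running
after = dict(semaphores)
```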

In this embodiment of the disclosure, the resource type of the GPU nodes available for processing the target service request is determined; the GPU nodes in the second GPU node set that match the resource type are determined as the GPU nodes currently available for processing the target service request; and at least one target GPU node is screened out according to the state information of the GPU nodes currently available for processing the target service request, namely the longest duration of each reasoning task currently executed by the GPU node and the number of times the GPU node has processed historical service requests in the session to which the target service request belongs. Therefore, in addition to considering the actual load capacity of the GPU nodes and the cache characteristics of the large language model during service, the differing service capabilities of heterogeneous GPU nodes are taken into account, so that computing power is used at a finer granularity, heterogeneous GPU nodes are scheduled based on multidimensional factors, service requests are distributed evenly, and the quality of the large language model reasoning service is further improved.

As one example, the load balancing device 102 may include a GPU node registration module, a message synchronization module, a node state maintenance module, and a load balancing scheduling module.

The GPU node registration module is used for managing each GPU node in the large language model service system and allows the other functional modules to query the information of each GPU node, which reduces coupling within the large language model service system. Illustratively, the information of each GPU node, such as the type of each GPU node and the large language model to which the GPU node is adapted, may be managed in a Redis key-value (KV) database. By way of example, the KV database may be wrapped behind an interface, and a GPU node registration interface may be provided through which a GPU node registers its own information, which facilitates the management of GPU nodes with different characteristics. To reduce the load on the KV database storage, GPU node registration and query may be implemented by means of snapshots, read-write isolation, subscription, notification and the like, provided that data correctness and consistency are maintained.
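A registration interface wrapped around a key-value store could be sketched as follows. The class `GpuNodeRegistry`, the `gpu_node:` key prefix, and the JSON record layout are hypothetical, and a plain dictionary stands in for the KV database so the example runs without a server.

```python
import json
from typing import Optional


class GpuNodeRegistry:
    """Hypothetical registration interface over a key-value store."""

    _PREFIX = "gpu_node:"   # assumed key layout, not taken from the disclosure

    def __init__(self) -> None:
        self._kv: dict[str, str] = {}   # in-memory stand-in for the KV database

    def register(self, node_id: str, gpu_type: str, model_type: str) -> None:
        """Called by a GPU node to publish its own information."""
        record = {"gpu_type": gpu_type, "model_type": model_type}
        self._kv[self._PREFIX + node_id] = json.dumps(record)

    def lookup(self, node_id: str) -> Optional[dict]:
        """Return the stored record of one node, or None if it is not registered."""
        raw = self._kv.get(self._PREFIX + node_id)
        return None if raw is None else json.loads(raw)

    def nodes_for_model(self, model_type: str) -> list[str]:
        """Return the ids of all nodes adapted to the given model, sorted."""
        return sorted(
            key[len(self._PREFIX):]
            for key, raw in self._kv.items()
            if json.loads(raw)["model_type"] == model_type
        )


registry = GpuNodeRegistry()
registry.register("gpu-0", "type-a", "model-x")
registry.register("gpu-1", "type-b", "model-y")
registry.register("gpu-2", "type-b", "model-x")
matches = registry.nodes_for_model("model-x")
```

Because callers only see `register`, `lookup` and `nodes_for_model`, the storage behind the interface can be swapped without touching the other modules, which is the decoupling the paragraph describes.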

The message synchronization module is used for handling the one-to-many working relationship between a model instance and the GPU nodes. It can pass messages asynchronously between different GPU nodes and between the load balancing scheduling module and the GPU nodes, thereby eliminating the additional overhead introduced by communication time. The message synchronization module is implemented using the Stream feature of the KV database, and the load balancing scheduling module and each GPU node synchronize reasoning results through message push and pull.
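The asynchronous push-pull exchange can be illustrated with in-process queues. In this sketch `asyncio.Queue` objects stand in for the streams of the KV database, and the function names `gpu_worker` and `dispatch`, the `None` shutdown marker, and the message format are all assumptions made for the example.

```python
import asyncio


async def gpu_worker(node_id: str, inbox: asyncio.Queue, results: asyncio.Queue) -> None:
    """Pull requests from this node's inbox and push reasoning results back."""
    while True:
        request = await inbox.get()
        if request is None:              # assumed shutdown marker
            break
        await results.put((node_id, f"{node_id} handled {request}"))


async def dispatch(request: str, node_ids: list[str]) -> dict[str, str]:
    """Push one request to every target node and collect one result per node."""
    results: asyncio.Queue = asyncio.Queue()
    inboxes = {n: asyncio.Queue() for n in node_ids}
    workers = [
        asyncio.create_task(gpu_worker(n, inboxes[n], results)) for n in node_ids
    ]
    for n in node_ids:
        await inboxes[n].put(request)    # push without waiting for a reply
    collected = dict([await results.get() for _ in node_ids])
    for n in node_ids:
        await inboxes[n].put(None)       # stop the workers
    await asyncio.gather(*workers)
    return collected


replies = asyncio.run(dispatch("req-1", ["gpu-0", "gpu-1"]))
```

The scheduler never blocks on an individual node: it pushes to every inbox first and only then pulls results in whatever order they arrive.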

The node state maintenance module is used for dynamically updating and maintaining the working state and the load condition of the GPU node, and can periodically detect the state information and the load condition of each GPU node in the large language model service system or report the changed state information or the changed load condition under the condition that the state information or the load condition of the GPU node changes. Illustratively, the state information and load conditions of the GPU nodes may be synchronized by a distributed semaphore mechanism.

The load balancing scheduling module is the main entry point through which a user obtains service from the large language model service system. As an example, fig. 5 illustrates a flowchart of a load balancing method according to an embodiment of the present disclosure. As shown in fig. 5, the load balancing scheduling module may receive a service request, obtain information about the GPU nodes from the GPU node registration module, and determine whether enough GPU nodes are available to process the service request. If not, the service request is queued; if so, the module obtains, from the node state maintenance module, the state information of the GPU nodes available to process the service request, screens out the target GPU nodes based on that state information, and distributes the service request to each target GPU node through the message synchronization module. After receiving the reasoning results of the target GPU nodes, the module summarizes them to generate the processing result of the service request and feeds the processing result back to the user.
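The decision flow of fig. 5 can be condensed into a short sketch. The function `handle_request`, its parameters, and the module-level `pending` queue are hypothetical; the `score` and `infer` callables stand in for the node state maintenance module and the GPU nodes, and the step that merges partial results is left open because the text does not specify it.

```python
from collections import deque
from typing import Callable, Optional

pending: deque = deque()   # service requests waiting for enough GPU nodes


def handle_request(
    request: str,
    available: list[str],
    needed: int,
    score: Callable[[str], float],
    infer: Callable[[str, str], str],
) -> Optional[list[str]]:
    """Queue the request, or run it on the `needed` best-scoring nodes."""
    if len(available) < needed:
        pending.append(request)                 # not enough nodes: queue it
        return None
    targets = sorted(available, key=score, reverse=True)[:needed]
    return [infer(node, request) for node in targets]   # per-node results


scores = {"gpu-0": 1.0, "gpu-1": 3.0, "gpu-2": 2.0}


def run(node: str, req: str) -> str:
    return f"{node}:{req}"


served = handle_request("req-1", list(scores), 2, scores.get, run)
queued = handle_request("req-2", ["gpu-0"], 2, scores.get, run)
```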

Based on the same inventive concept of the above method embodiments, embodiments of the present disclosure further provide a load balancing device, which may be used to execute the technical solutions described in the above method embodiments. For example, the steps of the load balancing method shown in fig. 2-5 described above may be performed.

Fig. 6 shows a block diagram of a load balancing apparatus according to an embodiment of the present disclosure. As shown in fig. 6, the apparatus includes: an acquisition module 601, configured to acquire a target service request; the acquisition module 601 is further configured to acquire state information of each GPU node in a first graphics processor (GPU) node set, where the first GPU node set includes the GPU nodes currently available for processing the target service request, and the state information of any GPU node includes: the longest duration of each reasoning task currently executed by the GPU node and the number of times the GPU node has processed historical service requests in the session to which the target service request belongs; a screening module 602, configured to screen at least one target GPU node from the first GPU node set based on the state information of each GPU node in the first GPU node set; and a processing module 603, configured to process the target service request based on a target large language model by using the at least one target GPU node.

In one possible implementation, the screening module 602 is further configured to: calculate a priority score of each GPU node based on the state information of each GPU node in the first GPU node set, where the priority score represents the degree of priority for processing the target service request; and screen at least one GPU node from the first GPU node set as the at least one target GPU node, in descending order of the priority scores of the GPU nodes.

In one possible implementation, the screening module 602 is further configured to: calculate the absolute value of the difference between the longest duration of each reasoning task currently executed by the first GPU node and the average duration of processing historical reasoning tasks, where the average duration of processing historical reasoning tasks represents the average time taken by the GPU node to complete a historical reasoning task; calculate the product of the number of times the first GPU node has processed historical service requests in the session to which the target service request belongs and the time saving corresponding to a historical service request, where the time saving corresponding to a historical service request represents the time saved, while the GPU node executes a reasoning task, by directly using the cache information corresponding to the historical service request instead of having the GPU node compute that cache information; and calculate the priority score of the first GPU node based on the absolute value of the difference and the product, where the priority score of the first GPU node is positively correlated with the absolute value of the difference and positively correlated with the product.

In one possible implementation, the screening module 602 is further configured to: perform a weighted summation of the absolute value of the difference and the product to obtain the priority score of the first GPU node.
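The weighted summation and the ranking by score can be sketched as follows. The `NodeState` fields, the function names, and the fixed default weights `w_load` and `w_cache` are assumptions for illustration; the disclosure lets the weights vary with the node's state, which this sketch does not model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeState:
    """Hypothetical state record of one GPU node (durations in seconds)."""
    node_id: str
    longest_running: float   # longest duration among tasks currently executing
    avg_history: float       # average duration of historical reasoning tasks
    session_hits: int        # historical requests of this session handled by the node
    saving_per_hit: float    # time saved by reusing cache for one historical request


def priority_score(s: NodeState, w_load: float = 1.0, w_cache: float = 1.0) -> float:
    """Weighted sum of the duration gap and the expected cache time saving."""
    load_term = abs(s.longest_running - s.avg_history)
    cache_term = s.session_hits * s.saving_per_hit
    return w_load * load_term + w_cache * cache_term


def top_nodes(states: list[NodeState], k: int) -> list[str]:
    """Return the ids of the k nodes with the highest priority scores."""
    ranked = sorted(states, key=priority_score, reverse=True)
    return [s.node_id for s in ranked[:k]]


states = [
    NodeState("gpu-0", 12.0, 10.0, 3, 0.5),   # 2.0 + 1.5
    NodeState("gpu-1", 10.0, 10.0, 0, 0.5),   # 0.0 + 0.0
    NodeState("gpu-2", 9.0, 10.0, 8, 0.5),    # 1.0 + 4.0
]
ranked_ids = top_nodes(states, 2)
```

With these sample values the node that has served the session most often ranks first, because its cache term dominates.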

In one possible implementation, the acquisition module 601 is further configured to: determine a resource type available for processing the target service request; obtain a second GPU node set, where the second GPU node set includes the GPU nodes that currently have service request processing capability; and determine the GPU nodes in the second GPU node set that match the resource type as the GPU nodes currently available for processing the target service request.

In one possible implementation, the screening module 602 is further configured to: determine the minimum number of GPU nodes corresponding to the target service request; and the screening of at least one target GPU node from the first GPU node set based on the state information of each GPU node in the first GPU node set includes: in the case that the minimum number of GPU nodes does not exceed the number of GPU nodes in the first GPU node set, screening out, based on the state information of each GPU node in the first GPU node set, the minimum number of GPU nodes from the first GPU node set as the at least one target GPU node; and/or, in the case that the minimum number of GPU nodes exceeds the number of GPU nodes in the first GPU node set, updating the first GPU node set, and, when the number of GPU nodes in the updated first GPU node set is not less than the minimum number of GPU nodes, screening out, based on the state information of each GPU node in the updated first GPU node set, the minimum number of GPU nodes from the updated first GPU node set as the at least one target GPU node.
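The branch on the minimum number of GPU nodes can be sketched as a small retry loop. The function `select_targets`, the `snapshot` callable that returns the current first GPU node set, and the `max_refreshes` bound are hypothetical; the text does not say how often the set is updated or what happens if it never becomes large enough, so returning an empty list is an assumption.

```python
from typing import Callable


def select_targets(
    snapshot: Callable[[], list[str]],
    score: Callable[[str], float],
    min_nodes: int,
    max_refreshes: int = 3,
) -> list[str]:
    """Pick `min_nodes` best-scoring nodes, refreshing the set if it is too small."""
    nodes = snapshot()
    for _ in range(max_refreshes):
        if len(nodes) >= min_nodes:
            break
        nodes = snapshot()               # update the first GPU node set
    if len(nodes) < min_nodes:
        return []                        # assumed fallback: nothing selected
    return sorted(nodes, key=score, reverse=True)[:min_nodes]


scores = {"gpu-0": 3.5, "gpu-1": 0.0, "gpu-2": 5.0}

# First snapshot is too small; the refreshed one has enough nodes.
snapshots = iter([["gpu-0"], ["gpu-0", "gpu-1", "gpu-2"]])
chosen = select_targets(lambda: next(snapshots), scores.get, min_nodes=2)

# The set never grows, so no targets are selected.
starved = select_targets(lambda: ["gpu-0"], scores.get, min_nodes=2)
```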

In one possible implementation, the acquisition module 601 is further configured to: screen out, from a candidate GPU node set, the GPU nodes that currently have service request processing capability, according to the distributed semaphore corresponding to each candidate GPU node in the candidate GPU node set; and the processing module 603 is further configured to: update the distributed semaphore corresponding to the at least one target GPU node.

In one possible implementation, the processing module 603 is further configured to: synchronize messages between the at least one target GPU node and the node issuing the target service request by using a key-value database as message middleware.

The technical effects and specific descriptions of the load balancing apparatus shown in fig. 6 and the various possible implementations thereof can be seen from the above load balancing method, and will not be repeated here.

It should be understood that the division of the modules in the above apparatus is only a division of logical functions; in actual implementation, the modules may be fully or partially integrated into one physical entity or may be physically separate. Furthermore, the modules in the apparatus may be implemented in the form of software invoked by a processor; for example, the apparatus includes a processor connected to a memory in which instructions are stored, and the processor invokes the instructions stored in the memory to perform any of the above methods or to implement the functions of the modules of the apparatus, where the processor is, for example, a general-purpose processor such as a CPU or a microprocessor, and the memory is located inside or outside the apparatus. Alternatively, the modules in the apparatus may be implemented in the form of hardware circuits, and the functions of some or all of the modules may be implemented through the design of the hardware circuits, where a hardware circuit may be understood as one or more processors; for example, in one implementation, the hardware circuit is an application-specific integrated circuit (ASIC), and the functions of some or all of the above modules are implemented by designing the logical relationships of the elements in the circuit; for another example, in another implementation, the hardware circuit may be implemented by a programmable logic device (PLD), for example, a field programmable gate array (FPGA), which may include a large number of logic gates whose interconnections are configured by a configuration file, so as to implement the functions of some or all of the above modules.
All modules of the above apparatus may be implemented in the form of software invoked by a processor, or in the form of hardware circuits, or partly in the form of software invoked by a processor with the remainder in the form of hardware circuits.

In the embodiments of the present disclosure, the processor is a circuit with signal processing capability. In one implementation, the processor may be a circuit capable of reading and executing instructions, such as a CPU, a microprocessor, a graphics processing unit (GPU), a digital signal processor (DSP), a neural-network processing unit (NPU), or a tensor processing unit (TPU); in another implementation, the processor may perform a function through the logical relationships of a hardware circuit, which are fixed or reconfigurable, for example, the processor is a hardware circuit implemented as an ASIC or a PLD, such as an FPGA. For a reconfigurable hardware circuit, the process in which the processor loads a configuration file to configure the hardware circuit can be understood as a process in which the processor loads instructions to implement the functions of some or all of the above modules.

It will be seen that each module in the above apparatus may be one or more processors (or processing circuits) configured to implement the methods of the above embodiments, for example: CPU, GPU, NPU, TPU, microprocessors, DSP, ASIC, FPGA, or a combination of at least two of these processor forms. In addition, all or part of the modules in the above apparatus may be integrated together or may be implemented independently, which is not limited.

The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.

The embodiment of the disclosure also provides an electronic device, which comprises: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing the instructions stored by the memory.

Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.

Fig. 7 illustrates a block diagram of an electronic device 1900 according to an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server or a terminal device. Referring to fig. 7, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by the processing component 1922. The application programs stored in memory 1932 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 1922 is configured to execute the instructions to perform the methods described above.

The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output interface 1958 (I/O interface). The electronic device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of electronic device 1900 to perform the methods described above.

The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.

The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.

The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including object oriented programming languages such as Smalltalk, C++ or the like, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized by using state information of the computer readable program instructions, and the electronic circuitry can execute the computer readable program instructions, thereby implementing aspects of the present disclosure.

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A method of load balancing, comprising:

Acquiring a target service request;

2. The method according to claim 1, wherein the screening at least one target GPU node from the first set of GPU nodes based on the state information of each GPU node in the first set of GPU nodes comprises:

3. The method of claim 2, wherein a priority score of a first GPU node is positively correlated with a longest duration of each inference task currently performed by the first GPU node, and positively correlated with a number of times the first GPU node processes a history service request in a session to which the target service request belongs; wherein the first GPU node is any GPU node in the first GPU node set.

4. The method according to claim 2, wherein calculating the priority score of each GPU node based on the state information of each GPU node in the first set of GPU nodes comprises:

5. The method of claim 4, wherein the calculating the priority score of the first GPU node based on the absolute value of the difference and the product comprises:

6. The method according to claim 1, wherein the method further comprises:

determining a resource type available to process the target service request;

7. The method of claim 6, wherein the resource types comprise: the type of GPU node and/or the type of large language model.

8. The method according to claim 1, wherein the method further comprises:

and/or,

9. The method of claim 5, wherein the weight of the absolute value of the difference is positively correlated with the maximum duration of each of the reasoning tasks currently performed by the first GPU node and the weight of the product is negatively correlated with the maximum duration of each of the reasoning tasks currently performed by the first GPU node, in the event that the maximum duration of each of the reasoning tasks currently performed by the first GPU node exceeds the average duration of the processing history reasoning tasks;

and/or,

10. The method of claim 6, wherein the obtaining a second set of GPU nodes comprises:

the method further comprises the steps of:

11. The method according to any one of claims 1-10, further comprising: synchronizing messages between the at least one target GPU node and the node issuing the target service request by using a key-value database as message middleware.

12. A load balancing apparatus, the apparatus comprising:

The acquisition module is used for acquiring the target service request;

13. An electronic device, comprising:

A processor;

A memory for storing processor-executable instructions;

wherein the processor is configured to implement the method of any one of claims 1 to 11 when executing the instructions stored by the memory.

14. A non-transitory computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 11.