CN111897697B - Server hardware fault repairing method and device - Google Patents


Info

Publication number
CN111897697B
CN111897697B (application CN202010801889.9A)
Authority
CN
China
Prior art keywords
server
replacement
hardware
servers
standby
Prior art date
Legal status
Active
Application number
CN202010801889.9A
Other languages
Chinese (zh)
Other versions
CN111897697A (en)
Inventor
李斯达
赵亮
刘晨科
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010801889.9A
Publication of CN111897697A
Application granted
Publication of CN111897697B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/30 Monitoring
    • G06F 11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F 11/3034 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a storage system, e.g. DASD based or network based
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operations
    • G06F 11/1446 Point-in-time backing up or restoration of persistent data
    • G06F 11/1448 Management of the data involved in backup or backup restore
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operations
    • G06F 11/1446 Point-in-time backing up or restoration of persistent data
    • G06F 11/1456 Hardware arrangements for backup
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/08 Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q 10/087 Inventory or stock management, e.g. order filling, procurement or balancing against orders

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Economics (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Mathematical Physics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Hardware Redundancy (AREA)

Abstract

Disclosed are a server hardware fault repair method and apparatus for a server group, wherein the servers in the server group store data for executing tasks and serve as hosts carrying virtual machine traffic. The method comprises: receiving server fault information for the server group, the fault information indicating that a specific server in the server group has a hardware fault; determining, based on the operation states of a plurality of standby servers, a candidate replacement server for replacing the specific server among the standby servers; determining, based on the configuration parameters of the specific server and of the candidate replacement servers, a replacement server matching the specific server among the candidates; outputting a complete machine replacement message for transferring the data for executing tasks stored in the specific server to the replacement server; and, once the specific server has been replaced by the replacement server, updating the server configuration parameters in the server group.

Description

Server hardware fault repairing method and device
Technical Field
The present application relates to cloud technology, and more particularly, to a method, an apparatus, a device, and a storage medium for repairing server hardware failure of a server group.
Background
During a server's operational life cycle, when a server executing a service fails or is about to fail, the failed server may need to be replaced as a complete machine. The replacement server must carry the same service data and service attributes as the original server. In conventional solutions, complete machine replacement requires contacting the manufacturer for on-site access and maintenance, which can take hours or even days. When the failed server is needed to perform tasks, such long repair times severely degrade the user experience. It is therefore desirable to provide a faster hardware fault repair method.
Disclosure of Invention
According to an aspect of the present application, there is provided a server hardware fault repair method for a server group, wherein the servers in the server group store data for executing tasks, the method comprising: receiving server fault information for the server group, the server fault information indicating that a specific server in the server group has a hardware fault; in response to the server fault information, acquiring configuration parameters of the specific server, and acquiring configuration parameters and operation states of a plurality of standby servers; determining, based on the operation states of the plurality of standby servers, a candidate replacement server for replacing the specific server among the standby servers; determining, based on the configuration parameters of the specific server and of the candidate replacement servers, a replacement server matching the specific server among the candidate replacement servers; outputting a complete machine replacement message for transferring the data for executing tasks stored in the specific server to the replacement server; and, once the specific server has been replaced by the replacement server, updating the server configuration parameters in the server group.
In some embodiments, the operation state includes a hardware state and a network connection state, and determining, based on the operation states of the plurality of standby servers, a candidate replacement server for replacing the specific server among the standby servers comprises: selecting a standby server from the plurality of standby servers, and determining it as a candidate replacement server when both its hardware state and its network connection state are normal.
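The selection rule above (both the hardware state and the network connection state must be normal) can be sketched as follows; the data model and field names are illustrative assumptions, as the patent does not define them.

```python
from dataclasses import dataclass

@dataclass
class StandbyServer:
    server_id: str
    hardware_ok: bool   # hardware state reported as normal
    network_ok: bool    # network connection state reported as normal

def select_candidates(standby_servers):
    """Keep only standby servers whose hardware and network states are both normal."""
    return [s for s in standby_servers if s.hardware_ok and s.network_ok]
```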
In some embodiments, the standby server is an idle server networked with the server group.
In some embodiments, the number of the plurality of standby servers is determined based on a vending weight, the availability of servers in the server group, and the model allowance of the servers, wherein the vending weight is a weighting parameter determined according to the marketing plan for the servers, the availability is a parameter indicating the failure rate of servers in the server group, and the model allowance is a parameter indicating the number of servers in the server group of the same model as the specific server.
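The patent names three sizing factors (vending weight, availability, model allowance) without giving a formula. One plausible combination, stated purely as an assumption, is to scale the same-model fleet size by its failure rate and weight the result by the sales plan:

```python
import math

def standby_pool_size(vending_weight, availability, model_allowance):
    """Illustrative sizing rule (the way the factors combine is an assumption):
    vending_weight  - weighting factor from the marketing plan
    availability    - failure-rate parameter of servers in the group
    model_allowance - number of same-model servers in the group
    Always keeps at least one standby server on hand.
    """
    return max(1, math.ceil(model_allowance * availability * vending_weight))
```

For example, 200 same-model servers with a 2% failure rate and a weight of 1.5 would yield a pool of 6.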
In some embodiments, the configuration parameters include at least one of: machine room unit information, product model information, equipment type information, hardware version information, and size information.
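Matching a candidate against the failed server on these configuration parameters reduces to a field-by-field comparison. A minimal sketch using dictionaries follows; the key names are hypothetical stand-ins for the parameters listed above:

```python
# Hypothetical keys standing in for: machine room unit, product model,
# equipment type, hardware version, and size information.
MATCH_KEYS = ("room_unit", "product_model", "device_type",
              "hardware_version", "size")

def matches(failed, candidate, keys=MATCH_KEYS):
    """A candidate matches when every compared configuration field is equal."""
    return all(failed.get(k) == candidate.get(k) for k in keys)

def pick_replacement(failed, candidates):
    """Return the first matching candidate, or None if no candidate matches."""
    return next((c for c in candidates if matches(failed, c)), None)
```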
In some embodiments, the hardware failover method further comprises, prior to the particular server being replaced by the replacement server: and performing hardware state check and network connection state check on the replacement server to determine whether the hardware state and the network connection state of the replacement server are normal.
In some embodiments, in a case that the hardware state and the network connection state of the replacement server are normal, replacing the specific server with the replacement server, and performing data transfer; and in the case that the hardware state or the network connection state of the replacement server is abnormal, reselecting the replacement server from the plurality of standby servers.
In some embodiments, after updating the server configuration parameters in the server group, the hardware failover method further comprises: and performing hardware state check and network connection state check on the replacement server to determine whether the hardware state and the network connection state of the replacement server are normal.
In some embodiments, the server configuration parameters in the server group are updated in case that the hardware state and the network connection state of the replacement server are normal, and the replacement server is reselected from the plurality of backup servers in case that the hardware state or the network connection state of the replacement server is abnormal.
In some embodiments, the hardware fault remediation method further comprises: in the hardware fault repair process, the alarm information for the specific server is disabled.
According to another aspect of the present application, there is also provided a server hardware failure repair apparatus for a server group, wherein servers in the server group store data for performing tasks, including: a failure information receiving unit configured to receive server failure information in the server group, the server failure information indicating that a particular server in the server group has a hardware failure; a parameter obtaining unit configured to obtain configuration parameters of the specific server in response to the server failure information, and obtain configuration parameters and operation states of a plurality of standby servers; a standby selection unit configured to determine a candidate replacement server for replacing the specific server among the plurality of standby servers based on an operation state of the plurality of standby servers; a parameter comparison unit configured to determine a replacement server matching the specific server among the candidate replacement servers based on the configuration parameters of the specific server and the configuration parameters of the candidate replacement servers; a direction output unit configured to output a complete machine replacement message for transferring data for executing a task stored in the specific server to the replacement server; and an information synchronization unit configured to update server configuration parameters in the server group in a case where the specific server is replaced by the replacement server.
In some embodiments, the operation state includes a hardware state and a network connection state, and determining, based on the operation states of the plurality of standby servers, a candidate replacement server for replacing the specific server among the standby servers comprises: selecting a standby server from the plurality of standby servers, and determining it as a candidate replacement server when both its hardware state and its network connection state are normal.
In some embodiments, the standby server is an idle server networked with the server group.
In some embodiments, the number of the plurality of standby servers is determined based on a vending weight, the availability of servers in the server group, and the model allowance of the servers, wherein the vending weight is a weighting parameter determined according to the marketing plan for the servers, the availability is a parameter indicating the failure rate of servers in the server group, and the model allowance is a parameter indicating the number of servers in the server group of the same model as the specific server.
In some embodiments, the configuration parameters include at least one of: machine room unit information, product model information, equipment type information, hardware version information, and size information.
In some embodiments, the hardware failover apparatus further comprises a state checking unit configured to perform a hardware state check and a network connection state check on the replacement server to determine whether the hardware state and the network connection state of the replacement server are normal, before the specific server is replaced by the replacement server.
In some embodiments, in a case that the hardware state and the network connection state of the replacement server are normal, replacing the specific server with the replacement server, and performing data transfer; and in the case that the hardware state or the network connection state of the replacement server is abnormal, reselecting the replacement server from the plurality of standby servers.
In some embodiments, the hardware failover apparatus further comprises a state checking unit configured to perform a hardware state check and a network connection state check on the replacement server after updating the server configuration parameters in the server group to determine whether the hardware state and the network connection state of the replacement server are normal.
In some embodiments, the server configuration parameters in the server group are updated in case that the hardware state and the network connection state of the replacement server are normal, and the replacement server is reselected from the plurality of backup servers in case that the hardware state or the network connection state of the replacement server is abnormal.
In some embodiments, the hardware failover device disables alert information for the particular server during a hardware failover process.
According to another aspect of the present application, there is also provided a hardware fault repair device, the device including a memory and a processor, wherein the memory has instructions stored therein, which when executed by the processor, cause the processor to perform the hardware fault repair method as described above.
According to another aspect of the present application, there is also provided a computer-readable storage medium having stored thereon instructions that, when executed by a processor, cause the processor to perform the hardware fault repair method as described above.
With the server-group-based server hardware fault repair method, apparatus, device, and storage medium provided by the present application, when a server serving as a host fails, complete machine replacement of the failed server can be carried out rapidly, greatly shortening the time needed to repair the fault.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without inventive effort. The following drawings are not intended to be drawn to scale; emphasis is instead placed on illustrating the principles of the application.
FIG. 1 illustrates an exemplary scenario diagram for hardware failover of a server according to the present application;
FIG. 2A shows an illustrative process for hardware failover of a server;
FIG. 2B shows an illustrative process for hardware failover of a server in accordance with an embodiment of the application;
FIG. 3 shows a schematic process of a server hardware failover method for a server group in accordance with an embodiment of the application;
FIG. 4 shows an exemplary process of determining candidate replacement servers according to an embodiment of the application;
FIG. 5 illustrates an exemplary process of comparing configuration parameters of two servers according to an embodiment of the application;
FIG. 6A shows an illustrative process for replacing a failed server with a replacement server in accordance with an embodiment of the present application;
FIG. 6B shows an illustrative process of changing the operational state of a server in accordance with an embodiment of the present application;
FIG. 7 shows an illustrative process of performing a secondary check on a replacement server in accordance with an embodiment of the present application;
FIG. 8 shows a schematic block diagram of a server hardware failover apparatus for a server group in accordance with an embodiment of the application;
FIG. 9 shows a schematic diagram of an implementation of a hardware failover scheme in accordance with an embodiment of the application; and
FIG. 10 illustrates an architecture of a computing device according to an embodiment of the application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are also within the scope of the application.
As used in the specification and in the claims, the terms "a," "an," and/or "the" do not denote only the singular and may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that explicitly identified steps and elements are included; they do not constitute an exclusive list, and a method or apparatus may also include other steps or elements.
Although the present application makes various references to certain modules in a system according to embodiments of the present application, any number of different modules may be used and run on a user terminal and/or server. The modules are merely illustrative, and different aspects of the systems and methods may use different modules.
A flowchart is used in the present application to describe the operations performed by a system according to embodiments of the present application. It should be understood that the operations are not necessarily performed precisely in the order shown. Rather, the various steps may be processed in reverse order or simultaneously, as desired, and other operations may be added to or removed from these processes.
Cloud technology refers to a hosting technology that unifies hardware, software, network, and other resources in a wide area network or a local area network to realize the computation, storage, processing, and sharing of data.
Cloud technology is also a general term for the network, information, integration, management-platform, and application technologies applied under the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. Background services of technical network systems, such as video websites, picture websites, and other portal websites, require large amounts of computing and storage resources. With the rapid development of the internet industry, each article may in the future carry its own identification mark, which must be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data require strong system backing, which can only be realized through cloud computing.
Cloud computing is a computing model that distributes computing tasks across a resource pool formed by a large number of computers, enabling various application systems to acquire computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud." From the user's perspective, resources in the cloud can be expanded infinitely, acquired at any time, used on demand, expanded at any time, and paid for according to use.
As a basic capability provider of cloud computing, a cloud computing resource pool (cloud platform, generally called an IaaS (Infrastructure as a Service) platform) is established, in which multiple types of virtual resources are deployed for external clients to select and use.
In a cloud computing service, when a server (i.e., a host) carrying virtual machine traffic fails, a fault repair operation needs to be performed. In an application scenario in which all data required for executing a service is kept in cloud storage, the host only carries out the computation. In this case, if the host has a hardware fault requiring repair, another idle server in the network formed by the server group may be selected to take over the function of the failed host. However, such a solution must be supported by a cloud storage service; that is, the data required for executing tasks must be kept in cloud storage. Then, after another idle server is selected to take over the functions of the failed host, the replacement server can read the data stored in the cloud storage system and continue to perform the tasks.
However, in some application scenarios, it is not desirable to store the data required to perform a task by way of cloud storage. For example, when the amount of data required for executing a task is large and the data needs to be repeatedly accessed when executing the task, storing the data required for executing the task in a local storage manner is advantageous for executing the task. The data may be stored in a storage device of the host. For example, in applications such as a supercomputer, various data implementing a computing service delegated by a user may be stored locally at the supercomputer (e.g., in a host). In this case, when the host computer fails in hardware, the failure repair cannot be achieved by the aforementioned manner supported by the cloud storage service.
In view of the above, the present application provides a method for repairing a server hardware failure of a server group.
FIG. 1 illustrates an exemplary scenario diagram for hardware failover of a server according to the present application. As shown in fig. 1, the system 100 may include a user terminal 110, a network 120, a server (group) 130.
The user terminal 110 may be, for example, the computer 110-1 or the mobile phone 110-2 shown in FIG. 1. It will be appreciated that the user terminal may in fact be any other type of electronic device capable of data processing, including but not limited to a desktop computer, a notebook computer, a tablet computer, a smart phone, a smart home device, a wearable device, and the like.
A user may access network 120 using user terminal 110. For example, user terminal 110 may access network 120 via any wired or wireless means. The user may create commands on the user terminal 110 and send control instructions to the server 130 via the network 120 to perform tasks. The tasks referred to herein may be any computing and/or storage tasks that the server 130 is capable of performing, such as image processing, file storage, instant messaging, batch computing, and load balancing.
Network 120 may be a single network or a combination of at least two different networks. For example, network 120 may include, but is not limited to, one or a combination of several of a local area network, a wide area network, a public network, a private network, and the like.
The server 130 may be a server group, with each server within the group connected via a wired or wireless network. The server group may be centralized, such as a data center, or distributed. The server 130 may be local or remote.
In some embodiments, the system 100 may also include a database (not shown). A database broadly refers to a device having a storage function. The database is mainly used to store the various data utilized, generated, and output in the operation of the server 130. The database may include various memories, such as random access memory (RAM) and read-only memory (ROM). The storage devices mentioned above are merely examples; the storage devices usable by the system are not limited to them.
In some embodiments, the database may be integrated in the server 130. It will be appreciated that other storage devices for storing data may also be included in the system, through connection to the server and/or user terminal.
Fig. 2A shows an illustrative process for hardware failover of a server.
As shown in fig. 2A, in block 201, a host failure is detected.
In block 202, conventional fault repair means are applied in an attempt to repair the host. Conventional repair means are any available means that can restore host functionality without a complete machine replacement. For example, repair may be attempted according to a predefined repair policy: the hardware components of the host (such as the power supply, hard disk, motherboard, and memory) may be examined in turn. In some examples, repair of the hardware fault may be attempted by replacement: the hardware components of the host may be replaced in a predetermined order, and if the fault can be fixed by simply replacing a component, no complete machine replacement is required. In that case, the conventional repair means are effective.
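The ordered component-replacement attempt described above can be sketched as a loop over components; `replace_and_test` is a hypothetical stand-in for the physical replace-then-retest step:

```python
def conventional_repair(host, component_order, replace_and_test):
    """Walk hardware components in a predetermined order and stop at the first
    component whose replacement clears the fault. Returns that component, or
    None if conventional repair failed and complete machine replacement is needed."""
    for component in component_order:
        if replace_and_test(host, component):
            return component
    return None
```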
In block 203, it is determined whether the conventional repair means were effective. If so, the host has been repaired conventionally and can continue to be used after the repair.
In block 204, when the conventional repair means were not effective, it may be concluded that the host cannot be repaired by a simple operation; a full inspection of the host would be needed to repair the fault. To reduce the impact of the host fault on users, the repair can instead be carried out through complete machine replacement: a new physical machine replaces the failed host and takes over the functions of the original host.
In block 205, the failover to the failed host is complete. As previously described, the failover of the failed host may be accomplished by regular repair or by complete machine replacement.
FIG. 2B shows an illustrative process for hardware failover of a server in accordance with an embodiment of the application.
As shown in fig. 2B, in the hardware fault repair process provided by the present application, in block 206, it is determined that a specific server in the server group has failed. In some embodiments, a specific server may refer to a single server in the server group. In other embodiments, a specific server may also refer to two or more servers in the server group.
In block 207, it is determined whether the hardware fault of the specific server needs to be resolved by complete machine replacement. For example, blocks 202-203 in fig. 2A may be used to determine whether the failed host can be repaired by conventional means. If it cannot, it is determined that the fault must be repaired by complete machine replacement.
If it is determined that the fault is to be repaired by complete machine replacement (yes), the flow proceeds to block 208, where a standby server for the complete machine replacement of the specific server is selected from one or more standby servers. The flow then proceeds to block 209, where the complete machine replacement is performed for the specific server. The process of blocks 208-209 is described in detail below in conjunction with fig. 3 and is not elaborated here.
If it is determined in block 207 that the fault of the specific server need not be repaired by complete machine replacement (no), the flow may proceed to block 210 for a repeated check of whether complete machine replacement is needed. If the repeated check in block 210 determines that it is (yes), the flow proceeds to blocks 208 and 209.
If the repeated check in block 210 determines that the fault of the specific server need not be repaired by complete machine replacement (no), the flow may proceed to block 211, ending the hardware fault repair flow.
Fig. 3 shows a schematic process of a server hardware fault repair method for a server group according to an embodiment of the application, wherein the servers in the server group may be hosts carrying virtual machine traffic and store the various data required for executing that traffic.
In step S302, server failure information in the server group may be received, the server failure information indicating that a particular server in the server group has failed in hardware. In some embodiments, the server failure information indicates that the hardware failure of the particular server cannot be resolved by conventional failure repair means, requiring complete machine replacement.
In step S304, in response to the server failure information, configuration parameters of a specific server may be obtained, and configuration parameters and operation states of a plurality of standby servers may be obtained.
In some embodiments, the standby server may be an idle server networked with a server group.
In some embodiments, the configuration parameters of the server may be obtained by accessing system information of the server. The configuration parameters of the server may be at least one of: machine room unit information, product model information, equipment type information, hardware version information, and size information. The room unit information indicates the location of the machine room unit where the server is located. For example, the room unit information may be the number of the room unit. The product model information may be identification information of a machine model defined by a hardware vendor. For example, the product model information may be a character string indicating the machine model. The device type information refers to vendor-defined standard device types for providing cloud services. For example, the device type information may include entry-level devices, workgroup-level devices, department-level devices, enterprise-level devices, and the like. The hardware version information may be version information of a hardware component of the server, for example, version information of a hardware component such as a central processing unit (CPU), a memory, or a disk array (RAID). The size information may be information on the physical size of the server. For example, the size information may be represented as the size of the rack space occupied by the server.
The operational state may include at least one of a hardware state and a network connection state of the server. The hardware state may indicate whether the server is able to perform operations normally. In some embodiments, the machine room environment in which the server group is located may be monitored through an Intelligent Platform Management Interface (IPMI) to obtain information about the hardware state of the server. The network connection status may indicate whether the server is capable of network communication in the server group. In some embodiments, the network connection status of the server may be obtained through a heartbeat mechanism.
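As an illustrative sketch of the two state checks described above (the sensor-reading format, field names, and the heartbeat timeout are assumptions for illustration, not specified in the application):

```python
import time

HEARTBEAT_TIMEOUT_S = 30  # assumed threshold; the application does not specify one

def hardware_state_ok(ipmi_sensor_readings: dict) -> bool:
    """Treat the hardware state as normal when no IPMI sensor reports a fault.

    `ipmi_sensor_readings` maps sensor names to a status string, as might be
    parsed from IPMI sensor-data-record output (hypothetical format).
    """
    return all(status == "ok" for status in ipmi_sensor_readings.values())

def network_state_ok(last_heartbeat_ts: float, now: float = None) -> bool:
    """Treat the network connection as normal when the most recent heartbeat
    arrived within the timeout window (a simple heartbeat mechanism)."""
    if now is None:
        now = time.time()
    return (now - last_heartbeat_ts) <= HEARTBEAT_TIMEOUT_S
```

In a real deployment the sensor readings would come from an IPMI client and the heartbeat timestamps from the server group's monitoring service; only the decision logic is sketched here.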
In step S306, a candidate replacement server for replacing the specific server may be determined among the plurality of backup servers based on the operation states of the plurality of backup servers.
In some embodiments, for each of the plurality of backup servers, where the hardware state indicates that the hardware state of the backup server is normal and the network connection state indicates that the network connection state of the backup server is normal, the backup server may be determined to be a candidate replacement server.
By screening among the plurality of backup servers using the operating state, a backup server that can normally operate and that normally communicates in the server group can be determined as a candidate replacement server. Hereinafter, an exemplary process of determining the candidate replacement server in step S306 will be described with specific reference to fig. 4.
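The screening in step S306 can be sketched as a simple filter over the standby pool (the dictionary keys `hw_ok` and `net_ok` are hypothetical stand-ins for the hardware state and network connection state):

```python
def select_candidates(backup_servers):
    """Keep only standby servers whose hardware state and network
    connection state are both normal (step S306)."""
    return [s for s in backup_servers if s["hw_ok"] and s["net_ok"]]
```
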
In step S308, a replacement server matching the specific server may be determined among the candidate replacement servers based on the configuration parameters of the specific server and the configuration parameters of the candidate replacement servers.
In some embodiments, the configuration parameters of the particular server that failed may be compared to the configuration parameters of the candidate replacement server. If the two match, then the candidate replacement server may be considered a replacement server that matches the particular server.
In some implementations, if there are multiple replacement servers in the candidate replacement servers that match the particular server, a replacement server for replacing the failed particular server may be determined from the multiple matching replacement servers based on predefined rules. For example, a first candidate replacement server that matches the particular server may be determined as the replacement server for replacing the failed particular server. For another example, the candidate replacement server that matches the particular server to the highest degree may be determined as the replacement server for replacing the failed particular server.
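The two predefined tie-breaking rules named above (first match, or highest match degree) can be sketched as follows; the `match_degree` field is a hypothetical score, since the application does not define how the degree of matching is quantified:

```python
def pick_replacement(matches, rule="first"):
    """Choose one replacement server from multiple matching candidates.

    rule="first": take the first candidate that matched.
    rule="best":  take the candidate with the highest match degree
                  (hypothetical `match_degree` field).
    """
    if not matches:
        return None
    if rule == "first":
        return matches[0]
    if rule == "best":
        return max(matches, key=lambda s: s["match_degree"])
    raise ValueError(f"unknown rule: {rule}")
```
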
In step S310, a complete machine replacement message for transferring the data for performing tasks stored in the specific server to the replacement server is output. In some embodiments, the complete machine replacement message may include identification information indicating the replacement server and information on the specific operations to perform. For example, the complete machine replacement message may be sent to the site where the server group is located, and the operation of transferring the data for performing tasks to the replacement server may be performed by an operator on site. In some embodiments, the operator may be instructed to remove the storage device storing the data for performing tasks from the specific server and install it on the replacement server indicated in the complete machine replacement message.
In some embodiments, after the operator has completed the transfer of data for performing the task to the replacement server, feedback information may be received and hardware information, such as a MAC address, of the replacement server may be updated in response to the feedback information.
In step S312, in the case where a specific server is replaced with a replacement server, the server configuration parameters in the server group may be updated.
In some embodiments, the server configuration parameters may include hardware configuration parameters and software configuration parameters. In some implementations, the physical chassis information of the failed particular server and the chassis information of the replacement server can be swapped in a configuration management database (Configuration Management Database, CMDB) to avoid errors in the CMDB information. In other implementations, the corresponding fixed resource relationship of the virtual machine on a specific server (i.e., the original host) may be mapped to the replacement server in a vending information base of the cloud service, so that the user can normally use the functions of the virtual machine after the hardware fault repair process is completed.
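A minimal sketch of the CMDB swap described above, modeling the CMDB as a plain dict from server id to chassis record (a real CMDB would be updated through its own API; this only illustrates the exchange that keeps the records consistent):

```python
def swap_chassis_info(cmdb: dict, failed_id: str, replacement_id: str) -> None:
    """Swap the physical chassis (rack-position) records of the failed
    server and the replacement server so the CMDB matches reality after
    the complete machine replacement."""
    cmdb[failed_id], cmdb[replacement_id] = cmdb[replacement_id], cmdb[failed_id]
```
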
Next, an exemplary process of determining a candidate replacement server at step S306 in fig. 3 will be described with reference to fig. 4. An exemplary process of determining candidate replacement servers according to an embodiment of the application is shown in fig. 4.
In block 402, the inventory condition of the standby library 401 may be determined. If it is determined in block 402 that the inventory of the standby library 401 is not empty, the flow may proceed to block 404.
If it is determined in block 402 that the inventory of the standby library is less than the reference number C, the flow proceeds to block 403 to replenish the inventory of the standby library. In block 403, a dispatch instruction may be issued, and stock adjustment keeps the inventory of the standby library 401 no smaller than the reference number C.
In some embodiments, the reference number of standby servers contained in the standby library may be determined based on a preset standby scheduling algorithm. In some examples, the reference number C of standby servers in the standby library may be determined based on the selling weight a, the availability B of the servers in the server group, and the model allowance D of the servers. Wherein the selling weight a is a parameter determined according to a market plan of servers, the availability B is a parameter indicating a failure rate of servers in the server group, and the model allowance D is a parameter indicating the number of servers of the same model as a particular server that has failed in the current server group.
The reference number C of standby servers in the standby library may be calculated based on equation (1):
C = D * (1 + A) * B    (1)
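Equation (1) can be evaluated directly; rounding the result up to a whole machine is an assumption here, since the application does not state how fractional values are handled:

```python
import math

def reference_count(selling_weight_a: float, availability_b: float,
                    model_allowance_d: int) -> int:
    """Reference number C of standby servers per equation (1):
    C = D * (1 + A) * B, rounded up to a whole server (assumed)."""
    return math.ceil(model_allowance_d * (1 + selling_weight_a) * availability_b)
```

For example, with D = 100 servers of the failed model, selling weight A = 0.1, and failure-rate parameter B = 0.02, the standby library should hold at least 3 servers of that model.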
In block 404, a standby server may be selected in the standby library 401.
In block 405, the hardware status of the selected standby server may be checked. If the hardware state is abnormal, the flow returns to the standby library 401 to reselect. If the hardware state is normal, the flow proceeds to block 406.
In block 406, the network connection status of the standby server may be checked. For example, a heartbeat packet may be sent to the server to check whether the connection is normal. For another example, it may be checked whether the login status of the server in the server group is normal. If the check in block 406 indicates that the network connection status of the standby server is abnormal, the flow returns to the standby library 401 to reselect. If the network connection state is normal, the flow proceeds to block 407.
In block 407, the backup server that passed the check may be determined as a candidate replacement server.
If the operating states of all the standby servers in the standby library are abnormal, it can be considered that no suitable standby server exists in the standby library, and the rapid hardware repair procedure is terminated.
Next, an exemplary process of determining a replacement server that matches a specific server in step S308 will be described with specific reference to fig. 5. Fig. 5 illustrates an exemplary process of comparing configuration parameters of two servers according to an embodiment of the present application.
As shown in fig. 5, in block 501, an exemplary process of comparing configuration parameters of two servers begins.
In block 502, it may be compared whether the machine room unit information of the two servers matches. When the room units of the two servers are consistent, the room unit information of the two servers may be considered matched, and the flow proceeds to block 503. If the room units of the two servers do not match, the flow proceeds to block 507 to determine whether to compare again. In some embodiments, if the server group allows other room units that match the room unit in which the specific server is located, the flow may return to block 502 to compare again; that is, it may be compared whether the room unit information of the candidate replacement server matches one of those other allowed room units.
In block 503, the product model information of the two servers may be compared. If the product model information of the two servers matches, the flow may proceed to block 504. If the product model comparison fails, the flow proceeds to block 507 to determine whether to compare again. In some embodiments, the product model information of the two servers may be considered matched when it is identical. In other embodiments, there may be at least two different product models that match the product model of the specific server; accordingly, if it is determined in block 503 that the product model information of the two servers does not match, the flow may also proceed to block 507 to attempt a re-comparison.
In block 504, the device type information of the two servers may be compared. If the device types of the two servers match, the flow may proceed to block 505. If the device type comparison fails, the flow proceeds to block 507 to determine whether to compare again. In some embodiments, the device type information of the two servers may be considered matched when it is identical. In other embodiments, there may be at least two different device types that match the device type of the specific server; thus, if it is determined in block 504 that the device type information of the two servers does not match, the flow may also proceed to block 507 to attempt a re-comparison.
In block 505, the hardware version information of the two servers may be compared. If the hardware versions of the two servers match, the flow may proceed to block 506. If the hardware version comparison fails, the flow proceeds to block 507 to determine whether to compare again. In some embodiments, the hardware version information of the two servers may be considered matched when it is identical. In other embodiments, there may be at least two different hardware versions that match the hardware version of the specific server; thus, if it is determined in block 505 that the hardware version information of the two servers does not match, the flow may also proceed to block 507 to attempt a re-comparison.
In block 506, the size information of the two servers may be compared. If the sizes of the two servers match, the flow may proceed to block 508. If the size comparison fails, the flow proceeds to block 507 to determine whether to compare again. In some embodiments, the size information of the two servers may be considered matched when it is identical. In other embodiments, there may be at least two different sizes that match the size of the specific server; thus, if it is determined in block 506 that the size information of the two servers does not match, the flow may also proceed to block 507 to attempt a re-comparison.
In block 508, when all configuration parameters of the particular server and the candidate replacement server are determined to match, the candidate replacement server may be determined to be a replacement server that matches the particular server.
In block 509, the exemplary process of comparing the configuration parameters of the two servers ends.
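The cascade of blocks 502-506 reduces to checking every configuration parameter in order; as a sketch (field names are hypothetical, and strict equality stands in for the looser "compatible value" matching the text also allows):

```python
# Comparison order mirrors blocks 502-506 of fig. 5.
MATCH_FIELDS = ("room_unit", "product_model", "device_type",
                "hardware_version", "size")

def configs_match(failed: dict, candidate: dict) -> bool:
    """Return True only when every configuration parameter of the candidate
    replacement server matches that of the failed specific server."""
    return all(failed[f] == candidate[f] for f in MATCH_FIELDS)
```
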
Returning to FIG. 3, in the process illustrated in fig. 3, the particular server being replaced may be rebooted multiple times to carry out the foregoing inspection and replacement operations. To avoid warnings triggered by this apparently abnormal state, the alert information for the specific server and/or the replacement server may be disabled during the process shown in fig. 3. For example, in the hardware fault repair process provided by the application, the operation state of the specific server can be set to "in replacement", and a server in this state is configured not to trigger a warning until the hardware fault repair process is completely finished. When the hardware fault repair process is completed, the alert information of the replacement server can be re-enabled. For example, the operation state of the replacement server may be restored to a normal state, such as "in operation".
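The alert gating described above can be sketched as a small state machine (class and method names are illustrative, not from the application):

```python
class ManagedServer:
    """A server whose operational state gates its alerts: while the state
    is "in replacement", anomalies (e.g. reboots during the replacement
    flow) do not raise warnings."""

    def __init__(self, server_id: str):
        self.server_id = server_id
        self.state = "in operation"

    def begin_replacement(self):
        self.state = "in replacement"

    def finish_replacement(self):
        self.state = "in operation"

    def should_alert(self, anomaly_detected: bool) -> bool:
        return anomaly_detected and self.state != "in replacement"
```
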
Fig. 6A shows an illustrative process for replacing a failed server with a replacement server in accordance with an embodiment of the present application.
At block 601, the replacement process begins.
At block 602, a complete machine replacement message may be output to a field operator, indicating that the data for performing tasks is to be transferred to the replacement server. The data transfer operation may be accomplished manually by an on-site operator or by a robot.
In block 603, it may be determined whether the data transfer is complete. If the data transfer is not complete, the flow may proceed to block 604, where operation and maintenance personnel intervene and assist in completing the data transfer process.
If the data transfer is complete, the flow may proceed to block 605. At block 605, server configuration parameters in the server group may be updated to achieve information synchronization once the specific server has been replaced by the replacement server.
After the information synchronization is complete, the flow may proceed to block 606 to perform a status check on the replacement server that performs tasks in place of the original failed server, to determine whether the hardware status and network connection status of the replacement server are normal.
After the status check passes, it may proceed to block 607 where the replacement process ends.
Fig. 6B shows an exemplary process of changing the operational state of a server according to an embodiment of the present application.
As shown in fig. 6B, in block 611, the flow begins. For example, after step S310 of fig. 3 outputs a complete machine replacement message for transferring data for performing a task stored in a specific server to the replacement server, the process shown in fig. 6B may be started.
In block 612, the operational status of the particular server and the replacement server for replacing the particular server is adjusted to "in-replacement," thereby disabling alert information for the particular server and the replacement server. For example, after the entire machine replacement message is output in step S310 of fig. 3, the operation states of the specific server and the replacement server for replacing the specific server are automatically adjusted to "in replacement".
After the operational status of the replacement server is adjusted to "in replacement", the field operator may perform the operations of dashed blocks 613 and 614, shown on the right side of the flow of fig. 6B, to carry out the data transfer. In block 613, the data transfer operation for the specific server may be completed in response to the complete machine replacement message output in step S310 of fig. 3. The field operator may be either a human or a robot.
In block 614, the replacement server may be subjected to an acceptance check by an operator on site. If the acceptance check passes, the flow may proceed to block 615. If the acceptance check does not pass, the replacement server can be debugged and configured according to the actual situation so that it can work normally.
In block 615, the operational status of the replacement server may be adjusted to "in operation" to enable alert information for the replacement server.
In block 616, the process of changing the operational state of the server is ended.
Referring back to fig. 3, in some embodiments, the operation state of the replacement server may be checked a second time before step S310 and before step S312, that is, before outputting the complete machine replacement message that instructs the replacement operation, and again after the replacement operation has been completed. In other words, although the operation state has already been checked when determining the candidate replacement server in step S306, it may be checked again whether the hardware state and the network connection state of the replacement server are normal, in order to ensure that the replacement server can operate normally after the hardware fault repair is complete.
Fig. 7 shows an exemplary process of performing a secondary check of the hardware status and network connection status of a replacement server according to an embodiment of the present application.
In block 701, a secondary inspection process begins.
As shown in fig. 7, in block 702, a hardware status check and a network connection status check may be performed on the replacement server. If the check indicates that the hardware status and network connection status of the replacement server are normal, then proceed to block 703 where the secondary check flow ends.
If the check in block 702 indicates that the hardware state or the network connection state of the replacement server is abnormal, the flow may proceed to block 704.
In block 704, the current operation type of the replacement server may be determined. If the replacement server has not yet performed the data transfer operation, the flow proceeds to block 705 to rescreen the standby library for a replacement server matching the particular server. If there is no replacement server that matches the particular server, the fault repair process ends.
If a data transfer operation has already been performed on the replacement server, the flow may proceed to block 706 to restart the fault repair process shown in FIG. 3. If re-execution of the fault repair process fails, the process may end.
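The dispatch logic of the secondary check in fig. 7 can be condensed into a single decision function (return values are illustrative labels, not from the application):

```python
def secondary_check_action(hw_ok: bool, net_ok: bool,
                           data_transferred: bool) -> str:
    """Decide the next step after the secondary check: pass when both
    states are normal; otherwise rescreen the standby library if data has
    not yet been moved, or restart the whole repair flow if it has."""
    if hw_ok and net_ok:
        return "pass"
    return "restart_repair_flow" if data_transferred else "rescreen_standby"
```
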
By using the server hardware fault repair method for a server group described above, complete machine replacement of a failed server can be realized by selecting, from the standby library, a replacement server matching the failed server. By transferring the data for executing tasks stored in the failed server to the replacement server, the tasks of the virtual machine can be executed by the replacement server instead of the failed server. With this method, when a server serving as the host of a virtual machine fails, a standby machine that can replace the failed server can be quickly determined and normal service of the server group restored, without spending a great deal of time analyzing the specific failure of the server. In many cases, the host must simulate the service scenario under no-load conditions before the root cause of the hardware failure can be analyzed. Because this method removes the failed server from the server group, the failed server can be quickly emptied, providing test conditions for analyzing the failure cause and improving failure analysis efficiency and accuracy.
Fig. 8 shows a schematic block diagram of a server hardware fault repair apparatus for a server group in accordance with an embodiment of the application.
As shown in fig. 8, the apparatus 800 may include a fault information receiving unit 810, a parameter acquiring unit 820, a standby selecting unit 830, a parameter comparing unit 840, a direction output unit 850, an information synchronizing unit 860, and a status checking unit 870.
The failure information receiving unit 810 may be configured to receive server failure information indicating that a particular server in the server group has failed in hardware. In some embodiments, the server failure information indicates that the hardware failure of the particular server cannot be resolved by conventional failure repair means, requiring complete machine replacement. In some embodiments, a particular server may refer to a single server in a server group. In other embodiments, a particular server may also refer to two or more servers in a server farm.
The parameter acquisition unit 820 may be configured to acquire configuration parameters of a specific server and acquire configuration parameters and operation states of a plurality of standby servers in response to server failure information.
In some embodiments, the standby server may be an idle server networked with a server group.
In some embodiments, the configuration parameters of the server may be obtained by accessing system information of the server. The configuration parameters of the server may be at least one of: machine room unit information, product model information, equipment type information, hardware version information, and size information.
The operational state may include at least one of a hardware state and a network connection state of the server. The hardware state may indicate whether the server is able to perform operations normally. In some embodiments, the machine room environment in which the server group is located may be monitored through an Intelligent Platform Management Interface (IPMI) to obtain information about the hardware state of the server. The network connection status may indicate whether the server is capable of network communication in the server group. In some embodiments, the network connection status of the server may be obtained through a heartbeat mechanism.
The standby selection unit 830 may be configured to determine a candidate replacement server for replacing a specific server among the plurality of standby servers based on the operation states of the plurality of standby servers.
In some embodiments, for each of the plurality of backup servers, where the hardware state indicates that the hardware state of the backup server is normal and the network connection state indicates that the network connection state of the backup server is normal, the backup server may be determined to be a candidate replacement server.
By screening among the plurality of backup servers using the operating state, a backup server that can normally operate and that normally communicates in the server group can be determined as a candidate replacement server.
In some embodiments, the standby selection unit 830 may be configured to execute the flow illustrated in fig. 4 to implement the standby selection.
The parameter comparison unit 840 may be configured to determine a replacement server matching the specific server among the candidate replacement servers based on the configuration parameters of the specific server and the configuration parameters of the candidate replacement servers.
In some embodiments, the configuration parameters of the particular server that failed may be compared to the configuration parameters of the candidate replacement server. If the two match, then the candidate replacement server may be considered a replacement server that matches the particular server.
In some implementations, if there are multiple replacement servers in the candidate replacement servers that match the particular server, a replacement server for replacing the failed particular server may be determined from the multiple matching replacement servers based on predefined rules. For example, a first candidate replacement server that matches the particular server may be determined as the replacement server for replacing the failed particular server. For another example, the candidate replacement server that matches the particular server to the highest degree may be determined as the replacement server for replacing the failed particular server.
The parameter comparison unit 840 may be configured to perform the process shown in fig. 5.
The direction output unit 850 may be configured to output a complete machine replacement message for transferring the data for performing tasks stored in the specific server to the replacement server. In some embodiments, the complete machine replacement message may include identification information indicating the replacement server and information on the specific operations to perform. For example, the complete machine replacement message may be sent to the site where the server group is located, and the operation of transferring the data for performing tasks to the replacement server may be performed by an operator on site. In some embodiments, the operator may be instructed to remove the storage device storing the data for performing tasks from the specific server and install it on the replacement server.
In some embodiments, after the operator has completed the transfer of data for performing the task to the replacement server, feedback information may be received and hardware information, such as a MAC address, of the replacement server may be updated in response to the feedback information.
The information synchronization unit 860 may be configured to update server configuration parameters in a server group in case that a specific server is replaced by a replacement server.
In some embodiments, in the process illustrated in FIG. 3, the particular server being replaced may be rebooted multiple times to carry out the foregoing inspection and replacement operations. To avoid warnings triggered by this apparently abnormal state, the alert information for the specific server and/or the replacement server may be disabled during the process shown in fig. 3. For example, in the hardware fault repair process provided by the application, the operation state of the specific server can be set to "in replacement", and a server in this state is configured not to trigger a warning until the hardware fault repair process is completely finished. When the hardware fault repair process is completed, the alert information of the replacement server can be re-enabled. For example, the operation state of the replacement server may be restored to a normal state, such as "in operation".
In some embodiments, the server configuration parameters may include hardware configuration parameters and software configuration parameters. In some implementations, the physical chassis information of the failed particular server and the chassis information of the replacement server can be swapped in a configuration management database (Configuration Management Database, CMDB) to avoid errors in the CMDB information. In other implementations, the corresponding fixed resource relationship of the virtual machine on a specific server (i.e., the original host) may be mapped to the replacement server in a vending information base of the cloud service, so that the user can normally use the functions of the virtual machine after the hardware fault repair process is completed.
The status checking unit 870 may be configured to perform a secondary check on the operation state of the replacement server before outputting the complete machine replacement message instructing the operator to perform the replacement operation and after the operator has performed the replacement operation (e.g., after the information synchronization unit updates the server configuration parameters in the server group). That is, although the standby selection unit has already checked the operation state of the standby server when determining the candidate replacement server, it may check again whether the hardware state and the network connection state of the replacement server are normal, in order to ensure that the replacement server can operate normally after the hardware fault repair is complete.
The status checking unit 870 may be configured to perform the process shown in fig. 7.
By using the server hardware fault repair apparatus for a server group described above, complete machine replacement of a failed server can be realized by selecting, from the standby library, a replacement server matching the failed server. By transferring the data for executing tasks stored in the failed server to the replacement server, the tasks of the virtual machine can be executed by the replacement server instead of the failed server. With this apparatus, when a server serving as the host of a virtual machine fails, a standby machine that can replace the failed server can be quickly determined and normal service of the server group restored, without spending a great deal of time analyzing the specific failure of the server. In many cases, the host must simulate the service scenario under no-load conditions before the root cause of the hardware failure can be analyzed. Because the failed server is removed from the server group, it can be quickly emptied, providing test conditions for analyzing the failure cause and improving failure analysis efficiency and accuracy.
By using the server hardware fault repair method and apparatus for a server group described above, when a server in the server group fails, the average complete machine replacement time is 1.677 hours. If the fault were repaired without whole machine replacement, the time from fault discovery to cause analysis would be at least about 8 hours. The fault repair approach provided by the application is thus 76% faster than the conventional hardware fault repair approach.
FIG. 9 shows a schematic diagram of an implementation of a hardware failover scheme in accordance with an embodiment of the application.
As shown in fig. 9, in block 901, a complete machine replacement request may be initiated based on a detected failure of a particular server.
In block 902, available backup servers may be found.
In block 903, a candidate replacement server for replacing the failed particular server may be determined in the standby library using the standby selection unit shown in FIG. 8.
In block 904, the configuration parameters of the candidate replacement server and the failed particular server may be compared using the parameter comparison unit shown in fig. 8 to determine a replacement server that matches the particular server.
In block 905, whether the operation state of the replacement server is normal may be verified using the state checking unit shown in FIG. 8. If the operation state of the replacement server is verified to be normal, the flow may advance to block 906. If the operation state of the replacement server is verified to be abnormal, the flow may return to block 903 to attempt to find another replacement server that matches the specific server.
In block 906, a complete machine replacement message may be created and output.
In block 907, an on-site operator may connect, through authentication (token), to the server hardware fault repair system for the server group provided by the present application and obtain the complete machine replacement message. Based on the obtained complete machine replacement message, the on-site operator may perform operations to transfer the data for executing tasks stored in the specific server to the replacement server.
In block 908, the server configuration parameters in the server group may be updated using the information synchronization unit shown in FIG. 8.
In block 909, it may be determined whether the information synchronization status is normal. If the information synchronization status is normal, the flow may proceed to block 911, where the state checking unit may perform a state check on the replacement server after the information synchronization. If block 911 indicates that the operation state of the replacement server is normal, the flow may advance to block 912. If block 911 indicates that the operation state of the replacement server is abnormal, the flow may advance to block 910 to verify the operation state of the replacement server a second time, for example by manual or robotic checking.
If the information synchronization status is abnormal, the flow may proceed to block 910 to verify the information synchronization status of the replacement server a second time, determining by manual or robotic checking whether the information synchronization is complete, and may then proceed to block 912.
In block 912, the hardware failover process ends.
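The flow of blocks 901-912 can be summarized as a short sketch. All function and parameter names below are illustrative assumptions, not identifiers from the patent; the sketch only mirrors the branching of FIG. 9 (candidate selection, parameter matching, state verification, data transfer, configuration synchronization, and the secondary check).

```python
def manually_verify(server):
    # Placeholder for the manual or robotic secondary verification of block 910.
    pass

def repair_hardware_fault(failed, standby_pool, check_state, compare_params,
                          transfer_data, sync_config):
    """Replace `failed` with a matching standby server and resync the group.

    `check_state`, `compare_params`, `transfer_data`, and `sync_config` are
    hypothetical callbacks standing in for the units of FIG. 8.
    """
    for candidate in standby_pool:                    # blocks 902-903
        if not compare_params(failed, candidate):     # block 904: parameter match
            continue
        if not check_state(candidate):                # block 905: state check
            continue                                  # back to block 903
        transfer_data(failed, candidate)              # blocks 906-907
        ok = sync_config(candidate)                   # block 908
        if not ok or not check_state(candidate):      # blocks 909 and 911
            manually_verify(candidate)                # block 910: second check
        return candidate                              # block 912: flow ends
    raise RuntimeError("no matching standby server available")
```

For example, the callbacks could be wired to simple dictionaries describing each server's model and health, with the loop skipping any standby machine that fails the parameter comparison or the state check.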
Furthermore, methods or apparatuses according to embodiments of the present application may also be implemented by means of the architecture of the computing device shown in FIG. 10. As shown in FIG. 10, the computing device 1000 may include a bus 1010, one or more CPUs 1020, a read-only memory (ROM) 1030, a random access memory (RAM) 1040, a communication port 1050 connected to a network, an input/output component 1060, a hard disk 1070, and the like. A storage device in the computing device 1000, such as the ROM 1030 or the hard disk 1070, may store various data or files used in the processing and/or communication of the server hardware fault repair method provided by the present application, as well as program instructions executed by the CPU. The computing device 1000 may also include a user interface 1080. Of course, the architecture shown in FIG. 10 is merely exemplary, and one or more components of the computing device shown in FIG. 10 may be omitted as practically needed when implementing different devices.
According to another aspect of the present application, there is also provided a non-volatile computer-readable storage medium having stored thereon computer-readable instructions which, when executed by a computer, cause the computer to perform a method as described above.
According to yet another aspect of the present application, there is also provided a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the server hardware fault repair method for a server group.
Program portions of the technology may be considered "products" or "articles of manufacture" in the form of executable code and/or associated data, embodied in or carried by a computer-readable medium. A tangible, persistent storage medium may include any memory or storage used by a computer, processor, or similar device or related module, such as various semiconductor memories, tape drives, or disk drives capable of providing storage functions for software.
All or a portion of the software may sometimes communicate over a network, such as the Internet or another communication network. Such communication may load software from one computer device or processor to another: for example, from a server or host computer of the server hardware fault repair device onto the hardware platform of a computer environment, onto another computer environment implementing the system, or onto a system with similar functions related to providing the information needed for fault repair. Accordingly, another medium capable of carrying software elements, such as optical, electrical, or electromagnetic waves propagating through cables, optical fiber, or the air, may also be used as a physical connection between local devices. Physical media used for the carrier waves, such as electrical, wireless, or optical cables, may likewise be regarded as media bearing the software. As used herein, unless limited to a tangible "storage" medium, terms referring to a computer- or machine-"readable medium" denote any medium that participates in the execution of any instructions by a processor.
The application uses specific words to describe embodiments of the application. Reference to "a first/second embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the application. Thus, it should be emphasized and appreciated that two or more references to "an embodiment," "one embodiment," or "an alternative embodiment" in various places in this specification do not necessarily refer to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the application may be combined as appropriate.
Furthermore, those skilled in the art will appreciate that the various aspects of the application are illustrated and described in the context of a number of patentable categories or circumstances, including any novel and useful procedures, machines, products, or materials, or any novel and useful modifications thereof. Accordingly, aspects of the application may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.) or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," module, "" engine, "" unit, "" component, "or" system. Furthermore, aspects of the application may take the form of a computer product, comprising computer-readable program code, embodied in one or more computer-readable media.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The foregoing is illustrative of the present invention and is not to be construed as limiting thereof. Although a few exemplary embodiments of this invention have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this invention. Accordingly, all such modifications are intended to be included within the scope of this invention as defined in the following claims. It is to be understood that the foregoing is illustrative of the present invention and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The invention is defined by the claims and their equivalents.

Claims (14)

1. A server hardware fault repair method for a server group, wherein servers in the server group store data for executing tasks, the method comprising:
receiving server fault information for the server group, the server fault information indicating that a specific server in the server group has a hardware fault;
in response to the server fault information, acquiring configuration parameters of the specific server, and acquiring configuration parameters and operation states of a plurality of standby servers;
determining, based on the operation states of the plurality of standby servers, a candidate replacement server for replacing the specific server among the plurality of standby servers;
determining, based on the configuration parameters of the specific server and the configuration parameters of the candidate replacement servers, a replacement server matching the specific server among the candidate replacement servers;
outputting a complete machine replacement message for transferring the data for executing tasks stored in the specific server to the replacement server; and
updating server configuration parameters in the server group in a case where the specific server is replaced by the replacement server,
wherein the number of the plurality of standby servers is determined based on a selling weight, an availability of servers in the server group, and a model allowance of the servers, the selling weight being a weight parameter determined according to a market delivery plan of the servers, the availability being a parameter indicating a failure rate of servers in the server group, and the model allowance being a parameter indicating the number of servers in the server group of the same model as the specific server.
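The dependence of the standby count on the selling weight, availability, and model allowance can be illustrated with a minimal sketch. The claim fixes the three inputs but not a formula, so the combination below (scaling the model allowance by the failure rate and the selling weight, with a floor of one machine) is purely an assumption:

```python
import math

def standby_count(selling_weight, failure_rate, model_allowance):
    """Estimate how many standby servers to reserve for one server model.

    Hypothetical combination: expected failures among `model_allowance`
    same-model servers, weighted by the market-delivery selling weight,
    rounded up, with at least one standby machine kept available.
    """
    return max(1, math.ceil(model_allowance * failure_rate * selling_weight))
```

For instance, with 100 servers of a model, a 2% failure rate, and a selling weight of 1.0, this sketch reserves two standby machines.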
2. The hardware fault repair method of claim 1, wherein the operation state includes a hardware state and a network connection state, and
determining, based on the operation states of the plurality of standby servers, a candidate replacement server for replacing the specific server among the plurality of standby servers comprises:
selecting a standby server among the plurality of standby servers; and
determining the standby server as a candidate replacement server in a case where the hardware state of the standby server is normal and the network connection state of the standby server is normal.
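The candidate test of claim 2 reduces to a conjunction of the two state checks. A minimal sketch, assuming simple boolean fields for the two states (the field names are hypothetical):

```python
def is_candidate(standby):
    """A standby server qualifies as a candidate replacement server only when
    both its hardware state and its network connection state are normal."""
    return standby["hardware_ok"] and standby["network_ok"]
```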
3. The hardware fault repair method of claim 1, wherein the standby servers are idle servers networked with the server group.
4. The hardware fault repair method of claim 1, wherein the configuration parameters comprise at least one of: machine room unit information, product model information, equipment type information, hardware version information, and size information.
5. The hardware fault repair method of claim 1, further comprising, before the specific server is replaced by the replacement server:
performing a hardware state check and a network connection state check on the replacement server to determine whether the hardware state and the network connection state of the replacement server are normal.
6. The hardware fault repair method of claim 5, wherein,
in a case where the hardware state and the network connection state of the replacement server are both normal, the specific server is replaced by the replacement server and data transfer is performed; and
in a case where the hardware state or the network connection state of the replacement server is abnormal, the replacement server is reselected from the plurality of standby servers.
7. The hardware fault repair method of claim 1, further comprising, after updating the server configuration parameters in the server group:
performing a hardware state check and a network connection state check on the replacement server to determine whether the hardware state and the network connection state of the replacement server are normal.
8. The hardware fault repair method of claim 7, wherein,
in a case where the hardware state and the network connection state of the replacement server are both normal, the server configuration parameters in the server group are updated; and
in a case where the hardware state or the network connection state of the replacement server is abnormal, the replacement server is reselected from the plurality of standby servers.
9. The hardware fault repair method of claim 1, further comprising:
disabling alarm information for the specific server during the hardware fault repair process.
10. A server hardware fault repair apparatus for a server group, wherein servers in the server group store data for executing tasks, the apparatus comprising:
a fault information receiving unit configured to receive server fault information for the server group, the server fault information indicating that a specific server in the server group has a hardware fault;
a parameter acquisition unit configured to acquire configuration parameters of the specific server in response to the server fault information, and to acquire configuration parameters and operation states of a plurality of standby servers;
a standby selection unit configured to determine, based on the operation states of the plurality of standby servers, a candidate replacement server for replacing the specific server among the plurality of standby servers;
a parameter comparison unit configured to determine, based on the configuration parameters of the specific server and the configuration parameters of the candidate replacement servers, a replacement server matching the specific server among the candidate replacement servers;
a direction output unit configured to output a complete machine replacement message for transferring the data for executing tasks stored in the specific server to the replacement server; and
an information synchronization unit configured to update server configuration parameters in the server group in a case where the specific server is replaced by the replacement server,
wherein the number of the plurality of standby servers is determined based on a selling weight, an availability of servers in the server group, and a model allowance of the servers, the selling weight being a weight parameter determined according to a market delivery plan of the servers, the availability being a parameter indicating a failure rate of servers in the server group, and the model allowance being a parameter indicating the number of servers in the server group of the same model as the specific server.
11. The hardware fault repair apparatus of claim 10, wherein the operation state includes a hardware state and a network connection state, and the standby selection unit is configured to:
select a standby server among the plurality of standby servers; and
determine the standby server as a candidate replacement server in a case where the hardware state of the standby server is normal and the network connection state of the standby server is normal.
12. The hardware fault repair apparatus of claim 10, wherein the standby servers are idle servers networked with the server group.
13. A hardware fault repair device comprising a memory and a processor, wherein the memory stores instructions that, when executed by the processor, cause the processor to perform the hardware fault repair method of any one of claims 1-9.
14. A computer-readable storage medium having instructions stored thereon that, when executed by a processor, cause the processor to perform the hardware fault repair method of any one of claims 1-9.