CN114356643A

CN114356643A - Method for automatically discovering and recovering from task failures in a remote sensing satellite processing system

Info

Publication number: CN114356643A
Application number: CN202210244791.7A
Authority: CN
Inventors: 李景山; 赵灵军; 程玉芳
Original assignee: Aerospace Information Research Institute of CAS
Current assignee: Aerospace Information Research Institute of CAS
Priority date: 2022-03-14
Filing date: 2022-03-14
Publication date: 2022-04-15
Anticipated expiration: 2042-03-14
Also published as: CN114356643B

Abstract

The application provides a method for automatically discovering and recovering from task failures in a remote sensing satellite processing system. The method comprises: when an established error capture mechanism detects that a task has failed, re-initiating execution of the failed task; if the failed task fails again, terminating execution of the first task flow corresponding to the failed task, the first task flow comprising at least one subtask; analyzing the impact domain of the task failure on the execution of the first task flow, and determining the target task in the first task flow that has the least impact on restarting the execution of the first task flow; and creating a second task flow based on the target task, with the target task as the starting task of the second task flow. The second task flow is the task flow executed when the first task flow is restarted after being terminated.

Description

A method for automatically discovering and recovering from task failures in a remote sensing satellite processing system

Technical Field

The present application relates to the field of remote sensing satellite data processing, and in particular to a method for automatically discovering and recovering from task failures in a remote sensing satellite processing system.

Background

A remote sensing satellite processing system usually contains dozens or even hundreds of processing nodes, and hundreds of thousands of processing tasks exist at the same time. During task processing, occasional errors in hardware and system software, such as occasional read/write errors in the storage file system, occasional cluster communication errors, and occasional job scheduling system errors, can easily cause processing tasks to fail. The system then cannot process remote sensing data normally, which reduces system availability.

In the related art, such problems can be found and solved manually, but this requires meticulous work and effort and is inefficient. Alternatively, a fault-tolerance mechanism oriented to dedicated computing can be adopted to increase the system's fault tolerance, but its time overhead is large.

Summary of the Invention

The purpose of the present application is to provide a method for automatically discovering and recovering from task failures in a remote sensing satellite processing system, used for the automatic discovery and recovery of failed tasks in such a system.

The present application provides a method for automatically discovering and recovering from task failures in a remote sensing satellite processing system, comprising:

according to an established error capture mechanism, when a task execution failure is detected, re-initiating execution of the failed task; if the failed task fails again, terminating execution of the first task flow corresponding to the failed task, the first task flow comprising at least one subtask; analyzing the impact domain of the task execution failure on the execution of the first task flow, and determining the target task in the first task flow that has the least impact on restarting the execution of the first task flow; and creating a second task flow based on the target task, with the target task as the starting task of the second task flow; wherein the second task flow is the task flow executed when the first task flow is restarted after being terminated.

Optionally, before the step of re-initiating execution of the failed task when a task execution failure is detected according to the established error capture mechanism, and terminating execution of the first task flow corresponding to the failed task if the failed task fails again, the method further comprises: establishing the error capture mechanism, such that during execution of the system's task flows a corresponding standard error output file is output whenever a task fails.

Optionally, the step of re-initiating execution of the failed task when a task execution failure is detected according to the established error capture mechanism, and terminating execution of the first task flow corresponding to the failed task if the failed task fails again, comprises: monitoring the generation of standard error output files; when generation of a standard error output file is detected, determining, from the generated file, the task information of the first task corresponding to the first task node at which task execution failed; and re-executing the first task according to the task information of the first task.

Optionally, the step of analyzing the impact domain of the task execution failure on the execution of the first task flow, and determining the target task in the first task flow that has the least impact on restarting the execution of the first task flow, comprises: when re-execution of the first task fails, determining the first task flow corresponding to the first task according to the task information of the first task; wherein the first task node is the task node corresponding to the first task in the first task flow.

Optionally, after determining the first task flow corresponding to the first task according to the task information of the first task when re-execution of the first task fails, the method further comprises: traversing, in reverse order, the input data of the tasks corresponding to the first task node and each task node before it in the first task flow; and determining as the target task the second task corresponding to the second task node that is closest to the first task node in the first task flow and for which the input data required to execute the task exists.

Optionally, determining, from the generated standard error output file, the task information of the first task corresponding to the first task node at which task execution failed comprises: obtaining the standard error output file through a job scheduling system, and determining the task information of the first task according to the failed task name and/or the failed task number in the standard error output file; wherein the job scheduling system comprises any one of the following: Portable Batch System (PBS), Load Sharing Facility (LSF), and Simple Linux Utility for Resource Management (SLURM).

In the method provided by the present application, an error capture mechanism is first established. According to this mechanism, when a task execution failure is detected, execution of the failed task is re-initiated; if the failed task fails again, execution of the first task flow corresponding to the failed task is terminated. Next, the impact domain of the task execution failure on the execution of the first task flow is analyzed, and the target task in the first task flow that has the least impact on restarting the execution of the first task flow is determined. Finally, a second task flow is created based on the target task, with the target task as its starting task. This solves the problem of processing tasks failing because of occasional faults in the processing system, replaces cumbersome traditional practices such as manually screening for failed tasks and manually re-initiating them, and improves system availability.

Description of the Drawings

To illustrate the technical solutions of the present application or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present application; those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is the first schematic flowchart of the method for automatically discovering and recovering from task failures in a remote sensing satellite processing system provided by the present application;

Fig. 2 is the second schematic flowchart of the method for automatically discovering and recovering from task failures in a remote sensing satellite processing system provided by the present application;

Fig. 3 is a schematic flowchart of capturing task errors provided by the present application;

Fig. 4 is a schematic flowchart of the failed-task impact domain analysis provided by the present application;

Fig. 5 shows the relevant information in the flow description file provided by the present application.

Detailed Description

To make the purpose, technical solutions, and advantages of the present application clearer, the technical solutions in the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.

The terms "first", "second", and the like in the description and claims of the present application are used to distinguish similar objects and do not describe a specific order or sequence. Terms used in this way are interchangeable under appropriate circumstances, so that the embodiments of the present application can be practiced in sequences other than those illustrated or described herein. Objects distinguished by "first", "second", and the like are usually of one type, and their number is not limited; for example, there may be one first object or more than one. In addition, "and/or" in the description and claims indicates at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.

In a remote sensing satellite processing system with dozens or hundreds of processing nodes and a petabyte (PB) scale storage system, hundreds of thousands of processing tasks exist at the same time. Such a system is a typical compute-intensive and data-intensive processing system. During task processing, occasional errors in hardware and system software, such as occasional read/write errors in the storage file system, occasional cluster communication errors, and occasional job scheduling system errors, can easily cause processing tasks to fail. The system then cannot process remote sensing data normally, which reduces system availability.

To solve such problems, there are currently three main approaches: 1) Errors are found manually: an operator determines the step at which the error occurred and restarts from a reasonable intermediate step or from the initial step, which requires meticulous work and effort and is inefficient. 2) Fault tolerance and restart measures designed for general big data processing jobs are used, such as Apache Spark, a fast, general-purpose computing engine designed for large-scale data processing. It can manage redundant state based on fine-grained data or tasks, but, constrained by the generality of the framework, it is difficult to optimize further for fine task granularity, complex dependencies between tasks, and diverse fault tolerance requirements. 3) Fault tolerance oriented to dedicated computing is adopted, adding fault tolerance and restart mechanisms through checkpoints, as in Apache Flink, which uses a "distributed snapshot" checkpoint mechanism for fault tolerance. When executing batch tasks, it achieves fault tolerance by re-executing tasks from scratch; although simple to implement, this incurs a large time overhead.

To improve the efficiency of discovering and recovering failed tasks, FIG. 1 shows the principle of the method for automatically discovering and recovering from task failures in a remote sensing satellite processing system provided by the embodiment of the present application. First, an error capture mechanism for task execution failures is established, and task programs are developed in accordance with it, so that failure information is written to a standard error output file when a task fails. Once a failed task is discovered, it is immediately re-initiated and the target flow continues. If the task fails again, execution of the first task flow corresponding to the failed task is terminated, an impact domain analysis is performed on that task flow, and task execution is resumed according to the analysis result.

The present application proposes a method for automatically discovering and recovering from task failures in a remote sensing satellite processing system through the steps of capturing task error logs, analyzing the impact domain of the failed task's flow, and restarting the relevant flow. It solves the problem of processing tasks failing because of occasional faults in the processing system and improves system availability.

The method provided by the embodiments of the present application is described in detail below through specific embodiments and their application scenarios, with reference to the accompanying drawings.

As shown in FIG. 2, an embodiment of the present application provides a method for automatically discovering and recovering from task failures in a remote sensing satellite processing system. The method may include the following steps 201 to 203:

Step 201: according to the established error capture mechanism, when a task execution failure is detected, re-initiate execution of the failed task; if the failed task fails again, terminate execution of the first task flow corresponding to the failed task.

The first task flow includes at least one subtask.

Exemplarily, the first task flow is any one of the multiple task flows processed by the remote sensing satellite processing system. Typically, processing each orbit of data involves multiple logically ordered task flows, each containing at least one subtask, and tens of thousands of concurrent task flows are processed every day.

In a possible implementation, to monitor the multiple task flows above and discover any task flow in which a task has failed, the embodiment of the present application also provides a mechanism for capturing task errors.

Exemplarily, before steps 201 to 203, the method provided by the embodiment of the present application also needs to establish a task failure monitoring mechanism, namely the error capture mechanism mentioned above. Establishing this mechanism may specifically include the following step 204:

Step 204: establish the error capture mechanism, such that during execution of the system's task flows a corresponding standard error output file is output whenever a task fails.

Exemplarily, taking the execution of the first task flow as an example, the scheduling system captures the standard error output of any failed task during execution of the first task flow and writes it to the standard error output file.

Exemplarily, task program development specifications and job script writing specifications are formulated for task processing in the remote sensing satellite processing system. When the processing system runs, it generates a job script for each task according to the configured processing flow. The job script includes call information such as the task's execution path and input parameters, as well as the standard error output path used when the task fails.

When the processing system executes the task, it submits the task's job script to the scheduling system. The scheduling system calls the task program indicated by the execution path in the job script and runs it with the input parameters given in the job script.

When an exception occurs while the task program is running and causes the task to fail, a task program written to the specification writes the error to the standard error output file, and the scheduling system directs the task program's output, together with operating-system-level standard error, to the standard error output path above. The processing system then determines that the task has failed.
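The job script generation described above could be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the applicant's implementation: the function name `build_job_script`, the shebang line, and the choice to place the program and its parameter file on a single command line are all hypothetical. The only element taken from this description is the `#PBS -e` directive, which PBS uses to set the path for a job's standard error.

```python
def build_job_script(stderr_path: str, program: str, param_file: str) -> str:
    """Return the text of a PBS job script for one task (illustrative only)."""
    lines = [
        "#!/bin/bash",              # assumed shell; not specified in the source
        f"#PBS -e {stderr_path}",   # standard error output path for the job
        f"{program} {param_file}",  # task program followed by its parameter file
    ]
    return "\n".join(lines) + "\n"
```

The returned text would then be written to a file and submitted to the scheduler; how the processing system actually submits it is not detailed in the source.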

Exemplarily, by monitoring the standard error output files described above, task flows can be monitored and task failures discovered promptly.

For example, as shown in FIG. 3, the program that runs each task must be written according to the specification. The processing system generates the run script (that is, the job script above) and submits it to the job scheduling system. When the job scheduling system captures standard error output, it generates an error output file, and the processing system confirms the failed task.

It can be understood that, during processing, dependencies exist both between task flows and between the tasks within each task flow. When the task at one task node fails, the tasks at subsequent task nodes are affected. Therefore, when a task failure is detected during execution of the first task flow, the following two processing modes are applied in sequence:

Mode 1:

In processing mode 1, the task is re-executed immediately.

Exemplarily, when the first task at the first task node in the first task flow fails, the first task may be determined as the target task and re-executed. That is, the system first attempts to re-execute the first task directly. If this succeeds, the failure had only a minor impact, and re-executing the first task is enough to restore normal operation.

Exemplarily, the target task includes the task corresponding to the current task node at which task execution failed.

Exemplarily, step 201 may include the following steps 201a and 201b:

Step 201a: monitor the generation of standard error output files; when generation of a standard error output file is detected, determine, from the generated file, the task information of the first task corresponding to the first task node at which task execution failed.

Step 201b: re-execute the first task according to the task information of the first task.

Exemplarily, using the error capture mechanism established in step 204, the system can monitor in real time whether a new standard error output file has been generated. When one is detected, the failed task can be determined from the generated file. All failed tasks in the system can be detected through steps 201a and 201b.
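The monitoring in step 201a could be approximated by periodically scanning the standard error output directory. The sketch below is a minimal illustration in Python, not the applicant's implementation. The function name `find_new_error_files` and the polling approach are hypothetical, and it assumes that only a non-empty file signals a failure, on the grounds that some schedulers may create an empty standard error file even for a successful job. The source itself speaks only of a file being generated.

```python
import os
from typing import List, Set


def find_new_error_files(log_dir: str, seen: Set[str]) -> List[str]:
    """Return paths of non-empty stderr files that have not been reported yet.

    `seen` is updated in place so that each file is reported only once.
    """
    new_files = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if path in seen or not os.path.isfile(path):
            continue
        # Assumption: an empty file means no error was written.
        if os.path.getsize(path) > 0:
            seen.add(path)
            new_files.append(path)
    return new_files
```

A caller would invoke this function at a fixed interval and pass each returned path on to the step that extracts the task information.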

Mode 2:

If processing mode 1 fails, processing mode 2 is started.

In processing mode 2, the impact domain is analyzed and the matching task flow is restarted.
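The order in which the two modes are applied can be sketched as follows. This is a minimal illustration in Python; the function `handle_failure` and its three callback parameters are hypothetical names standing in for operations that the source describes only in prose.

```python
from typing import Callable


def handle_failure(
    rerun_task: Callable[[], bool],
    terminate_flow: Callable[[], None],
    recover_flow: Callable[[], None],
) -> str:
    """Apply mode 1 (one immediate retry), then mode 2 if the retry fails."""
    # Mode 1: re-execute the failed task once.
    if rerun_task():
        return "retried"
    # Mode 2: stop the first task flow, then analyze the impact domain
    # and restart from the chosen target task.
    terminate_flow()
    recover_flow()
    return "restarted"
```

Passing the operations in as callbacks keeps the sketch independent of any particular scheduler.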

Step 202: analyze the impact domain of the task execution failure on the execution of the first task flow, and determine the target task in the first task flow that has the least impact on restarting the execution of the first task flow.

Exemplarily, the impact domain reflects the degree to which the task failure affects recovery of the task flow.

Exemplarily, after execution of the first task flow is terminated, the first task flow needs to be analyzed. The purpose of the analysis is to determine the task node from which a restart has the least impact on the execution of the first task flow.

It can be understood that, when the first task flow is terminated because of a task failure, it could simply be restarted from its initial task. However, doing so would repeat tasks that have already executed normally and waste system resources. Therefore, when a task fails, the failure needs to be analyzed to determine the target task in the task flow that has the least impact on restarting the execution of the first task flow.

Specifically, the upstream task that needs to be re-executed can be determined by checking whether the task nodes preceding the first task node, at which the failure occurred, still have the input data required to execute their tasks.

Exemplarily, the relevant information of the upstream task nodes of the first task node may be determined through the following step; that is, step 202 may include the following step 202a:

Step 202a: when re-execution of the first task fails, determine the first task flow corresponding to the first task according to the task information of the first task.

The first task node is the task node corresponding to the first task in the first task flow.

Exemplarily, the task failure monitoring mechanism can monitor whether a new standard error output file has been created and, once one is detected, determine the task information of the failed task from that file.

Exemplarily, the first task may be determined from the relevant information in the standard error output file that is output when the first task fails. The standard error output file may include information such as the task order number of the first task, the path of its task program, and the input data required for its execution. The file may also include flow information of the first task flow, from which the task information of all upstream task nodes in the task flow containing the first task node (that is, the first task flow) can be determined.

For example, if the task program is written in C++, error messages should be output with cerr; if the task program is written in Java, error messages should be output with System.err.println. Taking the task program AUTOCHECKGEO as an example, the job script file that is automatically generated by the processing system and submitted to the job scheduling system must contain the following:

#PBS -e /public/VRS//work/vrs20220105011831/Log/pbs/

/public/VRS//main/modules/bin/AUTOCHECKGEO

/public/VRS//work/vrs20220105011831/Parameter/AUTOCHECKGEO.xml

Here, vrs20220105011831 is the unique task order number. /public/VRS//work/vrs20220105011831/Log/pbs/ specifies the error file output path; the processing system writes the captured error information to the file AUTOCHECKGEO.sh.e2230612 in that directory, where 2230612 in the file name is the number of the submitted job. /public/VRS//main/modules/bin/AUTOCHECKGEO and /public/VRS//work/vrs20220105011831/Parameter/AUTOCHECKGEO.xml specify, respectively, the program called by the job (AUTOCHECKGEO) and the file containing its input parameters (AUTOCHECKGEO.xml).
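The naming convention in this example suggests how task information might be recovered from the path of a standard error output file. The sketch below is a minimal illustration in Python. The names `FailedTask` and `parse_error_file` are hypothetical, and the two patterns are generalized from the single example above (a `<task>.sh.e<job number>` file name beneath a `/work/<task order number>/` directory), so they are assumptions rather than a documented format.

```python
import re
from typing import NamedTuple, Optional


class FailedTask(NamedTuple):
    order_id: Optional[str]  # task order number, if present in the path
    task_name: str           # name of the task program
    job_number: str          # job number assigned on submission


# Patterns inferred from the one example in the text (assumption).
_FILE_RE = re.compile(r"^(?P<task>.+)\.sh\.e(?P<job>\d+)$")
_ORDER_RE = re.compile(r"/work/(?P<order>[^/]+)/")


def parse_error_file(path: str) -> FailedTask:
    """Extract task information from a stderr file path."""
    file_name = path.rsplit("/", 1)[-1]
    match = _FILE_RE.match(file_name)
    if match is None:
        raise ValueError(f"unrecognized stderr file name: {file_name}")
    order = _ORDER_RE.search(path)
    return FailedTask(
        order_id=order.group("order") if order else None,
        task_name=match.group("task"),
        job_number=match.group("job"),
    )
```

Applied to the example path, the function separates the task order number, the task name, and the job number that the paragraph above identifies by hand.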

具体地，上述步骤201a，可以包括以下步骤202a1：Specifically, the above step 201a may include the following step 202a1:

步骤201a1、采用作业调度系统获取所述标准错误输出文件，并根据所述标准错误输出文件的错误任务名称，和/或，错误任务编号确定所述第一任务的任务信息。Step 201a1: Use a job scheduling system to acquire the standard error output file, and determine the task information of the first task according to the error task name and/or the error task number of the standard error output file.

其中，所述作业调度系统包括以下任一项：便携式批处理系统（Portable BatchSystem，PBS）、加载共享设施（Load Sharing Facility，LSF）、用于资源管理的简单Linux实用程序（Simple Linux Utility for Resource Management，SLURM）。Wherein, the job scheduling system includes any one of the following: Portable Batch System (PBS), Load Sharing Facility (LSF), Simple Linux Utility for Resource Management (Simple Linux Utility for Resource) Management, SLURM).

For example, when a task fails, task information such as its task name and task number, together with the error information, is written into the name or the content of the corresponding standard error output file. When the system detects that a new standard error output file has been generated, it can identify the failed task from the information in that file.
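A minimal sketch of this detection step is shown below, assuming the file naming seen in the AUTOCHECKGEO example (`<task>.sh.e<job number>`). The names `FailedTask`, `parse_error_file_name`, and `scan_new_error_files` are hypothetical, and a real monitor might also inspect the file content rather than only the file name.

```python
import re
from pathlib import Path
from typing import List, NamedTuple, Optional, Set

# Pattern assumed from the example file name AUTOCHECKGEO.sh.e2230612.
ERROR_FILE_RE = re.compile(r"^(?P<task>.+)\.sh\.e(?P<job>\d+)$")


class FailedTask(NamedTuple):
    task_name: str
    job_number: int


def parse_error_file_name(file_name: str) -> Optional[FailedTask]:
    """Extract the task name and job number, or None if the name does not match."""
    match = ERROR_FILE_RE.match(file_name)
    if match is None:
        return None
    return FailedTask(match.group("task"), int(match.group("job")))


def scan_new_error_files(log_dir: Path, seen: Set[str]) -> List[FailedTask]:
    """Return tasks for error files that appeared since the previous scan.

    `seen` is updated in place so that each file is reported only once.
    """
    found = []
    for path in sorted(log_dir.iterdir()):
        if path.name in seen:
            continue
        seen.add(path.name)
        task = parse_error_file_name(path.name)
        if task is not None:
            found.append(task)
    return found
```

Calling `scan_new_error_files` periodically on the `Log/pbs/` directory of a task order would give the list of tasks to re-execute.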

For example, the standard error output file may be obtained through the Portable Batch System (PBS), Load Sharing Facility (LSF), or Simple Linux Utility for Resource Management (SLURM) described above, or through a custom method.
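Each of the three schedulers lets a job script name the destination of its standard error stream through a script directive. The sketch below maps a scheduler name to that directive; the table and the function `stderr_directive` are illustrative assumptions, and the application itself only shows the PBS form.

```python
# Script directives that set the standard error destination.
STDERR_DIRECTIVES = {
    "PBS": "#PBS -e {path}",
    "LSF": "#BSUB -e {path}",
    "SLURM": "#SBATCH --error={path}",
}


def stderr_directive(scheduler: str, path: str) -> str:
    """Return the job script line that redirects stderr for the scheduler."""
    try:
        template = STDERR_DIRECTIVES[scheduler.upper()]
    except KeyError:
        raise ValueError(f"unsupported scheduler: {scheduler}") from None
    return template.format(path=path)
```

Keeping the scheduler-specific syntax in one table means the rest of the error capture logic does not need to know which scheduler is in use.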

For example, after the above step 202a, the above step 202 may further include the following steps 202b and 202c:

Step 202b: Traverse, in reverse order, the input data of the tasks corresponding to the first task node and to each task node preceding it in the first task flow.

Step 202c: Determine as the target task the second task, which corresponds to the second task node in the first task flow that is closest to the first task node and for which the input data required to execute the task exists.

For example, after re-execution of the first task fails, execution of the first task flow needs to be resumed from a task node upstream of the first task node. Following a nearest-first principle, the task corresponding to the task node that is closest to the first task node and for which the input data required to execute the task exists is determined as the target task.

It can be understood that if the node upstream of the first task node in the task flow has not retained the input data for its task, execution of the first task flow cannot be resumed from that node. The system must then check whether the next node further upstream has retained the input data its task requires, and so on, until it reaches a node for which the required input data exists. That node becomes the starting node of the resumed task flow, and its task is determined as the target task.

It should be noted that if neither the first task node nor any upstream node, back to the starting task node of the task flow, has the input data required to execute its task, that is, if no second task node can be found, re-execution of the first task flow is terminated.
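Steps 202b and 202c, together with the termination case just described, can be sketched as a reverse scan over an ordered task flow. This is a simplified illustration that treats the flow as a linear sequence; the function name `find_target_task` and the `has_input` callback are assumptions.

```python
from typing import Callable, Optional, Sequence


def find_target_task(
    flow: Sequence[str],
    failed_index: int,
    has_input: Callable[[str], bool],
) -> Optional[str]:
    """Return the nearest task at or before the failed one whose input data exists.

    flow is the ordered list of task names in the first task flow,
    failed_index is the position of the failed task, and has_input
    reports whether the input data of a task is still available.
    Returns None when no task qualifies, in which case re-execution
    of the flow is abandoned.
    """
    # Walk backwards from the failed node towards the start of the flow.
    for index in range(failed_index, -1, -1):
        if has_input(flow[index]):
            return flow[index]
    return None
```

Scanning from the failed node backwards means the first hit is the node closest to the failure, so as little earlier work as possible is repeated.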

For example, FIG. 4 is a schematic diagram of the impact domain analysis flow for a failed task provided by an embodiment of the present application. When an error output file is found, execution of the failed task is re-initiated. If the re-execution succeeds, the recovery flow ends. If it fails, impact domain analysis is performed to determine the target task that has the least impact on resuming the first task flow, a second task flow is created from that target task, and the input parameters required to execute the task are then obtained so that the data needed for execution is guaranteed.
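The decision logic of FIG. 4 can be summarized in a short sketch. All names here (`handle_failed_task` and its four callbacks) are hypothetical; the callbacks stand in for the scheduler interaction, the impact domain analysis, and the creation of the second task flow.

```python
from typing import Callable, Optional


def handle_failed_task(
    failed_task: str,
    rerun: Callable[[str], bool],
    terminate_first_flow: Callable[[], None],
    find_target: Callable[[str], Optional[str]],
    start_second_flow: Callable[[str], None],
) -> str:
    """Retry once, then fall back to restarting from the target task."""
    # First attempt: simply re-execute the failed task.
    if rerun(failed_task):
        return "recovered-by-retry"
    # The retry failed again, so the first task flow is terminated.
    terminate_first_flow()
    # Impact domain analysis: pick the task to restart from.
    target = find_target(failed_task)
    if target is None:
        return "abandoned"
    # Create and run the second task flow starting at the target task.
    start_second_flow(target)
    return "restarted-from:" + target
```

The single retry handles transient faults cheaply, and the more expensive flow restart is attempted only when the retry does not help.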

Step 203: Create a second task flow based on the target task, and take the target task as the starting task of the second task flow.

The second task flow is the task flow executed when the first task flow is restarted after having been terminated.

For example, once the target task in the first task flow that has the least impact on restarting the first task flow has been determined, execution of the first task flow can be restarted and the target task re-executed.

It should be noted that the first task flow has already been terminated, because the failed task was re-executed and failed again. Therefore, to complete the remaining tasks of the first task flow, a new task flow must be created, namely the second task flow described above.

For example, after the task information of the first task has been obtained from the standard error output file, the first task flow corresponding to the first task can be matched according to that task information, so that the first task flow can be controlled and taken as the recovery flow. The task flow can be matched by means of the flow description file.

FIG. 5 shows the relevant information in the flow description file, where L1A denotes the data level, workflowid=87 indicates that the task flow number is 87, and paraformat lists the task programs contained in the task flow and their order. Based on the data level, the task flow description table shows that the minimum impact domain is L1A, so the flow in which L1A is located is taken as the recovery flow. The starting input data of the recovery flow is then obtained. If the starting input data no longer exists, the system traces back through the upstream data levels until it reaches the raw data. If data exists at some level during the trace-back, the flow at that level is designated as the task recovery flow; if no data exists at any level, no recovery flow is matched. In the flow description file shown in FIG. 5, if the L1A data does not exist, the processing system matches upward in turn, starting with the CAT level and continuing up to the R0 level.
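The level-by-level trace-back can be sketched as follows. The level order L1A, CAT, R0 and the flow number 87 for L1A come from the description of FIG. 5; the function name `match_recovery_flow`, the `data_exists` callback, and any flow numbers other than 87 used when calling it are assumptions made for illustration.

```python
from typing import Callable, Mapping, Optional, Sequence, Tuple


def match_recovery_flow(
    levels: Sequence[str],
    flow_ids: Mapping[str, int],
    data_exists: Callable[[str], bool],
) -> Optional[Tuple[str, int]]:
    """Return (data level, task flow number) of the recovery flow, or None.

    levels is ordered from the minimum impact domain back to the raw
    data, for example ["L1A", "CAT", "R0"]. flow_ids maps each level to
    the number of the task flow that starts from data of that level.
    """
    for level in levels:
        # The first level whose input data still exists wins.
        if data_exists(level):
            return level, flow_ids[level]
    # No data at any level: no recovery flow is matched.
    return None
```

For instance, with only the R0 data still available, the call falls through L1A and CAT and returns the flow registered for R0.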

Specifically, before execution of the first task flow is restarted, job scripts for the tasks in the first task flow that need to be executed or re-executed may be generated in turn from the input data obtained above, including the job scripts of the target task and of the tasks executed after it. Then, following the execution order of the generated job scripts, recovery of the first task flow is completed by creating and executing a new task flow (namely the second task flow described above).

For example, as shown in FIG. 5 and taking AUTOCHECKGEO as an example, if the current input data, the L1A data, exists, the matched task flow is task flow No. 87, which is taken as the recovery flow. When the error occurred, the input parameter file was AUTOCHECKGEO.xml and the executed program was AUTOCHECKGEO. When the task flow is restarted, it is re-executed according to recovery flow No. 87. Because the first program of the recovery flow is CORRECT, the processing system first generates CORRECT.xml and CORRECT.sh, then submits the CORRECT.sh script, and executes the task programs CORRECT, AUTOCHECKGEO, and so on in flow order.
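The construction of the second task flow can be sketched as slicing the ordered program list of the recovery flow at the target task and pairing each remaining program with its parameter file and job script. The `<program>.xml` and `<program>.sh` naming follows the CORRECT and AUTOCHECKGEO examples; the function name `plan_second_flow` and the tuple layout are assumptions.

```python
from typing import List, Sequence, Tuple


def plan_second_flow(programs: Sequence[str], target: str) -> List[Tuple[str, str, str]]:
    """List (program, parameter file, job script) from the target task onward.

    programs is the ordered program list of the recovery flow. A
    ValueError is raised if the target task is not part of the flow.
    """
    start = list(programs).index(target)
    return [(name, f"{name}.xml", f"{name}.sh") for name in programs[start:]]


plan = plan_second_flow(["CORRECT", "AUTOCHECKGEO"], "CORRECT")
```

Each entry of `plan` corresponds to one job to generate and submit, in order, so the failed AUTOCHECKGEO step runs again only after CORRECT has regenerated its input.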

It should be noted that the task failure handling method provided in this application is not limited to satellite data processing; it is equally applicable to other fields that run large batches of concurrent jobs, such as satellite data quality evaluation and deep processing of satellite data.

In the method for automatic task failure discovery and recovery in a remote sensing satellite processing system provided by the embodiments of the present application, an error capture mechanism is first established. According to this mechanism, when a task execution failure is detected, execution of the failed task is re-initiated; if the failed task fails again, execution of the first task flow corresponding to the failed task is terminated. Next, the impact domain of the task execution failure on the execution of the first task flow is analyzed, and the target task in the first task flow that has the least impact on restarting the first task flow is determined. Finally, a second task flow is created based on the target task, with the target task as its starting task. This resolves processing task failures caused by occasional faults in the processing system, replaces cumbersome traditional practices such as manually screening for failed tasks and manually resubmitting them, and improves the availability of the system.

It should be noted that, in the embodiments of the present application, the method for automatic task failure discovery and recovery in a remote sensing satellite processing system shown in each of the above method figures is described by way of example with reference to one figure of the embodiments. In a specific implementation, the method shown in each of the above method figures may also be implemented in combination with any other combinable figure illustrated in the above embodiments, which is not repeated here.

From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software together with a necessary general-purpose hardware platform, and of course also by hardware. Based on this understanding, the above technical solutions, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

1. A method for automatic task failure discovery and recovery in a remote sensing satellite processing system, characterized by comprising the following steps:

re-initiating execution of a failed task when a task execution failure is detected according to an established error capture mechanism; if the failed task fails again, terminating execution of the first task flow corresponding to the failed task; the first task flow comprises at least one subtask;

analyzing the impact domain of the task execution failure on the execution of the first task flow, and determining, in the first task flow, the target task that has the least impact on restarting the execution of the first task flow;

creating a second task flow based on the target task, and taking the target task as the starting task of the second task flow;

wherein the second task flow is the task flow executed when the first task flow is restarted after the first task flow has been terminated.

2. The method according to claim 1, characterized in that, before the re-initiating of execution of the failed task when a task execution failure is detected according to the established error capture mechanism and the terminating, if the failed task fails again, of execution of the first task flow corresponding to the failed task, the method further comprises:

establishing the error capture mechanism, such that a corresponding standard error output file is output if a task fails during execution of a task flow of the system.

3. The method according to claim 2, characterized in that the re-initiating of execution of the failed task when a task execution failure is detected according to the established error capture mechanism, and the terminating, if the failed task fails again, of execution of the first task flow corresponding to the failed task, comprise:

monitoring for generation of a standard error output file and, when generation of a standard error output file is detected, determining, from the generated standard error output file, task information of a first task corresponding to a first task node at which task execution failed;

and re-executing the first task according to the task information of the first task.

4. The method according to claim 3, wherein the analyzing the influence domain of the task execution failure on the execution process of the first task flow and determining the target task in the first task flow with the least influence on restarting the execution process of the first task flow comprises:

under the condition that the re-execution of the first task fails, determining the first task flow corresponding to the first task according to the task information of the first task;

the first task node is a task node corresponding to the first task in the first task flow.

5. The method according to claim 4, wherein after determining the first task flow corresponding to the first task according to task information of the first task in case of failure to re-execute the first task, the method further comprises:

traversing, in reverse order, the input data of the tasks corresponding to the first task node and to each task node preceding it in the first task flow;

and determining a second task corresponding to a second task node which is closest to the first task node in the first task flow and has input data required by task execution as the target task.

6. The method according to claim 3, wherein the determining task information of the first task corresponding to the first task node with failed task execution according to the generated standard error output file comprises:

acquiring the standard error output file by using a job scheduling system, and determining the task information of the first task according to a failed task name and/or a failed task number in the standard error output file;

wherein the job scheduling system comprises any one of: a Portable Batch System (PBS), a Load Sharing Facility (LSF), and a Simple Linux Utility for Resource Management (SLURM).