CN113687831A

CN113687831A - Method, device, computer equipment and storage medium for generating data acquisition script

Info

Publication number: CN113687831A
Application number: CN202110770812.4A
Authority: CN
Inventors: 陈家银; 潘帅; 张伟; 陈曦; 麻志毅
Original assignee: Advanced Institute of Information Technology AIIT of Peking University; Hangzhou Weiming Information Technology Co Ltd
Current assignee: Advanced Institute of Information Technology AIIT of Peking University; Hangzhou Weiming Information Technology Co Ltd
Priority date: 2021-07-07
Filing date: 2021-07-07
Publication date: 2021-11-23

Abstract

The application provides a method and a device for generating a data acquisition script, computer equipment and a storage medium, wherein the method for generating the data acquisition script comprises the following steps: respectively acquiring text information contained in each node of the webpage data aiming at the webpage data of the target site; generating characteristic statistical information of each label according to the multi-dimensional characteristics of various labels in the webpage data; analyzing the webpage data through the trained analysis model based on the text information and the characteristic statistical information, identifying the path of the data to be acquired, and generating a data acquisition script of the target site. According to the method and the device, the data acquisition script can be automatically generated through the trained model, a large amount of manpower and material resources are saved, and the webpage data analysis speed is effectively improved.

Description

Method and device for generating data acquisition script, computer equipment and storage medium

Technical Field

The application belongs to the technical field of data acquisition, and particularly relates to a method and a device for generating a data acquisition script, computer equipment and a storage medium.

Background

Currently, in a data collection task scenario, it is usually necessary to collect required information content from a web document in HTML format. Existing acquisition modes generally fall into two types: a template-based acquisition mode and a statistics-based acquisition mode.

The template-based acquisition mode mainly locates the path of the required content according to the internal structure and content of an HTML document by using open-source parsing tools such as XPath, CSS selectors, Beautiful Soup and the like. The most common method is to locate the XPath of the content and then generate an acquisition script from the located XPath for data acquisition. The statistics-based acquisition mode mainly counts some features of an HTML document (such as the number of tags, the text length in the tags, the text density in the tags and the like) and fuses them to generate a discrimination model; when the predicted value of a certain tag exceeds a threshold value, the structure of that tag is considered to be the XPath of the required content, and an acquisition script is then generated from the XPath for data acquisition.

However, both of the above acquisition modes have significant disadvantages. For example, the template-based mode requires manually analyzing the XPath of the required content site by site, which consumes a lot of manpower and time when the number of sites is large. In addition, when the structure of an analyzed web page changes slightly, the originally located XPath may be affected, resulting in increased maintenance cost and poor robustness. The discrimination model generated by the statistics-based mode is too simple, so its identification accuracy is low and it is difficult to adapt to complex acquisition task scenarios.

Disclosure of Invention

The application provides a method and a device for generating a data acquisition script, computer equipment and a storage medium, by which the data acquisition script can be automatically generated through a trained model, saving a large amount of manpower and material resources and effectively improving the webpage data analysis speed.

An embodiment of a first aspect of the present application provides a method for generating a data acquisition script, where the method includes:

respectively acquiring text information contained in each node of webpage data aiming at the webpage data of a target site;

generating characteristic statistical information of each label according to the multi-dimensional characteristics of various labels in the webpage data;

analyzing the webpage data through a trained analysis model based on the text information and the characteristic statistical information, identifying a path of the data to be acquired, and generating a data acquisition script of the target site.

Optionally, the analyzing the web page data through a trained analysis model based on the text information and the feature statistical information, identifying a path of data to be acquired, and generating a data acquisition script of the target site includes:

traversing each node of the webpage data in a hierarchy mode, and sequentially forming a text representation vector corresponding to each node based on text information of each node;

performing convolution and pooling operation on the text representation vectors in sequence to form new text representation vectors;

forming a label statistical vector based on the characteristic statistical information of each label, and splicing the new text representation vector and the label statistical vector to obtain a spliced vector;

and sequentially connecting the spliced vector with a fully connected layer and an output layer to identify a path of data to be acquired and generate a data acquisition script of the target site.

Optionally, before analyzing the text included in each node through a trained analysis model based on the text information and the feature statistical information and generating a data acquisition script, the method further includes:

acquiring parsed information and feature statistical information of parsed webpage data based on a plurality of parsed webpage data with the same site type, and training and generating the analysis model through the parsed information and the feature statistical information; the parsed information includes at least an XPath and site information.

Optionally, the acquiring, based on a plurality of parsed web page data with the same site type, parsed information and feature statistical information of the parsed web page data, and training and generating the parsing model through the parsed information and the feature statistical information includes:

traversing, level by level, each node of each parsed webpage data with the same site type, acquiring the parsed information of each node of the parsed webpage data, and forming a training data set based on all the parsed information;

generating a characteristic statistical training vector of each label according to the multidimensional characteristics of various labels in a plurality of analyzed webpage data with the same site type;

and training and generating the analytical model through the training data set and the feature statistical training vector.

Optionally, the forming a training data set based on all parsed information includes:

and marking the text information corresponding to each node according to whether the required information is contained or not based on all the analyzed information, and generating a training data set based on the marked text information.

Optionally, the labeling the text information corresponding to each node according to whether the text information includes the required information includes:

for each node, if the child node of the node contains the required information, performing first labeling on the text information of the child node, and performing second labeling on the text information of the node; if the node does not contain the required information, carrying out third labeling on the text information of the node; and the first label is used for enabling the analysis model to stop traversing the nodes of the webpage data.

Optionally, the multi-dimensional features include at least a number of labels, a density of text, and weight information.

An embodiment of a second aspect of the present application provides an apparatus for generating a data acquisition script, the apparatus including:

the text module is used for respectively acquiring text information contained in each node of the webpage data aiming at the webpage data of the target site;

the label module is used for generating characteristic statistical information of each label according to the multi-dimensional characteristics of various labels in the webpage data;

and the script module is used for analyzing the webpage data through a trained analysis model based on the text information and the characteristic statistical information, identifying a path of the data to be acquired, and generating a data acquisition script of the target site.

Embodiments of a third aspect of the present application provide a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the computer program to implement the method according to the first aspect.

An embodiment of a fourth aspect of the present application provides a computer-readable storage medium having a computer program stored thereon, the program being executable by a processor to implement the method according to the first aspect.

The technical scheme provided in the embodiment of the application at least has the following technical effects or advantages:

according to the method for generating the data acquisition script, according to the text information contained in each node of the webpage data and the characteristic statistical information of each label, the webpage data are analyzed through the trained analysis model, the path of the data to be acquired can be identified, the data acquisition script of the target site is generated, the method can be executed by computer equipment, data acquisition is automatically carried out according to the data acquisition script, a large amount of manpower and time cost are saved, the efficiency and accuracy of webpage analysis and data acquisition are effectively improved, and the maintenance cost of a later-stage acquisition task is reduced.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings.

In the drawings:

FIG. 1 is a schematic flow chart illustrating a method for generating a data acquisition script according to an embodiment of the present application;

FIG. 2 illustrates a flow diagram for parsing web page data using a parsing model;

FIG. 3 is a schematic flow chart illustrating another method for generating a data collection script according to an embodiment of the present application;

FIG. 4 shows a schematic flow chart of labeling a training data set;

FIG. 5 shows a schematic structural diagram of an apparatus for generating a data acquisition script according to an embodiment of the present application.

Detailed Description

Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which this application belongs.

A method, an apparatus, a computer device, and a storage medium for generating a data acquisition script according to embodiments of the present application are described below with reference to the accompanying drawings.

The embodiment of the application provides a method for generating a data acquisition script, which can be applied to a device for generating the data acquisition script, the device can be a computer device with data processing (such as query, calculation and the like) capability, and can also be a processing module capable of performing data processing on the computer device. As shown in fig. 1, the method may include the steps of:

Step S1, respectively acquiring text information included in each node of the web page data for the web page data of the target site.

A site may be understood as a document on a computer device that is used to store web content. Generally, the document (an HTML document) is structured data and can be regarded as a tree structure comprising parent nodes and child nodes. That is, each tag of the document can be regarded as a node of the tree; a node can be a parent node or a child node, and a parent node generally contains part or all of the information of its child nodes. The text information contained in each node can therefore include all the content under the corresponding tag, and the text information contained in all the nodes constitutes the web page data of the site. Sites can be classified according to their content, such as news sites, advertisement sites, picture sites, video sites and the like, and generally one type of site corresponds to one data acquisition script. The target site may be any site from which data is to be collected.

In this embodiment, all text information included in each node of the web page data can be acquired through functions of querying, searching and the like, so that the data to be acquired is identified subsequently, and a path of the data to be acquired is obtained.
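As a minimal sketch of step S1, the following builds a tag tree with Python's standard `html.parser` module and lists, level by level, the text found under every tag. The `Node` and `TreeBuilder` classes, the `node_texts` helper and the `VOID_TAGS` set are illustrative names chosen here and are not taken from the application.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (hypothetical, incomplete list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """One tag of the HTML document, viewed as a tree node."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.content = []  # strings and child Nodes, in document order

    @property
    def children(self):
        return [c for c in self.content if isinstance(c, Node)]

    def text(self):
        """All text under this tag, including the text of its descendants."""
        parts = [c.text() if isinstance(c, Node) else c for c in self.content]
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.content.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open ancestor with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.content.append(data.strip())


def node_texts(html):
    """Return (tag, text) pairs for every node, in level order."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    result, queue = [], [builder.root]
    while queue:
        node = queue.pop(0)
        if node is not builder.root:
            result.append((node.tag, node.text()))
        queue.extend(node.children)
    return result


page = "<html><body><div><h1>Title</h1><p>Hello <b>world</b></p></div></body></html>"
for tag, text in node_texts(page):
    print(tag, "->", text)
```

A parent node's text contains the text of its children, which matches the observation above that a parent node generally holds part or all of the information of its child nodes.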

Step S2, generating characteristic statistical information of each label according to the multi-dimensional characteristics of various labels in the webpage data.

In this embodiment, with reference to statistics-based methods for extracting XPath, some tag features in the web page data are used as auxiliary features for subsequently identifying the data to be acquired and obtaining its path. For example, in the task of collecting news text there are usually many <p> tags, so the number and density of <p> tags can be used as indicators for identifying the body text. Similarly, advertisement information usually contains many <a> tags. Accordingly, the multi-dimensional features may include at least tag number, tag density, text density, weight information and the like.

Specifically, the multi-dimensional features may form a 39-dimensional feature composed of the counts of 33 kinds of tags and 6 other features (tag density, text density, weight information and the like). The 6 other features may be:

element.density_of_a_text (text density);

element.density_of_punctuation (punctuation density);

log10(element.number_of_p_descendants + 2) (p-tag density);

class_weight (empirical class name weight);

element.number_of_datetime (weight of the number of time texts);

(1 - element.density_of_a_text) * element.density_of_punctuation * np.log10(element.number_of_p_descendants + 2) (product weight of a plurality of features).

It should be noted that, when performing the tag feature statistics, a proper statistical principle may be formulated according to different collection task environments (in general, the collection task environments correspond to the site types one to one), and the present embodiment does not specifically limit the statistical principle.
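The tag statistics of step S2 can be illustrated with a short sketch. It computes a small feature vector for one node: counts of a few tags plus text density, punctuation density, the share of link text and the log-scaled count of `<p>` descendants. The exact definitions, the `COUNTED_TAGS` subset and the `Node` class are assumptions made for illustration; the application only names the kinds of features and leaves the statistical principle open.

```python
import math
from collections import Counter
from dataclasses import dataclass, field

PUNCTUATION = set(",.;:!?，。；：！？")
# Hypothetical subset standing in for the 33 counted tags.
COUNTED_TAGS = ("p", "a", "div", "span", "li")


@dataclass
class Node:
    tag: str
    own_text: str = ""
    children: list = field(default_factory=list)


def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)


def full_text(node):
    return node.own_text + "".join(full_text(c) for c in node.children)


def tag_feature_vector(node):
    """Tag counts followed by four density-style features (assumed definitions)."""
    desc = list(descendants(node))
    counts = Counter(d.tag for d in desc)
    text = full_text(node)
    n_chars = len(text)
    n_punct = sum(ch in PUNCTUATION for ch in text)
    a_chars = sum(len(full_text(d)) for d in desc if d.tag == "a")

    features = [float(counts[t]) for t in COUNTED_TAGS]
    features.append(n_chars / (len(desc) + 1))               # characters per tag
    features.append(n_punct / n_chars if n_chars else 0.0)   # punctuation density
    features.append(a_chars / n_chars if n_chars else 0.0)   # share of link text
    features.append(math.log10(counts["p"] + 2))             # p-tag term
    return features


article = Node("div", "", [
    Node("p", "Hello, world."),
    Node("p", "Second one!"),
    Node("a", "more"),
])
print(tag_feature_vector(article))
```

A node with many `<p>` descendants and little link text scores high on the last feature and low on the link-text share, which is the kind of signal the news-text example above relies on.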

Step S3, analyzing the webpage data through the trained analysis model based on the text information and the characteristic statistical information, identifying the path of the data to be acquired, and generating a data acquisition script of the target site.

The trained analysis model is used for identifying the path of the data to be acquired according to the text information and the characteristic statistical information of the input webpage data and generating a data acquisition script of the target site. In particular, the trained analysis model may be, but is not limited to, a neural network model.

In this embodiment, after the text information and the feature statistical information are obtained, the obtained text information and the feature statistical information may be input into a trained parsing model, and the parsing model may perform a series of data processing on the text information and the feature statistical information, so as to identify a path of data to be collected and generate a data collection script of a target site, so as to reduce labor and time costs consumed by manually parsing XPath, and greatly improve data collection efficiency.
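The last part of step S3, turning an identified path into an acquisition script, might look like the sketch below. The application does not specify the script format, so the template, the `build_xpath` and `generate_script` helpers and the use of `lxml` inside the generated script are assumptions for illustration only.

```python
from string import Template

# Hypothetical script template; the generated script assumes lxml is installed.
SCRIPT_TEMPLATE = Template('''\
import urllib.request
import lxml.html


def collect(url):
    html = urllib.request.urlopen(url).read()
    tree = lxml.html.fromstring(html)
    return [el.text_content().strip() for el in tree.xpath("$xpath")]
''')


def build_xpath(steps):
    """steps: (tag, 1-based position among same-tag siblings), from the root down."""
    return "".join(f"/{tag}[{pos}]" for tag, pos in steps)


def generate_script(steps):
    """Render the source code of an acquisition script for the identified node."""
    return SCRIPT_TEMPLATE.substitute(xpath=build_xpath(steps))


identified_path = [("html", 1), ("body", 1), ("div", 2), ("div", 1)]
print(build_xpath(identified_path))
print(generate_script(identified_path))
```

Because the script is derived entirely from the predicted path, regenerating it after a site redesign only requires running the model again on the new pages.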

It should be noted that, in this embodiment, the scope of the data processing performed by the analysis model is not specifically limited; for example, the process of acquiring the text information from the web page data may also be regarded as a processing function of the analysis model.

In a specific implementation manner of this embodiment, the step S3 may include the following steps: traversing each node of the webpage data level by level, and sequentially forming a text characterization vector corresponding to each node based on the text information of each node; performing convolution and pooling operations on the text characterization vectors in sequence to form new text characterization vectors; forming a tag statistical vector based on the characteristic statistical information of each tag, and splicing the new text characterization vector and the tag statistical vector to obtain a spliced vector; and sequentially connecting the spliced vector with a fully connected layer and an output layer to identify a path of the data to be acquired and generate a data acquisition script of the target site.

In this embodiment, as shown in FIG. 2, the text information of each node of the web page data (including information of its child nodes and parent node) may be obtained by level-by-level traversal, and a text characterization vector of the node is formed based on the text information (i.e., the Embedding process). For example, the text information contained in one node is converted into a text characterization vector $e = [e_1, \ldots, e_n] \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length of the text information and $d$ is the dimension of a word in the text information; the value of $d$ can be set according to the actual situation and may range from tens to hundreds, such as 100 dimensions. The trained parsing model may then perform a convolution operation (i.e., the Convolution process) on the text characterization vector, and the convolution function may be $c = F(v^{T} e_{i:j+h-1})$, where $v \in \mathbb{R}^{f \times d}$ represents a convolution kernel, $f$ represents the window size and can generally take natural numbers such as 2, 3 and 4, and $d$ is the dimension of a word in the text information; $i$, $j$ and $h$ are respectively the total number of nodes, the number of child nodes and the height of the structure tree of the web page data. With $k$ different convolution kernels, the resulting convolution vector is $C = [c_1, \ldots, c_k]$. After the convolution operation, this embodiment performs the pooling operation with a Max-Pooling strategy to obtain the output text characterization vector (Output Vector) $P = [\max(c_1), \ldots, \max(c_k)]$. Meanwhile, a corresponding tag statistical vector is formed based on the characteristic statistical information of each tag.

Next, in this embodiment, the output text characterization vector $P$ obtained through CNN learning is spliced with the tag statistical vector (Feature Vector) $S$ formed in advance from the tag statistical features, giving a new feature vector $R = P \oplus S$ (i.e., the Concatenate Features process), where $\oplus$ is a horizontal splicing operation. This vector contains both the text information and the statistical features of the tags. Finally, a fully connected layer and a softmax output layer are connected to the spliced vector (i.e., the Fully Connected Layer and Output Layer process), namely $L = \tanh(W_f^{T} R + b_f)$ and $O = \mathrm{softmax}(W_o^{T} L + b_o)$, where $\tanh$ and $\mathrm{softmax}$ are both activation functions, $W_f^{T}$, $b_f$, $W_o^{T}$ and $b_o$ are the weights and biases of the two layers, and $R$ is the new feature vector formed after splicing.
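The forward pass described above (embedding matrix, convolution, max pooling, splicing with the tag statistical vector, a tanh fully connected layer and a softmax output) can be traced in a few lines of plain Python. This is only a sketch under stated assumptions: the convolution is the usual sliding-window text convolution, `F` is taken to be tanh, the weights are random rather than trained, and all function names and sizes are illustrative.

```python
import math
import random


def conv_max_pool(embeddings, kernel):
    """Slide one f x d kernel over the n x d text matrix, then max-pool.

    Assumes n >= f, e.g. because texts are padded to a fixed length.
    """
    f = len(kernel)
    best = -math.inf
    for start in range(len(embeddings) - f + 1):
        window = embeddings[start:start + f]
        s = sum(k * x for krow, xrow in zip(kernel, window)
                for k, x in zip(krow, xrow))
        best = max(best, math.tanh(s))  # F is assumed to be tanh here
    return best


def dense(weights, bias, vector):
    """Compute W^T x + b, with the weights stored as one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(embeddings, tag_stats, params):
    pooled = [conv_max_pool(embeddings, k) for k in params["kernels"]]  # P
    spliced = pooled + list(tag_stats)                                  # R = P (+) S
    hidden = [math.tanh(v)
              for v in dense(params["W_f"], params["b_f"], spliced)]    # L
    return softmax(dense(params["W_o"], params["b_o"], hidden))         # O


def random_params(d, window_sizes, n_stats, hidden_size, n_classes, rng):
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    kernels = [mat(f, d) for f in window_sizes]
    return {
        "kernels": kernels,
        "W_f": mat(hidden_size, len(kernels) + n_stats),
        "b_f": [0.0] * hidden_size,
        "W_o": mat(n_classes, hidden_size),
        "b_o": [0.0] * n_classes,
    }


rng = random.Random(0)
params = random_params(d=8, window_sizes=[2, 3, 4], n_stats=5,
                       hidden_size=6, n_classes=3, rng=rng)
text_matrix = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(10)]  # n = 10
probs = forward(text_matrix, [0.2, 0.0, 1.0, 0.5, 0.3], params)
print(probs)  # one probability per node label
```

The splice is a plain list concatenation, so the fully connected layer sees the learned text features and the hand-counted tag statistics side by side.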

It should be noted that the convolution function and the pooling strategy are only preferred embodiments of the present embodiment, and the present embodiment is not limited thereto as long as the convolution and pooling operations can be implemented.

Accordingly, before the trained analysis model is used, a model building (also understood as model training) step is further included, which may include the following process: acquiring parsed information and feature statistical information of parsed web page data based on a plurality of parsed web page data with the same site type, and training and generating the analysis model through the parsed information and the feature statistical information; the parsed information includes at least an XPath and site information.

In this embodiment, as shown in FIG. 3, before the analysis model is used, the parsed information and the feature statistical information of the parsed web page data may be integrated, and the original neural network model may be trained to obtain an analysis model capable of accurately analyzing the web page data. The XPath and the site information correspond to each other and can be used for judging whether an analysis result is correct.

Further, the specific model building process is similar in principle to the analysis of the web page data by the analysis model in step S3. Accordingly, acquiring the parsed information and feature statistical information of the parsed web page data based on a plurality of parsed web page data with the same site type, and training and generating the analysis model through them, may include the following process: traversing, level by level, each node of each parsed web page data with the same site type, acquiring the parsed information of each node, and forming a training data set based on all the parsed information; generating a feature statistical training vector of each tag according to the multi-dimensional features of the various tags in the plurality of parsed web page data with the same site type; and training and generating the analysis model through the training data set and the feature statistical training vectors.

In this embodiment, to obtain correct training data, part of the web page data with the same site type may be parsed manually (i.e., the parsed web page data); each node of each parsed web page data is then traversed level by level to obtain all the parsed information, from which a training data set for the analysis model can be generated. To further improve parsing accuracy, this embodiment uses the statistical features of the tags as an aid, so a feature statistical training vector of each tag is also generated according to the multi-dimensional features of the various tags in the parsed web page data, and the analysis model is generated from the training data set and the feature statistical training vectors.

Specifically, the forming of the training data set based on all the parsed information may include the following processes: and marking the text information corresponding to each node according to whether the required information is contained or not based on all the analyzed information, and generating a training data set based on the marked text information.

In this embodiment, the training of the analytic model is a process of supervised learning, and the training data set may be labeled first according to whether the corresponding node includes the required information, so as to determine whether an analytic result of the analytic model on the training data set is correct, thereby improving the analytic accuracy.

More specifically, the labeling of the text information corresponding to each node according to whether the required information is included may include the following processing: for each node, if the child node of the node contains the required information, performing first labeling on the text information of the child node, and performing second labeling on the text information of the node; if the node does not contain the required information, carrying out third labeling on the text information of the node; and the first label is used to stop the parsing model from traversing the nodes of the web page data.
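The three-way labeling can be sketched as a single recursive pass over the tag tree. The sketch uses 1, 2 and 0 as the first, second and third labels, and assumes that the manually parsed XPath has already been resolved to a flag on the target node; the `Node` class and the `label_tree` and `labels` helpers are illustrative names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    is_target: bool = False  # True for the node holding the required information
    label: int = 0           # 0 until labeled otherwise


def label_tree(node):
    """Label the subtree in place; return True if it holds the target node."""
    if node.is_target:
        node.label = 1  # first label: traversal stops here
        return True
    found = False
    for child in node.children:
        found = label_tree(child) or found  # visit every child, no short-circuit
    node.label = 2 if found else 0  # second label for ancestors, third otherwise
    return found


def labels(node):
    yield node.tag, node.label
    for child in node.children:
        yield from labels(child)


page = Node("body", [
    Node("header"),
    Node("main", [Node("h1"), Node("article", [Node("p")], is_target=True)]),
    Node("footer"),
])
label_tree(page)
print(dict(labels(page)))
```

Nodes below the target keep the default label 0; they are never visited at prediction time because the traversal stops at the first label.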

In this embodiment, as shown in FIG. 4, when each node of the web page data is traversed level by level, the text information corresponding to the node where the required information is located (also referred to as the target node) may be labeled 1 (i.e., the first label on the text information of the child node), the text information corresponding to a node containing that child node (also referred to as a parent node) is labeled 2 (i.e., the second label on the text information of the node), and the text information corresponding to a node that does not contain the required information is labeled 0 (i.e., the third label). When the neural network model then performs prediction, layer-by-layer traversal prediction can start from the nodes directly below the root node (the body tag): if the prediction result is 0, the search of that node stops; if the prediction result is 2, the search continues downward from that node; and if the prediction result is 1, the search stops and that node is returned. The following special-case rules prevent misjudgments caused by accidental errors.

1) If the prediction results of a whole layer are all 0, the search continues to the next layer.

2) If more than one prediction result in the same layer is 2, only the 2 with the highest probability is kept, and all the others are set to 0.

3) If more than one prediction result in the same layer is 1, the node of the previous layer is returned.
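The layer-by-layer prediction and its three special-case rules can be sketched as follows. The `predict` callback stands in for the trained model and returns a `(label, probability)` pair; it, the `Node` class and `find_target` are illustrative names. Where rule 3 applies and the competing nodes have different parents, this sketch returns the parent of the most probable one, which is an assumption the application does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def find_target(root, predict):
    """predict(node) -> (label, probability), with label in {0, 1, 2}."""
    layer = [(child, root) for child in root.children]
    while layer:
        scored = [(node, parent, *predict(node)) for node, parent in layer]
        ones = [s for s in scored if s[2] == 1]
        twos = [s for s in scored if s[2] == 2]
        if len(ones) == 1:
            return ones[0][0]                        # normal case: target found
        if len(ones) > 1:
            return max(ones, key=lambda s: s[3])[1]  # rule 3: return the parent
        if twos:
            keep = max(twos, key=lambda s: s[3])[0]  # rule 2: keep the likeliest 2
            layer = [(c, keep) for c in keep.children]
        else:
            # rule 1: the whole layer is 0, so keep searching the next layer
            layer = [(c, node) for node, _ in layer for c in node.children]
    return None


page = Node("body", [
    Node("header"),
    Node("main", [Node("h1"), Node("article", [Node("p")])]),
    Node("footer"),
])
# Hypothetical model outputs: two nodes in the first layer are predicted as 2.
outputs = {"header": (0, 0.9), "main": (2, 0.8), "footer": (2, 0.3),
           "h1": (0, 0.9), "article": (1, 0.95)}
print(find_target(page, lambda n: outputs[n.name]).name)
```

Only one branch is followed per layer, so the number of model calls grows with the width of the visited layers rather than with the size of the whole page.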

In the model training process, cross entropy can be adopted as the loss function to train and correct the model. If the web page structure of the site changes, the model can simply be run again to identify the new XPath and generate a new acquisition script.

Note that the labels 0, 1 and 2 are only one implementation of this embodiment, and the embodiment is not limited thereto, as long as the three kinds of nodes can be distinguished.
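For reference, the cross-entropy loss over the three node labels can be written out as below. The symbols are chosen here for illustration: $N$ is the number of training samples, $y_{m,c}$ is 1 when sample $m$ carries label $c$ and 0 otherwise, $O_{m,c}$ is the softmax output of the model for that label, and $\mathcal{J}$ is used for the loss so that it is not confused with the hidden layer $L$.

```latex
\mathcal{J} = -\frac{1}{N} \sum_{m=1}^{N} \sum_{c \in \{0, 1, 2\}} y_{m,c} \, \log O_{m,c}
```

Because $y_{m,c}$ is one-hot, each sample contributes only the negative log-probability that the model assigns to its true label.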

In addition, in this embodiment, in order to verify the validity and accuracy of the analysis model, an experiment is performed on the task of extracting bidding text. Specifically, using manually parsed bidding sites (the specific number may be set as required, for example, tens or hundreds), about 200,000 training samples are generated from about 2,000 detail-page HTML documents for model training, and the experimental data consists of 2,000 HTML pages. The parameters included in the feature statistical training vectors applied in the experiment are shown in Table 1 below. The model is not evaluated by the accuracy of the nodes it identifies; instead, each predicted XPath is compared with the corresponding actual XPath to calculate the accuracy.

TABLE 1 Parameters included in the feature statistical training vectors used in the experiment

Parameter	Value
embedding_size (word dimension)	100
seq_length (text length)	200
num_filters (number of convolution kernels)	128
filter_sizes (convolution kernel sizes)	[2,3,4,5]
drop_prob (dropout parameter)	0.5
feature_size (vector dimension of tag statistics)	39
learning_rate (learning rate)	1e-3
batch_size (batch size)	128
num_epochs (number of training rounds)	4

The results of the above experiments are shown in table 2 below.

TABLE 2 Results of the experiment

Features used to train the model	Accuracy
Text features only	78%
Text features + tag statistical features (this embodiment)	93%

As can be seen from Table 2, which compares the recognition effect with and without the tag statistical features on this task, the parsing accuracy of the model trained with text features only is 78%, whereas the model trained jointly on text features and tag statistical features reaches a parsing accuracy of 93%. The tag statistical features are therefore very effective: using them raises the accuracy of the analysis model by 15 percentage points over the baseline without them, and the overall recognition effect of 93% is close to that of manual recognition.
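The evaluation used in this experiment, comparing each predicted XPath with the actual one rather than scoring individual nodes, amounts to the small computation below. Exact string equality is assumed as the matching rule, and the function name and sample paths are made up for illustration; they are not the experimental data.

```python
def xpath_accuracy(predicted, actual):
    """Fraction of pages whose predicted XPath equals the actual XPath."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        return 0.0
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(predicted)


# Made-up sample paths for four pages.
actual = ["/html/body/div[2]", "/html/body/div[3]/div", "/html/body/table", "/html/body/div[2]"]
predicted = ["/html/body/div[2]", "/html/body/div[3]", "/html/body/table", "/html/body/div[2]"]
print(xpath_accuracy(predicted, actual))  # 0.75
```

Scoring at the XPath level counts a page as correct only when the final path matches, so a single wrong step anywhere in the traversal makes the whole page count as a miss.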

According to the method for generating the data acquisition script, according to the text information contained in each node of the webpage data and the characteristic statistical information of each label, the webpage data are analyzed through the trained analysis model, the path of the data to be acquired can be identified, the data acquisition script of the target site is generated, so that the computer equipment can execute the method, and the data acquisition is automatically performed according to the data acquisition script, so that a large amount of manpower and time cost are saved, the efficiency and accuracy of webpage analysis and data acquisition are effectively improved, and the maintenance cost of a later-stage acquisition task is reduced.

Based on the same concept of the above method for generating a data acquisition script, this embodiment further provides an apparatus for generating a data acquisition script, as shown in fig. 5, the apparatus includes:

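the text module is used for respectively acquiring text information contained in each node of the webpage data aiming at the webpage data of the target site;

the label module is used for generating characteristic statistical information of each label according to the multi-dimensional characteristics of various labels in the webpage data;

and the script module is used for analyzing the webpage data through the trained analysis model based on the text information and the characteristic statistical information, identifying the path of the data to be acquired and generating a data acquisition script of the target site.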

The apparatus for generating a data acquisition script provided in this embodiment can execute the method for generating a data acquisition script, so that at least the beneficial effects that can be achieved by the method for generating a data acquisition script can be achieved, and are not described herein again.

Based on the same concept of the method for generating the data acquisition script, the present embodiment further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the method for generating the data acquisition script as described above.

The computer device provided in this embodiment can execute the above method for generating a data acquisition script and therefore achieves at least the beneficial effects of that method, which are not described again here.

Based on the same concept as the above method for generating a data acquisition script, this embodiment further provides a computer-readable storage medium having a computer program stored thereon, where the program, when executed by a processor, implements the method for generating a data acquisition script as described above.

It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.

The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for generating a data acquisition script, wherein the method comprises:

for webpage data of a target site, respectively obtaining the text information contained in each node of the webpage data;

generating feature statistical information of each tag according to multi-dimensional features of the various tags in the webpage data; and

based on the text information and the feature statistical information, parsing the webpage data through a trained parsing model, identifying the path of the data to be acquired, and generating a data acquisition script of the target site.

2. The method according to claim 1, wherein parsing the webpage data through the trained parsing model based on the text information and the feature statistical information, identifying the path of the data to be acquired, and generating the data acquisition script of the target site comprises:

traversing the nodes of the webpage data level by level, and sequentially forming a text representation vector for each node based on the text information of that node;

performing convolution and pooling operations on the text representation vector in turn to form a new text representation vector;

forming a tag statistics vector based on the feature statistical information of each tag, and splicing the new text representation vector with the tag statistics vector to obtain a spliced vector; and

feeding the spliced vector sequentially into a fully connected layer and an output layer to identify the path of the data to be acquired and to generate the data acquisition script of the target site.

3. The method according to claim 1, wherein before parsing the webpage data through the trained parsing model based on the text information and the feature statistical information and generating the data acquisition script, the method further comprises:

obtaining parsed information and feature statistical information of parsed webpage data based on a plurality of pieces of parsed webpage data of the same site type, and training and generating the parsing model based on the parsed information and the feature statistical information, wherein the parsed information includes at least an XPath path and site information.

4. The method according to claim 3, wherein obtaining the parsed information and feature statistical information of the parsed webpage data based on the plurality of pieces of parsed webpage data of the same site type, and training and generating the parsing model based on the parsed information and the feature statistical information, comprises:

traversing, level by level, the nodes of each piece of parsed webpage data of the same site type, obtaining the parsed information of each node of the parsed webpage data, and forming a training data set based on all the parsed information;

generating a feature statistics training vector for each tag according to the multi-dimensional features of the various tags in the parsed webpage data of the same site type; and

training and generating the parsing model using the training data set and the feature statistics training vectors.

5. The method according to claim 4, wherein forming the training data set based on all the parsed information comprises:

based on all the parsed information, labeling the text information corresponding to each node according to whether it contains the required information, and generating the training data set based on the labeled text information.

6. The method according to claim 4, wherein labeling the text information corresponding to each node according to whether it contains the required information comprises:

for each node, if a child node of the node contains the required information, applying a first annotation to the text information of the child node and a second annotation to the text information of the node; and if the node does not contain the required information, applying a third annotation to the text information of the node; wherein the first annotation is used to make the parsing model stop traversing the nodes of the webpage data.

7. The method according to claim 1, wherein the multi-dimensional features include at least tag quantity, tag density, text density, and weight information.

8. A device for generating a data acquisition script, wherein the device comprises:

a text module, configured to respectively obtain, for webpage data of a target site, the text information contained in each node of the webpage data;

a tag module, configured to generate feature statistical information of each tag according to multi-dimensional features of the various tags in the webpage data; and

a script module, configured to parse the webpage data through a trained parsing model based on the text information and the feature statistical information, identify the path of the data to be acquired, and generate a data acquisition script of the target site.

9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the method according to any one of claims 1-7.

10. A computer-readable storage medium on which a computer program is stored, wherein the program is executed by a processor to implement the method according to any one of claims 1-7.