CN113687831A - Method, device, computer equipment and storage medium for generating data acquisition script - Google Patents

Info

Publication number
CN113687831A
Authority
CN
China
Prior art keywords
information
data
node
text
parsed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110770812.4A
Other languages
Chinese (zh)
Inventor
陈家银
潘帅
张伟
陈曦
麻志毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced Institute of Information Technology AIIT of Peking University
Hangzhou Weiming Information Technology Co Ltd
Original Assignee
Advanced Institute of Information Technology AIIT of Peking University
Hangzhou Weiming Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced Institute of Information Technology AIIT of Peking University, Hangzhou Weiming Information Technology Co Ltd filed Critical Advanced Institute of Information Technology AIIT of Peking University
Priority to CN202110770812.4A priority Critical patent/CN113687831A/en
Publication of CN113687831A publication Critical patent/CN113687831A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/42Syntactic analysis
    • G06F8/427Parsing
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application provides a method and a device for generating a data acquisition script, computer equipment and a storage medium, wherein the method for generating the data acquisition script comprises the following steps: respectively acquiring text information contained in each node of the webpage data aiming at the webpage data of the target site; generating characteristic statistical information of each label according to the multi-dimensional characteristics of various labels in the webpage data; analyzing the webpage data through the trained analysis model based on the text information and the characteristic statistical information, identifying the path of the data to be acquired, and generating a data acquisition script of the target site. According to the method and the device, the data acquisition script can be automatically generated through the trained model, a large amount of manpower and material resources are saved, and the webpage data analysis speed is effectively improved.

Description

Method and device for generating data acquisition script, computer equipment and storage medium
Technical Field
The application belongs to the technical field of data acquisition, and particularly relates to a method and a device for generating a data acquisition script, computer equipment and a storage medium.
Background
Currently, in data collection tasks, the required information usually has to be collected from web documents in HTML format. Existing acquisition modes generally fall into two types: template-based acquisition and statistics-based acquisition.
The template-based acquisition mode locates the path of the required content according to the internal structure and content of an HTML document, using open-source parsing tools such as XPath, CSS selectors, and Beautiful Soup. The most common approach is to locate the XPath of the content and then generate an acquisition script from the located XPath for data acquisition. The statistics-based acquisition mode computes certain features of the HTML document (such as the number of tags, the text length within a tag, and the text density within a tag) and fuses them into a discrimination model; when the predicted value for a tag exceeds a threshold, the structure of that tag is taken as the XPath of the required content, and an acquisition script is then generated from that XPath for data acquisition.
However, both acquisition modes have significant drawbacks. The template-based mode requires the XPath of the required content to be analyzed manually, one by one, which consumes a great deal of manpower and time when the number of sites is large. In addition, when the page structure of an analyzed website changes even slightly, the previously located XPath may be affected, which increases maintenance cost and reduces robustness. The discrimination model generated by the statistics-based mode is too simple, so its recognition accuracy is low and it is difficult to adapt to complex acquisition task scenarios.
Disclosure of Invention
The application provides a method and a device for generating a data acquisition script, computer equipment and a storage medium, the data acquisition script can be automatically generated through a trained model, a large amount of manpower and material resources are saved, and the webpage data analysis speed is effectively improved.
An embodiment of a first aspect of the present application provides a method for generating a data acquisition script, where the method includes:
respectively acquiring text information contained in each node of webpage data aiming at the webpage data of a target site;
generating characteristic statistical information of each label according to the multi-dimensional characteristics of various labels in the webpage data;
analyzing the webpage data through a trained analysis model based on the text information and the characteristic statistical information, identifying a path of the data to be acquired, and generating a data acquisition script of the target site.
Optionally, the analyzing the web page data through a trained analysis model based on the text information and the feature statistical information, identifying a path of data to be acquired, and generating a data acquisition script of the target site includes:
traversing each node of the webpage data in a hierarchy mode, and sequentially forming a text representation vector corresponding to each node based on text information of each node;
performing convolution and pooling operation on the text representation vectors in sequence to form new text representation vectors;
forming a label statistical vector based on the characteristic statistical information of each label, and splicing the new text representation vector and the label statistical vector to obtain a spliced vector;
and passing the spliced vector through a fully connected layer and an output layer in sequence to identify a path of the data to be acquired and generate a data acquisition script of the target site.
Optionally, before analyzing the text included in each node through a trained analysis model based on the text information and the feature statistical information and generating a data acquisition script, the method further includes:
obtaining parsed information and feature statistical information of parsed web page data based on a plurality of parsed web page data of the same site type, and training and generating the analysis model from the parsed information and the feature statistical information; the parsed information includes at least an XPath and site information.
Optionally, the acquiring, based on a plurality of parsed web page data with the same site type, parsed information and feature statistical information of the parsed web page data, and training and generating the parsing model through the parsed information and the feature statistical information includes:
traversing, level by level, each node of each parsed web page data of the same site type, obtaining the parsed information of each node of the parsed web page data, and forming a training data set based on all the parsed information;
generating a characteristic statistical training vector of each label according to the multidimensional characteristics of various labels in a plurality of analyzed webpage data with the same site type;
and training and generating the analytical model through the training data set and the feature statistical training vector.
Optionally, the forming a training data set based on all parsed information includes:
and marking the text information corresponding to each node according to whether the required information is contained or not based on all the analyzed information, and generating a training data set based on the marked text information.
Optionally, the labeling the text information corresponding to each node according to whether the text information includes the required information includes:
for each node, if the child node of the node contains the required information, performing first labeling on the text information of the child node, and performing second labeling on the text information of the node; if the node does not contain the required information, carrying out third labeling on the text information of the node; and the first label is used for enabling the analysis model to stop traversing the nodes of the webpage data.
Optionally, the multi-dimensional features include at least a number of labels, a density of text, and weight information.
An embodiment of a second aspect of the present application provides an apparatus for generating a data acquisition script, the apparatus including:
the text module is used for respectively acquiring text information contained in each node of the webpage data aiming at the webpage data of the target site;
the label module is used for generating characteristic statistical information of each label according to the multi-dimensional characteristics of various labels in the webpage data;
and the script module is used for analyzing the webpage data through a trained analysis model based on the text information and the characteristic statistical information, identifying a path of the data to be acquired, and generating a data acquisition script of the target site.
Embodiments of a third aspect of the present application provide a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the computer program to implement the method according to the first aspect.
An embodiment of a fourth aspect of the present application provides a computer-readable storage medium having a computer program stored thereon, the program being executable by a processor to implement the method according to the first aspect.
The technical scheme provided in the embodiment of the application at least has the following technical effects or advantages:
according to the method for generating the data acquisition script, according to the text information contained in each node of the webpage data and the characteristic statistical information of each label, the webpage data are analyzed through the trained analysis model, the path of the data to be acquired can be identified, the data acquisition script of the target site is generated, the method can be executed by computer equipment, data acquisition is automatically carried out according to the data acquisition script, a large amount of manpower and time cost are saved, the efficiency and accuracy of webpage analysis and data acquisition are effectively improved, and the maintenance cost of a later-stage acquisition task is reduced.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings.
In the drawings:
FIG. 1 is a schematic flow chart illustrating a method for generating a data acquisition script according to an embodiment of the present application;
FIG. 2 illustrates a flow diagram for parsing web page data using a parsing model;
FIG. 3 is a schematic flow chart illustrating another method for generating a data collection script according to an embodiment of the present application;
FIG. 4 shows a schematic flow chart of labeling a training data set;
FIG. 5 shows a schematic structural diagram of an apparatus for generating a data acquisition script according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which this application belongs.
A method, an apparatus, a computer device, and a storage medium for generating a data acquisition script according to embodiments of the present application are described below with reference to the accompanying drawings.
The embodiment of the application provides a method for generating a data acquisition script, which can be applied to a device for generating the data acquisition script, the device can be a computer device with data processing (such as query, calculation and the like) capability, and can also be a processing module capable of performing data processing on the computer device. As shown in fig. 1, the method may include the steps of:
step S1, respectively acquiring text information included in each node of the web page data for the web page data of the target site.
A site may be understood as a document on a computer device that stores web content. Such a document (an HTML document) is structured data and can be regarded as a tree of parent nodes and child nodes: each tag of the document is a node of the tree, and a node may be a parent node or a child node. A parent node generally contains part or all of the information of its child nodes, so the text information contained in a node can include all content under the corresponding tag, and the text information contained in all nodes together constitutes the web page data of the site. Sites can be classified by content, for example news sites, advertisement sites, picture sites, and video sites, and one type of site generally corresponds to one data acquisition script. The target site may be any site from which data is to be collected.
In this embodiment, all text information included in each node of the web page data can be acquired through functions of querying, searching and the like, so that the data to be acquired is identified subsequently, and a path of the data to be acquired is obtained.
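As an illustration of step S1, the following minimal sketch builds a tag tree from an HTML string and lists, level by level, the text contained in each node. It uses only Python's standard `html.parser`; the names `Node`, `TreeBuilder`, `node_texts`, `VOID_TAGS` and the sample page are assumptions made for this example and are not taken from the application.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a small, non-exhaustive set).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """One tag of the document; content keeps strings and child nodes in order."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.content = []

    @property
    def children(self):
        return [c for c in self.content if isinstance(c, Node)]

    def text(self):
        """All text under this tag, including the text of its descendants."""
        raw = " ".join(c if isinstance(c, str) else c.text() for c in self.content)
        return " ".join(raw.split())


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.content.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.content.append(data)


def node_texts(html):
    """Return (tag, text) for every node, visited level by level."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    out, queue = [], [builder.root]
    while queue:
        node = queue.pop(0)
        for child in node.children:
            out.append((child.tag, child.text()))
            queue.append(child)
    return out


PAGE = ("<html><body><div><h1>Title</h1><p>First.</p><p>Second.</p></div>"
        "<div><a href='#'>Ad</a></div></body></html>")
texts = node_texts(PAGE)
for tag, text in texts:
    print(tag, "->", text)
```

Because a parent node's text includes the text of its children, the same sentence appears at several depths; this matches the description that a parent node contains part or all of the information of its child nodes.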
Step S2, generating characteristic statistical information of each label according to the multi-dimensional characteristics of various labels in the webpage data.
In this embodiment, following the statistics-based approach to extracting XPath, certain tag features of the web page data are used as auxiliary features for identifying the data to be acquired and obtaining its path. For example, in the task of collecting news body text, the body usually contains many <p> tags, so the number and density of <p> tags can serve as indicators for identifying the body text. Similarly, advertisement information usually contains many <a> tags. Accordingly, the multi-dimensional features may include at least the number of tags, tag density, text density, and weight information.
Specifically, the multi-dimensional features may comprise 39 dimensions: the counts of 33 kinds of tags and 6 other features (tag density, text density, weight information, and the like). The 6 other features may be:
density_of_a_text (text density);
element.density_of_punctuation (punctuation density);
log10(element.number_of_p_descendants + 2) (p-tag density);
class_weight (empirical class name weight);
element.number_of_datetime (weight of the number of time texts);
(1 - element.density_of_a_text) * element.density_of_punctuation * np.log10(element.number_of_p_descendants + 2) (product weight of several features).
It should be noted that, when performing the tag feature statistics, a proper statistical principle may be formulated according to different collection task environments (in general, the collection task environments correspond to the site types one to one), and the present embodiment does not specifically limit the statistical principle.
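The tag statistics of step S2 can be sketched as follows. This is a simplified illustration, not the 39-dimensional vector of the embodiment: it counts a hypothetical subset of six tags (`COUNTED_TAGS`) and computes five density-style features whose exact definitions are assumptions of this example; the class-name weight and the time-text count are omitted.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Hypothetical subset; the embodiment counts 33 kinds of tags.
COUNTED_TAGS = ("p", "a", "div", "span", "li", "img")
PUNCTUATION = set(",.;:!?，。；：！？、")


class FragmentStats(HTMLParser):
    """Collects tag counts and character counts for one HTML fragment."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.text_chars = 0
        self.anchor_chars = 0
        self.punct_chars = 0
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth:
            self._anchor_depth -= 1

    def handle_data(self, data):
        text = "".join(data.split())  # ignore whitespace when counting
        self.text_chars += len(text)
        self.punct_chars += sum(ch in PUNCTUATION for ch in text)
        if self._anchor_depth:
            self.anchor_chars += len(text)


def feature_vector(fragment):
    stats = FragmentStats()
    stats.feed(fragment)
    stats.close()
    n_tags = sum(stats.tag_counts.values()) or 1
    chars = stats.text_chars or 1
    text_density = stats.text_chars / n_tags       # characters per tag
    anchor_ratio = stats.anchor_chars / chars      # share of text inside <a>
    punct_density = stats.punct_chars / chars      # share of punctuation
    p_weight = math.log10(stats.tag_counts["p"] + 2)
    combined = (1 - anchor_ratio) * punct_density * p_weight
    counts = [float(stats.tag_counts[t]) for t in COUNTED_TAGS]
    return counts + [text_density, anchor_ratio, punct_density, p_weight, combined]


FRAGMENT = ("<div><p>Hello, world.</p><p>More text here.</p>"
            "<a href='#'>link</a></div>")
vec = feature_vector(FRAGMENT)
print(vec)
```

A fragment dominated by `<p>` text scores high on the combined feature, while a link-heavy fragment is pushed toward zero by the `(1 - anchor_ratio)` factor, which is the intuition behind using `<p>` and `<a>` statistics as indicators.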
And step S3, analyzing the webpage data through the trained analysis model based on the text information and the characteristic statistical information, identifying the path of the data to be acquired, and generating a data acquisition script of the target site.
The trained parsing model is used to identify the path of the data to be collected from the text information and feature statistical information of the input web page data, and to generate the data collection script of the target site. In particular, the trained parsing model may be, but is not limited to, a neural network model.
In this embodiment, after the text information and the feature statistical information are obtained, the obtained text information and the feature statistical information may be input into a trained parsing model, and the parsing model may perform a series of data processing on the text information and the feature statistical information, so as to identify a path of data to be collected and generate a data collection script of a target site, so as to reduce labor and time costs consumed by manually parsing XPath, and greatly improve data collection efficiency.
It should be noted that, in this embodiment, a specific range of the analysis model processing data is not specifically limited, and for example, it may be understood that the process of acquiring the text information from the web page data may also be a processing function of the analysis model.
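Once a node has been identified as holding the data to be collected, its path can be written as an XPath. The sketch below shows one common way to derive an absolute XPath from a node with parent links, adding a positional index only when siblings share a tag name. The `Node` class and `xpath_of` function are hypothetical helpers for this example; the application does not specify how the path string or the script is assembled.

```python
class Node:
    """Minimal tree node with a parent link (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def xpath_of(node):
    """Build an absolute XPath such as /html/body/div[2]/p for a node."""
    steps = []
    while node is not None:
        parent = node.parent
        step = node.tag
        if parent is not None:
            same_tag = [c for c in parent.children if c.tag == node.tag]
            if len(same_tag) > 1:
                # XPath positions are 1-based among same-named siblings.
                position = next(i for i, c in enumerate(same_tag, 1) if c is node)
                step = f"{node.tag}[{position}]"
        steps.append(step)
        node = parent
    return "/" + "/".join(reversed(steps))


html = Node("html")
body = Node("body", html)
first_div = Node("div", body)
second_div = Node("div", body)
paragraph = Node("p", second_div)

path = xpath_of(paragraph)
print(path)
```

A generated acquisition script would then only need to evaluate this path against each page of the same site type.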
In a specific implementation of this embodiment, step S3 may include the following steps: traversing each node of the web page data level by level, and forming in sequence a text characterization vector for each node based on the text information of that node; performing convolution and pooling operations on the text characterization vectors in sequence to form new text characterization vectors; forming a tag statistical vector based on the feature statistical information of each tag, and splicing the new text characterization vector with the tag statistical vector to obtain a spliced vector; and passing the spliced vector through the fully connected layer and the output layer in sequence to identify the path of the data to be acquired and generate the data acquisition script of the target site.
In this embodiment, as shown in FIG. 2, the text information of each node of the web page data (including information of its child nodes and parent node) may be obtained by hierarchical traversal, and a text characterization vector of the node is formed from the text information (the Embedding process). For example, the text information contained in one node is converted into a text characterization vector $e = [e_1, \ldots, e_n] \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length of the text information and $d$ is the dimension of a word in the text information; the value of $d$ can be set according to the actual situation and may range from tens to hundreds, for example 100. The trained parsing model may then perform a convolution operation on the text characterization vector (the Convolution process). The convolution function may be $c = F(v^{T} e_{i:j+h-1})$, where $v \in \mathbb{R}^{f \times d}$ denotes a convolution kernel, $f$ denotes the window size and generally takes a natural number such as 2, 3, or 4, and $d$ is the dimension of a word in the text information; $i$, $j$ and $h$ are respectively the total number of nodes, the number of child nodes, and the height of the structure tree of the web page data. With $k$ different convolution kernels, the resulting convolution vector is $C = [c_1, \ldots, c_k]$. After the convolution operation, this embodiment performs the pooling operation with a Max-Pooling strategy to obtain the output text characterization vector (Output Vector) $P = [\max(c_1), \ldots, \max(c_k)]$. Meanwhile, a corresponding tag statistical vector is formed from the feature statistical information of each tag.
Next, in this embodiment, the output text characterization vector $P$ learned by the CNN is spliced with the tag statistical vector (Feature Vector) $S$ formed in advance from the tag statistical features, giving a new feature vector $R = P \oplus S$ (the Concatenated Features process), where $\oplus$ is the horizontal splicing (concatenation) operation. This vector contains both the text information and the statistical features of the tags. Finally, the spliced vector is fed to a fully connected layer and a softmax output layer (the Fully Connected Layer and Output Layer processes), that is, $L = \tanh(W_f^{T} R + b_f)$ and $O = \mathrm{softmax}(W_o^{T} L + b_o)$, where $\tanh$ and softmax are activation functions, $W_f$, $b_f$, $W_o$ and $b_o$ are the parameters of the two layers, and $R$ is the new feature vector formed by the splicing.
It should be noted that the convolution function and the pooling strategy are only preferred embodiments of the present embodiment, and the present embodiment is not limited thereto as long as the convolution and pooling operations can be implemented.
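To make the data flow concrete, the following pure-Python sketch runs one forward pass through the stages described above: convolution over the word vectors, max-pooling, splicing with the tag statistical vector, a tanh fully connected layer, and a softmax output. All sizes, the random weights, and the choice of tanh as the convolution activation F are assumptions of this example; a practical model would be trained with a deep learning framework.

```python
import math
import random


def conv_feature(e, v, bias=0.0):
    """Slide kernel v (f x d) over embeddings e (n x d); return one feature map."""
    f = len(v)
    out = []
    for i in range(len(e) - f + 1):
        s = bias
        for r in range(f):
            s += sum(a * b for a, b in zip(v[r], e[i + r]))
        out.append(math.tanh(s))  # F is assumed to be tanh here
    return out


def dense(x, W, b):
    """Compute W^T x + b, where W has shape len(x) x len(b)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]


def softmax(z):
    m = max(z)
    exps = [math.exp(t - m) for t in z]
    total = sum(exps)
    return [t / total for t in exps]


def forward(e, kernels, S, Wf, bf, Wo, bo):
    C = [conv_feature(e, v) for v in kernels]   # convolution, one map per kernel
    P = [max(c) for c in C]                     # max-pooling
    R = P + list(S)                             # R = P (+) S, horizontal splicing
    L = [math.tanh(t) for t in dense(R, Wf, bf)]
    return softmax(dense(L, Wo, bo))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, d, s_dim, hidden, classes = 6, 4, 5, 8, 3     # toy sizes, not the experiment's
e = rand_matrix(n, d, rng)                       # n word vectors of dimension d
kernels = [rand_matrix(f, d, rng) for f in (2, 3, 4)]
S = [rng.random() for _ in range(s_dim)]         # tag statistical vector
Wf, bf = rand_matrix(len(kernels) + s_dim, hidden, rng), [0.0] * hidden
Wo, bo = rand_matrix(hidden, classes, rng), [0.0] * classes

O = forward(e, kernels, S, Wf, bf, Wo, bo)
print(O)
```

The three output probabilities correspond to the three node labels used during training, so the largest entry of `O` is the predicted label for the node.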
Accordingly, before the trained parsing model is used, a model building (which may also be understood as model training) step is included, which may comprise the following process: based on a plurality of parsed web page data of the same site type, obtaining the parsed information and feature statistical information of the parsed web page data, and training and generating the parsing model from the parsed information and the feature statistical information. The parsed information includes at least the XPath and site information.
In this embodiment, as shown in FIG. 3, before the parsing model is used, the parsed information and feature statistical information of the parsed web page data may be combined to train an initial neural network model, yielding a parsing model that can accurately parse web page data. The XPath and the site information correspond to each other and can be used to judge whether a parsing result is correct.
Further, the model building process follows the same principle as step S3, in which the web page data is parsed with the parsing model. Accordingly, obtaining the parsed information and feature statistical information of the parsed web page data based on a plurality of parsed web page data of the same site type, and training and generating the parsing model from them, may comprise the following process: traversing level by level each node of each parsed web page data of the same site type, obtaining the parsed information of each node, and forming a training data set from all the parsed information; generating a feature statistical training vector for each tag according to the multi-dimensional features of the various tags in the plurality of parsed web page data of the same site type; and training and generating the parsing model from the training data set and the feature statistical training vectors.
In this embodiment, to obtain correct training data, a part of the web page data (i.e., parsed web page data) with the same site type may be manually parsed, then, each node of each parsed web page data is traversed by a hierarchy, all parsed information is obtained, and a training data set of the parsing model may be generated based on the parsed information. In order to further improve the accuracy of parsing, the present embodiment is to use the statistical features of the labels for assistance, so that a feature statistical training vector of each label is also generated according to the multidimensional features of various labels in the parsed web page data, so as to generate the parsing model through the training data sets and the feature statistical training vector.
Specifically, the forming of the training data set based on all the parsed information may include the following processes: and marking the text information corresponding to each node according to whether the required information is contained or not based on all the analyzed information, and generating a training data set based on the marked text information.
In this embodiment, the training of the analytic model is a process of supervised learning, and the training data set may be labeled first according to whether the corresponding node includes the required information, so as to determine whether an analytic result of the analytic model on the training data set is correct, thereby improving the analytic accuracy.
More specifically, the labeling of the text information corresponding to each node according to whether the required information is included may include the following processing: for each node, if the child node of the node contains the required information, performing first labeling on the text information of the child node, and performing second labeling on the text information of the node; if the node does not contain the required information, carrying out third labeling on the text information of the node; and the first label is used to stop the parsing model from traversing the nodes of the web page data.
In this embodiment, as shown in FIG. 4, when the nodes of the web page data are traversed level by level, the text information of the node where the required information is located (the target node) may be labeled 1 (the first label, applied to the child node), the text information of a node that contains that child node (the parent node) may be labeled 2 (the second label), and the text information of a node that does not contain the required information may be labeled 0 (the third label). When the neural network model makes predictions, it can traverse layer by layer starting from the nodes directly below the root node (the body tag): if the prediction for a node is 0, the search below that node stops; if the prediction is 2, the search continues downward from that node; and if the prediction is 1, the search stops and that node is returned. The following special-case rules guard against misjudgments caused by occasional prediction errors.
1) If every prediction in a layer is 0, the search continues with the next layer.
2) If more than one node in the same layer is predicted as 2, only the 2 with the highest probability is kept, and all other predictions are set to 0.
3) If more than one node in the same layer is predicted as 1, the node of the previous layer is returned.
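The layer-by-layer search and its three special-case rules can be sketched as follows. The `Node` class, the `locate` function, and the lookup-table predictor are hypothetical stand-ins: in practice `predict` would be the trained model returning probabilities for labels 0, 1 and 2. Returning the parent of the most probable candidate under rule 3 is one interpretation of "the node of the previous layer".

```python
class Node:
    """Minimal tree node; children receive a parent link (hypothetical helper)."""

    def __init__(self, name, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def argmax(probs):
    return max(range(len(probs)), key=lambda c: probs[c])


def locate(root, predict):
    """Search below root; predict(node) returns (p0, p1, p2)."""
    layer = list(root.children)
    while layer:
        scored = [(node, predict(node)) for node in layer]
        ones = [(n, p) for n, p in scored if argmax(p) == 1]
        twos = [(n, p) for n, p in scored if argmax(p) == 2]
        if len(ones) == 1:
            return ones[0][0]                      # target node found
        if len(ones) > 1:
            # Rule 3: several targets in one layer, fall back to the parent.
            best = max(ones, key=lambda item: item[1][1])[0]
            return best.parent
        if twos:
            # Rule 2: keep only the most probable 2 and descend into it.
            best = max(twos, key=lambda item: item[1][2])[0]
            layer = list(best.children)
        else:
            # Rule 1: the whole layer is 0, continue with the next layer.
            layer = [child for node in layer for child in node.children]
    return None


content = Node("content", [Node("p1"), Node("p2")])
main = Node("main", [Node("title"), content])
body = Node("body", [Node("nav"), main])

PREDICTIONS = {
    "nav": (0.90, 0.05, 0.05),
    "main": (0.10, 0.10, 0.80),
    "title": (0.80, 0.10, 0.10),
    "content": (0.10, 0.80, 0.10),
}


def lookup(table):
    """Turn a name-to-probabilities table into a predict function."""
    return lambda node: table.get(node.name, (1.0, 0.0, 0.0))


found = locate(body, lookup(PREDICTIONS))
print(found.name if found else None)
```

Because the search descends into at most one node per layer, the number of model calls grows with the breadth of the visited layers rather than with the size of the whole document.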
During model training, cross entropy can be used as the loss function. If the web page structure of a site changes, the model can simply be run once more to identify the new XPath and generate a new acquisition script.
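For reference, the cross-entropy loss for a single node is the negative log of the probability the model assigns to the node's true label. The short sketch below computes it for one sample and averages it over a batch; the function names and the sample probabilities are illustrative assumptions.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability of the true label (clamped to avoid log(0))."""
    return -math.log(max(probs[label], 1e-12))


def mean_loss(batch):
    """Average loss over (probabilities, true_label) pairs."""
    return sum(cross_entropy(p, y) for p, y in batch) / len(batch)


batch = [
    ((0.25, 0.50, 0.25), 1),   # true label 1 predicted with probability 0.5
    ((0.50, 0.25, 0.25), 0),   # true label 0 predicted with probability 0.5
]
loss = mean_loss(batch)
print(loss)
```

A confident correct prediction contributes a loss near zero, while a confident wrong prediction is penalized heavily, which is what drives the parameter updates during training.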
Note that the labels 0, 1, and 2 are only one implementation of this embodiment; the embodiment is not limited thereto, as long as the text information of the three kinds of nodes can be distinguished.
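The labeling scheme described above can be sketched as a level-by-level pass over a manually parsed page in which the target node is already known. The dictionary-based tree, the helper names, and the decisions to label every ancestor of the target as 2 and to skip nodes below the target are assumptions of this example.

```python
from collections import deque


def make(tag, text="", children=()):
    """Build a small dictionary-based tree node (hypothetical helper)."""
    return {"tag": tag, "text": text, "children": list(children)}


def contains(node, target):
    """True if target is node itself or one of its descendants."""
    return node is target or any(contains(c, target) for c in node["children"])


def label_samples(root, target):
    """Return (text, label) pairs: 1 = target, 2 = contains target, 0 = neither."""
    samples, queue = [], deque(root["children"])
    while queue:
        node = queue.popleft()
        if node is target:
            samples.append((node["text"], 1))
            continue                    # traversal stops at the target node
        label = 2 if contains(node, target) else 0
        samples.append((node["text"], label))
        queue.extend(node["children"])
    return samples


content = make("div", "Body text", [make("p", "Body text")])
main = make("div", "Title Body text", [make("h1", "Title"), content])
nav = make("div", "Home", [make("a", "Home")])
body = make("body", "Home Title Body text", [nav, main])

samples = label_samples(body, content)
for text, label in samples:
    print(label, text)
```

In a real pipeline the target node would be found by evaluating the manually located XPath against the page, and each text would be tokenized before training.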
In addition, to verify the validity and accuracy of the parsing model, this embodiment carries out an experiment on the task of extracting bidding text. Specifically, using manually parsed bidding sites (the number may be set as required, for example tens or hundreds), about 200,000 training samples are generated from about 2,000 detail-page HTML documents for model training; the experimental data consists of 2,000 HTML pages, and the parameters included in the feature statistical training vectors used in the experiment are shown in Table 1 below. The model is evaluated not by the accuracy of the nodes it identifies, but by comparing each predicted XPath with the corresponding actual XPath to calculate the accuracy.
TABLE 1 Parameters included in the feature statistics training vectors used in the experiment
Parameter (meaning)    Value
embedding_size (word embedding dimension)    100
seq_length (text length)    200
num_filters (number of convolution kernels)    128
filter_sizes (convolution kernel sizes)    [2, 3, 4, 5]
drop_prob (dropout probability)    0.5
feature_size (dimension of the tag statistics vector)    39
learning_rate (learning rate)    1e-3
batch_size (batch size)    128
num_epochs (number of training epochs)    4
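To make the role of these parameters concrete, the sketch below collects them in a configuration object, using the parameter names of Table 1 as plain Python identifiers, and derives two sizes from them. The class `TrainingConfig` and its methods are hypothetical. The derived sizes assume stride-1 convolutions without padding and max-over-time pooling that keeps one value per convolution kernel, which is a common text-CNN design but is not stated in the source.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TrainingConfig:
    """Values taken from Table 1."""
    embedding_size: int = 100
    seq_length: int = 200
    num_filters: int = 128
    filter_sizes: Tuple[int, ...] = (2, 3, 4, 5)
    drop_prob: float = 0.5
    feature_size: int = 39
    learning_rate: float = 1e-3
    batch_size: int = 128
    num_epochs: int = 4

    def conv_output_lengths(self) -> Dict[int, int]:
        """Sequence length after each kernel size (assumes stride 1, no padding)."""
        return {k: self.seq_length - k + 1 for k in self.filter_sizes}

    def pooled_text_dim(self) -> int:
        """Text vector size if pooling keeps one value per kernel (assumption)."""
        return self.num_filters * len(self.filter_sizes)

    def concat_dim(self) -> int:
        """Size of the text vector concatenated with the tag statistics vector."""
        return self.pooled_text_dim() + self.feature_size


cfg = TrainingConfig()
concat_dim = cfg.concat_dim()  # 128 * 4 + 39
```

Under these assumptions the fully connected layer would receive a 551-dimensional input, of which 39 dimensions come from the tag statistics.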
The results of the above experiments are shown in table 2 below.
TABLE 2 Experimental results
Features used to train the model    Accuracy
Text features only    78%
Text features + tag statistical features (this embodiment)    93%
As Table 2 shows, comparing recognition with and without the tag statistical features on this task, the analysis model trained with text features alone reaches an accuracy of 78%, whereas the analysis model trained jointly on text features and tag statistical features reaches 93%. The tag statistical features are therefore highly effective: they raise the accuracy of the analysis model by 15 percentage points over the baseline without them, bringing the overall recognition accuracy to 93%, which is close to that of manual recognition.
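The evaluation used in the experiment, which compares each predicted XPath with the corresponding actual XPath, can be sketched as follows. The function name `xpath_accuracy` and the sample paths are hypothetical, and exact string equality after trimming whitespace is an assumed matching rule; the source does not say how two paths are compared.

```python
from typing import Sequence


def xpath_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of pages whose predicted XPath equals the actual XPath."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    matches = sum(p.strip() == a.strip() for p, a in zip(predicted, actual))
    return matches / len(actual)


# Illustrative paths only; they are not taken from the experiment.
predicted_paths = ["/html/body/div[2]/div[1]", "/html/body/div[3]", "/html/body/table"]
actual_paths = ["/html/body/div[2]/div[1]", "/html/body/div[4]", "/html/body/table"]
accuracy = xpath_accuracy(predicted_paths, actual_paths)
```

Scoring whole paths rather than individual node predictions means a page counts as correct only when the generated script would extract from the right location.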
With the method for generating a data acquisition script provided in this embodiment, the web page data is analyzed by the trained analysis model according to the text information contained in each node and the feature statistical information of each tag, so that the path of the data to be acquired is identified and a data acquisition script for the target site is generated. Computer equipment can then collect data automatically by executing the script, which saves considerable labor and time, improves the efficiency and accuracy of web page analysis and data acquisition, and reduces the maintenance cost of later acquisition tasks.
Based on the same concept as the above method for generating a data acquisition script, this embodiment further provides an apparatus for generating a data acquisition script. As shown in fig. 5, the apparatus includes:
a text module, configured to obtain, for the web page data of the target site, the text information contained in each node of the web page data;
a tag module, configured to generate feature statistical information for each tag according to the multi-dimensional features of the various tags in the web page data; and
a script module, configured to analyze the web page data through the trained analysis model based on the text information and the feature statistical information, identify the path of the data to be acquired, and generate a data acquisition script for the target site.
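A rough sketch of what the text module and the tag module might compute is given below. Everything in it is an assumption made for illustration: the function names, the use of `xml.etree.ElementTree` on well-formed markup, and in particular the definitions of tag density (share of all nodes) and text density (share of all text characters), which the source names but does not define. The weight information and the script module are omitted.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Dict


def node_texts(root: ET.Element) -> Dict[ET.Element, str]:
    """Text module sketch: whitespace-normalized text contained in each node."""
    return {node: " ".join("".join(node.itertext()).split()) for node in root.iter()}


def tag_statistics(root: ET.Element) -> Dict[str, Dict[str, float]]:
    """Tag module sketch: per-tag count, tag density and text density (assumed definitions)."""
    nodes = list(root.iter())
    total_chars = len("".join(root.itertext())) or 1  # avoid division by zero
    counts = Counter(node.tag for node in nodes)
    chars: Counter = Counter()
    for node in nodes:
        # Text under nested nodes is counted once for every enclosing node's tag.
        chars[node.tag] += len("".join(node.itertext()))
    return {
        tag: {
            "count": counts[tag],
            "tag_density": counts[tag] / len(nodes),
            "text_density": chars[tag] / total_chars,
        }
        for tag in counts
    }


page = ET.fromstring("<body><div><p>ab</p><p>cd</p></div></body>")
texts = node_texts(page)
stats = tag_statistics(page)
```

The per-tag dictionaries would then be flattened into the fixed-length tag statistics vector that is concatenated with the text representation.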
The apparatus for generating a data acquisition script provided in this embodiment can execute the above method for generating a data acquisition script and therefore achieves at least the beneficial effects of that method, which are not repeated here.
Based on the same concept of the method for generating the data acquisition script, the present embodiment further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the method for generating the data acquisition script as described above.
The computer device provided in this embodiment can execute the above method for generating a data acquisition script and therefore achieves at least the beneficial effects of that method, which are not repeated here.
Based on the same concept as the above-described method of generating a data acquisition script, the present embodiment also provides a computer-readable storage medium having a computer program stored thereon, the program being executed by a processor to implement the method of generating a data acquisition script as described above.
The program stored on the computer-readable storage medium provided in this embodiment can be executed to carry out the above method for generating a data acquisition script, so the storage medium achieves at least the beneficial effects of that method, which are not repeated here.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering. These words may be interpreted as names.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for generating a data acquisition script, wherein the method comprises:
for web page data of a target site, separately obtaining the text information contained in each node of the web page data;
generating feature statistical information for each tag according to multi-dimensional features of the various tags in the web page data;
based on the text information and the feature statistical information, parsing the web page data through a trained parsing model, identifying the path of the data to be collected, and generating a data acquisition script of the target site.

2. The method according to claim 1, wherein parsing the web page data through the trained parsing model based on the text information and the feature statistical information, identifying the path of the data to be collected, and generating the data acquisition script of the target site comprises:
traversing each node of the web page data level by level, and forming in turn a text representation vector corresponding to each node based on the text information of that node;
performing convolution and pooling operations on the text representation vector in turn to form a new text representation vector;
forming a tag statistics vector based on the feature statistical information of each tag, and concatenating the new text representation vector with the tag statistics vector to obtain a concatenated vector;
passing the concatenated vector through a fully connected layer and an output layer in turn to identify the path of the data to be collected and generate the data acquisition script of the target site.

3. The method according to claim 1, wherein, before parsing the text contained in each node through the trained parsing model based on the text information and the feature statistical information and generating the data acquisition script, the method further comprises:
based on a plurality of parsed web page data of the same site type, obtaining parsed information and feature statistical information of the parsed web page data, and training and generating the parsing model from the parsed information and the feature statistical information; the parsed information comprises at least an XPath path and site information.

4. The method according to claim 3, wherein obtaining the parsed information and feature statistical information of the parsed web page data based on the plurality of parsed web page data of the same site type, and training and generating the parsing model from the parsed information and the feature statistical information, comprises:
traversing level by level each node of each parsed web page data of the same site type, obtaining the parsed information of each node of the parsed web page data, and forming a training data set based on all the parsed information;
generating a feature statistics training vector for each tag according to the multi-dimensional features of the various tags in the plurality of parsed web page data of the same site type;
training and generating the parsing model from the training data set and the feature statistics training vectors.

5. The method according to claim 4, wherein forming the training data set based on all the parsed information comprises:
based on all the parsed information, labeling the text information corresponding to each node according to whether it contains the required information, and generating the training data set based on the labeled text information.

6. The method according to claim 4, wherein labeling the text information corresponding to each node according to whether it contains the required information comprises:
for each node, if a child node of the node contains the required information, applying a first label to the text information of the child node and a second label to the text information of the node; if the node does not contain the required information, applying a third label to the text information of the node; and the first label is used to make the parsing model stop traversing the nodes of the web page data.

7. The method according to claim 1, wherein the multi-dimensional features comprise at least tag quantity, tag density, text density, and weight information.

8. An apparatus for generating a data acquisition script, wherein the apparatus comprises:
a text module, configured to separately obtain, for web page data of a target site, the text information contained in each node of the web page data;
a tag module, configured to generate feature statistical information for each tag according to multi-dimensional features of the various tags in the web page data;
a script module, configured to parse the web page data through a trained parsing model based on the text information and the feature statistical information, identify the path of the data to be collected, and generate a data acquisition script of the target site.

9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor runs the computer program to implement the method according to any one of claims 1-7.

10. A computer-readable storage medium on which a computer program is stored, wherein the program is executed by a processor to implement the method according to any one of claims 1-7.
CN202110770812.4A 2021-07-07 2021-07-07 Method, device, computer equipment and storage medium for generating data acquisition script Pending CN113687831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110770812.4A CN113687831A (en) 2021-07-07 2021-07-07 Method, device, computer equipment and storage medium for generating data acquisition script

Publications (1)

Publication Number Publication Date
CN113687831A true CN113687831A (en) 2021-11-23

Family

ID=78576776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110770812.4A Pending CN113687831A (en) 2021-07-07 2021-07-07 Method, device, computer equipment and storage medium for generating data acquisition script

Country Status (1)

Country Link
CN (1) CN113687831A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117373225A (en) * 2023-11-14 2024-01-09 南京新联电子股份有限公司 Energy data acquisition method
CN120783363A (en) * 2025-09-11 2025-10-14 中国铁塔股份有限公司安徽省分公司 Bid information screening method, system, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294781A (en) * 2013-05-14 2013-09-11 百度在线网络技术(北京)有限公司 Method and equipment used for processing page data
CN109471937A (en) * 2018-10-11 2019-03-15 平安科技(深圳)有限公司 A text classification method and terminal device based on machine learning
US20200160177A1 (en) * 2018-11-16 2020-05-21 Royal Bank Of Canada System and method for a convolutional neural network for multi-label classification with partial annotations
CN111581476A (en) * 2020-04-28 2020-08-25 深圳合纵数据科技有限公司 Intelligent webpage information extraction method based on BERT and LSTM
CN111625702A (en) * 2020-05-26 2020-09-04 北京墨云科技有限公司 Page structure recognition and extraction method based on deep learning
CN112287272A (en) * 2020-10-27 2021-01-29 中国科学院计算技术研究所 Method, system and storage medium for classifying website list pages
CN112732994A (en) * 2021-01-07 2021-04-30 上海携宁计算机科技股份有限公司 Method, device and equipment for extracting webpage information and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211123