CN113687831A - Method, device, computer equipment and storage medium for generating data acquisition script - Google Patents
- Publication number
- CN113687831A (application CN202110770812.4A)
- Authority
- CN
- China
- Prior art keywords
- information
- data
- node
- text
- parsed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/42—Syntactic analysis
- G06F8/427—Parsing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/906—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The application provides a method and a device for generating a data acquisition script, computer equipment and a storage medium, wherein the method for generating the data acquisition script comprises the following steps: respectively acquiring text information contained in each node of the webpage data aiming at the webpage data of the target site; generating characteristic statistical information of each label according to the multi-dimensional characteristics of various labels in the webpage data; analyzing the webpage data through the trained analysis model based on the text information and the characteristic statistical information, identifying the path of the data to be acquired, and generating a data acquisition script of the target site. According to the method and the device, the data acquisition script can be automatically generated through the trained model, a large amount of manpower and material resources are saved, and the webpage data analysis speed is effectively improved.
Description
Technical Field
The application belongs to the technical field of data acquisition, and particularly relates to a method and a device for generating a data acquisition script, computer equipment and a storage medium.
Background
Currently, in data collection tasks, it is usually necessary to collect the required information content from web documents in HTML format. Existing acquisition modes generally fall into two types: a template-based acquisition mode and a statistics-based acquisition mode.
The template-based acquisition mode locates the path of the required content according to the internal structure and content of the HTML document, using open-source parsing tools such as XPath, CSS selectors, Beautiful Soup and the like. The most common approach is to locate the XPath of the content and then generate an acquisition script from the located XPath for data acquisition. The statistics-based acquisition mode counts certain features of the HTML document (such as the number of tags, the text length within a tag, and the text density within a tag) and fuses them into a discrimination model. When the predicted value of a tag exceeds a threshold, the structure of that tag is taken as the XPath of the required content, and an acquisition script is then generated from this XPath for data acquisition.
However, both acquisition modes have significant disadvantages. The template-based acquisition mode requires the XPath of the required content to be analyzed manually, site by site, which consumes a great deal of manpower and time when the number of sites is large. In addition, when the web page structure of an analyzed site changes even slightly, the originally located XPath may be affected, which increases maintenance cost and results in poor robustness. The discrimination model generated by the statistics-based acquisition mode is too simple, so its identification accuracy is low and it is difficult to adapt to complex acquisition task scenarios.
Disclosure of Invention
The application provides a method and a device for generating a data acquisition script, computer equipment and a storage medium, the data acquisition script can be automatically generated through a trained model, a large amount of manpower and material resources are saved, and the webpage data analysis speed is effectively improved.
An embodiment of a first aspect of the present application provides a method for generating a data acquisition script, where the method includes:
respectively acquiring text information contained in each node of webpage data aiming at the webpage data of a target site;
generating characteristic statistical information of each label according to the multi-dimensional characteristics of various labels in the webpage data;
analyzing the webpage data through a trained analysis model based on the text information and the characteristic statistical information, identifying a path of the data to be acquired, and generating a data acquisition script of the target site.
Optionally, the analyzing the web page data through a trained analysis model based on the text information and the feature statistical information, identifying a path of data to be acquired, and generating a data acquisition script of the target site includes:
traversing each node of the webpage data in a hierarchy mode, and sequentially forming a text representation vector corresponding to each node based on text information of each node;
performing convolution and pooling operation on the text representation vectors in sequence to form new text representation vectors;
forming a label statistical vector based on the characteristic statistical information of each label, and splicing the new text representation vector and the label statistical vector to obtain a spliced vector;
and sequentially connecting the splicing vector with a full connection layer and an output layer to identify a path of data to be acquired and generate a data acquisition script of the target site.
Optionally, before analyzing the text included in each node through a trained analysis model based on the text information and the feature statistical information and generating a data acquisition script, the method further includes:
acquiring parsed information and feature statistical information of parsed web page data based on a plurality of pieces of parsed web page data of the same site type, and training and generating the analysis model using the parsed information and the feature statistical information; the parsed information includes at least an XPath path and site information.
Optionally, the acquiring, based on a plurality of parsed web page data with the same site type, parsed information and feature statistical information of the parsed web page data, and training and generating the parsing model through the parsed information and the feature statistical information includes:
traversing, level by level, each node of each piece of parsed web page data of the same site type, acquiring the parsed information of each node of the parsed web page data, and forming a training data set based on all the parsed information;
generating a characteristic statistical training vector of each label according to the multidimensional characteristics of various labels in a plurality of analyzed webpage data with the same site type;
and training and generating the analytical model through the training data set and the feature statistical training vector.
Optionally, the forming a training data set based on all parsed information includes:
and marking the text information corresponding to each node according to whether the required information is contained or not based on all the analyzed information, and generating a training data set based on the marked text information.
Optionally, the labeling the text information corresponding to each node according to whether the text information includes the required information includes:
for each node, if the child node of the node contains the required information, performing first labeling on the text information of the child node, and performing second labeling on the text information of the node; if the node does not contain the required information, carrying out third labeling on the text information of the node; and the first label is used for enabling the analysis model to stop traversing the nodes of the webpage data.
Optionally, the multi-dimensional features include at least a number of labels, a density of text, and weight information.
An embodiment of a second aspect of the present application provides an apparatus for generating a data acquisition script, the apparatus including:
the text module is used for respectively acquiring text information contained in each node of the webpage data aiming at the webpage data of the target site;
the label module is used for generating characteristic statistical information of each label according to the multi-dimensional characteristics of various labels in the webpage data;
and the script module is used for analyzing the webpage data through a trained analysis model based on the text information and the characteristic statistical information, identifying a path of the data to be acquired, and generating a data acquisition script of the target site.
Embodiments of a third aspect of the present application provide a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the computer program to implement the method according to the first aspect.
An embodiment of a fourth aspect of the present application provides a computer-readable storage medium having a computer program stored thereon, the program being executable by a processor to implement the method according to the first aspect.
The technical scheme provided in the embodiment of the application at least has the following technical effects or advantages:
According to the method for generating a data acquisition script provided by the embodiments of the present application, the web page data is analyzed by the trained analysis model according to the text information contained in each node of the web page data and the feature statistical information of each label, so that the path of the data to be acquired can be identified and the data acquisition script of the target site can be generated. The method can be executed by computer equipment, and data acquisition is then performed automatically according to the data acquisition script. This saves a large amount of manpower and time, effectively improves the efficiency and accuracy of web page analysis and data acquisition, and reduces the maintenance cost of later acquisition tasks.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings.
In the drawings:
FIG. 1 is a schematic flow chart illustrating a method for generating a data acquisition script according to an embodiment of the present application;
FIG. 2 illustrates a flow diagram for parsing web page data using a parsing model;
FIG. 3 is a schematic flow chart illustrating another method for generating a data collection script according to an embodiment of the present application;
FIG. 4 shows a schematic flow chart of labeling a training data set;
FIG. 5 shows a schematic structural diagram of an apparatus for generating a data acquisition script according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which this application belongs.
A method, an apparatus, a computer device, and a storage medium for generating a data acquisition script according to embodiments of the present application are described below with reference to the accompanying drawings.
The embodiment of the application provides a method for generating a data acquisition script, which can be applied to a device for generating the data acquisition script, the device can be a computer device with data processing (such as query, calculation and the like) capability, and can also be a processing module capable of performing data processing on the computer device. As shown in fig. 1, the method may include the steps of:
step S1, respectively acquiring text information included in each node of the web page data for the web page data of the target site.
A site may be understood as a document on a computer device that is used to store web content. Generally, this document (an HTML document) is structured data and can be regarded as a tree structure made up of parent nodes and child nodes: each tag of the document can be regarded as a node of the tree, and a node may be a parent node or a child node. A parent node generally contains part or all of the information of its child nodes, so the text information contained in each node can include all the content under its tag, and the text information contained in all the nodes constitutes the web page data of the site. Sites can be classified according to their content, such as news sites, advertisement sites, picture sites, video sites, and the like, and generally one type of site corresponds to one data acquisition script. The target site may be any site from which data is to be collected.
In this embodiment, all text information included in each node of the web page data can be acquired through functions of querying, searching and the like, so that the data to be acquired is identified subsequently, and a path of the data to be acquired is obtained.
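Step S1 can be illustrated with a minimal sketch, assuming Python's standard `html.parser` module is used to build the tag tree. The names `Node`, `TreeBuilder`, `parse_html` and `iter_nodes` are hypothetical helpers introduced for illustration only; the application does not prescribe a particular parser.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (a small, illustrative subset).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """One tag of the HTML document, viewed as a tree node (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.content = []  # strings and child Nodes, kept in document order

    @property
    def children(self):
        return [item for item in self.content if isinstance(item, Node)]

    def text(self):
        """All text under this tag, including the text of its descendants."""
        parts = [item if isinstance(item, str) else item.text() for item in self.content]
        return " ".join(part for part in parts if part)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.content.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open ancestor with this tag name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.content.append(data.strip())


def parse_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def iter_nodes(node):
    """Yield the node and all of its descendants, parents before children."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


page = parse_html("<body><div><h1>Title</h1><p>Hello <a>link</a> world.</p></div><p>Footer</p></body>")
for node in iter_nodes(page):
    print(node.tag, "->", node.text())
```

Because each node's text includes the text of its descendants, a parent node carries the information of its children, which matches the tree view of the document described above.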
Step S2, generating characteristic statistical information of each label according to the multi-dimensional characteristics of various labels in the webpage data.
In this embodiment, following statistics-based methods for extracting XPath, certain tag features of the web page data are used as auxiliary features for subsequently identifying the data to be acquired and obtaining its path. For example, in the task of collecting news text, the body text usually contains many <p> tags, so the number and density of <p> tags can be used as indicators for identifying the body text. Similarly, advertisement information usually contains many <a> tags. Accordingly, the multi-dimensional features may include at least the number of labels, label density, text density, weight information, and the like.
Specifically, the multidimensional feature may include 39 dimensional features composed of the number of 33 various tags and 6 other features (tag density, text density, weight information, and the like). Wherein 6 other features may be:
density_of_a_text (text density);
density_of_punctuation (punctuation density);
log10(element.number_of_p_descendants + 2) (p-tag density);
class_weight (empirical class name weight);
element.number_of_datetime (weight of the number of time texts);
(1 - element.density_of_a_text) * element.density_of_punctuation * np.log10(element.number_of_p_descendants + 2) (product weight of a plurality of features).
It should be noted that, when performing the tag feature statistics, a proper statistical principle may be formulated according to different collection task environments (in general, the collection task environments correspond to the site types one to one), and the present embodiment does not specifically limit the statistical principle.
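As a rough illustration of how such a label statistical vector might be assembled, the following sketch counts a few tags and computes some of the density-style features. Everything here is an assumption made for illustration: the function name `tag_features`, the short `COUNTED_TAGS` list (standing in for the 33 counted tags), and the exact definitions of the densities are not taken from the application.

```python
import math
from collections import Counter

# Hypothetical stand-in for the 33 counted tags.
COUNTED_TAGS = ["p", "a", "div", "span", "li", "img"]
PUNCTUATION = set(",.;:!?，。；：！？")


def tag_features(descendant_tags, text, anchor_text):
    """Build a small statistical vector for one node.

    descendant_tags: tag names of all descendants of the node
    text:            all text under the node
    anchor_text:     the part of that text that sits inside <a> tags
    """
    counts = Counter(descendant_tags)
    tag_counts = [counts[tag] for tag in COUNTED_TAGS]

    n_chars = len(text)
    # Assumed definition: share of the text that belongs to <a> tags.
    density_of_a_text = len(anchor_text) / n_chars if n_chars else 0.0
    # Assumed definition: punctuation marks per character of text.
    density_of_punctuation = (
        sum(ch in PUNCTUATION for ch in text) / n_chars if n_chars else 0.0
    )
    # p-tag density, using the log10(count + 2) form listed above.
    p_density = math.log10(counts["p"] + 2)
    # Product of several features, mirroring the last item in the list above.
    combined = (1 - density_of_a_text) * density_of_punctuation * p_density

    return tag_counts + [density_of_a_text, density_of_punctuation, p_density, combined]


print(tag_features(["p", "p", "a", "div"], "Hello, world. Link", "Link"))
```

The `+ 2` inside the logarithm keeps the p-tag density strictly positive even when a node has no `<p>` descendants, so the product feature is not forced to zero by that factor.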
And step S3, analyzing the webpage data through the trained analysis model based on the text information and the characteristic statistical information, identifying the path of the data to be acquired, and generating a data acquisition script of the target site.
The trained analytical model is used for identifying the path of the data to be collected according to the text information and the characteristic statistical information of the input webpage data and generating a data collection script of the target site. In particular, the trained analytical model may be, but is not limited to, a neural network model.
In this embodiment, after the text information and the feature statistical information are obtained, the obtained text information and the feature statistical information may be input into a trained parsing model, and the parsing model may perform a series of data processing on the text information and the feature statistical information, so as to identify a path of data to be collected and generate a data collection script of a target site, so as to reduce labor and time costs consumed by manually parsing XPath, and greatly improve data collection efficiency.
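The last part of step S3, turning an identified path into a data acquisition script, might look like the following sketch. It assumes the located node is described by a list of `(tag, position)` steps from the root, and that the generated script relies on the third-party `lxml` library; `build_xpath`, `generate_script` and `SCRIPT_TEMPLATE` are hypothetical names, and the application does not specify the form of the generated script.

```python
def build_xpath(steps):
    """Join (tag, 1-based position among same-tag siblings) pairs into an XPath."""
    return "/" + "/".join(f"{tag}[{position}]" for tag, position in steps)


# Text of the generated script; it is only written out here, not executed.
SCRIPT_TEMPLATE = '''\
import urllib.request
from lxml import html  # third-party parser, assumed to be installed

XPATH = {xpath!r}


def collect(url):
    """Download one page of the target site and return the located text."""
    page = urllib.request.urlopen(url).read()
    tree = html.fromstring(page)
    return [node.text_content().strip() for node in tree.xpath(XPATH)]
'''


def generate_script(xpath):
    """Fill the identified XPath into the script template."""
    return SCRIPT_TEMPLATE.format(xpath=xpath)


xpath = build_xpath([("html", 1), ("body", 1), ("div", 2), ("p", 1)])
print(generate_script(xpath))
```

Keeping the XPath as a single constant in the generated script means that, if the page structure changes, only that one value needs to be regenerated.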
It should be noted that, in this embodiment, a specific range of the analysis model processing data is not specifically limited, and for example, it may be understood that the process of acquiring the text information from the web page data may also be a processing function of the analysis model.
In a specific implementation manner of this embodiment, the step S3 may include the following steps: traversing each node of the webpage data level by level, and sequentially forming a text characterization vector corresponding to each node based on the text information of each node; performing convolution and pooling operations on the text characterization vectors in sequence to form new text characterization vectors; forming a tag statistical vector based on the characteristic statistical information of each tag, and splicing the new text characterization vector and the tag statistical vector to obtain a spliced vector; and sequentially connecting the spliced vector with the full connection layer and the output layer to identify a path of the data to be acquired and generate a data acquisition script of the target site.
In this embodiment, as shown in FIG. 2, the text information of each node of the web page data (including the information of its child nodes and parent node) may be obtained by hierarchical traversal, and a text characterization vector of the node is formed from the text information (i.e., the Embedding process). For example, the text information contained in one node is converted into a text characterization vector $e = [e_1, \ldots, e_n] \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length of the text information and $d$ is the dimension of a word in the text information. The value of $d$ can be set according to the actual situation and may range from tens to hundreds, such as 100 dimensions. The trained parsing model may then perform a convolution operation on the text characterization vector (i.e., the Convolution process), and the convolution function may be $c = F(v^{T} e_{i:j+h-1})$, where $v \in \mathbb{R}^{f \times d}$ denotes a convolution kernel, $f$ denotes the window size and can generally take natural numbers such as 2, 3 and 4, and $d$ is the dimension of a word in the text information; $i$, $j$ and $h$ are respectively the total number of nodes, the number of child nodes, and the height of the structure tree of the web page data. With $k$ different convolution kernels, the resulting convolution vector is $C = [c_1, \ldots, c_k]$. After the convolution operation, this embodiment performs the pooling operation using a Max-Pooling strategy to obtain the output text characterization vector (Output Vector) $P = [\max(c_1), \ldots, \max(c_k)]$. Meanwhile, a corresponding label statistical vector is formed based on the feature statistical information of each label.
Next, in this embodiment, the output text characterization vector $P$ obtained through CNN learning is spliced with the label statistical vector (Feature Vector) $S$ formed in advance from the label statistical features to form a new feature vector $R = P \oplus S$ (i.e., the Concatenated Features process), where $\oplus$ is the horizontal splicing operation. This vector contains both the text information and the statistical features of the labels. Finally, a fully connected layer and a softmax output layer are connected to the spliced vector (i.e., the Fully Connected Layer and Output Layer process), namely $L = \tanh(W_f^{T} R + b_f)$ and $O = \mathrm{softmax}(W_o^{T} L + b_o)$, where tanh and softmax are both activation functions, $W_f$ and $b_f$ are the weight and bias of the fully connected layer, $W_o$ and $b_o$ are the weight and bias of the output layer, and $R$ is the new feature vector formed after splicing.
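The forward pass described above (convolution, max pooling, splicing with the label statistical vector, fully connected layer, softmax output) can be traced in a small pure-Python sketch. It is not the trained model: the weights are random, the sizes are tiny, the activation F is assumed to be tanh, and the helper names (`conv_max_pool`, `dense`, `forward`) are hypothetical.

```python
import math
import random


def conv_max_pool(e, kernel):
    """Slide one f x d kernel over the n x d text matrix, then max-pool the outputs."""
    f, d = len(kernel), len(kernel[0])
    outputs = []
    for start in range(len(e) - f + 1):
        window = e[start:start + f]
        total = sum(kernel[r][c] * window[r][c] for r in range(f) for c in range(d))
        outputs.append(math.tanh(total))  # F is assumed to be tanh in this sketch
    return max(outputs)


def dense(weights, bias, x):
    """Affine layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(z):
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def forward(e, kernels, s, w_f, b_f, w_o, b_o):
    p = [conv_max_pool(e, kernel) for kernel in kernels]   # pooled text vector P
    r = p + list(s)                                        # spliced vector R
    hidden = [math.tanh(v) for v in dense(w_f, b_f, r)]    # fully connected layer
    return softmax(dense(w_o, b_o, hidden))                # output layer


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


n, d, hidden_size, num_classes = 6, 4, 5, 3
e = rand_matrix(n, d)                               # embedded text of one node
kernels = [rand_matrix(f, d) for f in (2, 3, 4)]    # one kernel per window size
s = [0.2, 0.1, 0.6]                                 # toy label statistical vector
w_f, b_f = rand_matrix(hidden_size, len(kernels) + len(s)), [0.0] * hidden_size
w_o, b_o = rand_matrix(num_classes, hidden_size), [0.0] * num_classes

probs = forward(e, kernels, s, w_f, b_f, w_o, b_o)
print(probs)
```

The three output probabilities correspond to the three node labels used later in the description; in practice a deep learning framework would replace these hand-written loops.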
It should be noted that the convolution function and the pooling strategy are only preferred embodiments of the present embodiment, and the present embodiment is not limited thereto as long as the convolution and pooling operations can be implemented.
Accordingly, before the trained analysis model is used, a model building (which can also be understood as model training) step is further included, which may include the following process: acquiring parsed information and feature statistical information of parsed web page data based on a plurality of pieces of parsed web page data of the same site type, and training and generating the analysis model using the parsed information and the feature statistical information; the parsed information includes at least an XPath path and site information.
In this embodiment, as shown in FIG. 3, before the analysis model is used, the parsed information and the feature statistical information of the parsed web page data may be integrated and used to train an original neural network model, so as to obtain an analysis model capable of accurately analyzing web page data. The XPath path and the site information correspond to each other and can be used to judge whether an analysis result is correct.
Further, the model building process is similar in principle to the analysis of web page data by the analysis model in step S3. Accordingly, acquiring the parsed information and feature statistical information of the parsed web page data based on a plurality of pieces of parsed web page data of the same site type, and training and generating the analysis model using them, may include the following process: traversing, level by level, each node of each piece of parsed web page data of the same site type, acquiring the parsed information of each node, and forming a training data set based on all the parsed information; generating a feature statistical training vector of each label according to the multi-dimensional features of the various labels in the plurality of pieces of parsed web page data of the same site type; and training and generating the analysis model using the training data set and the feature statistical training vector.
In this embodiment, to obtain correct training data, part of the web page data of the same site type may be parsed manually (yielding the parsed web page data). Each node of each piece of parsed web page data is then traversed level by level to obtain all the parsed information, from which a training data set for the analysis model may be generated. To further improve analysis accuracy, this embodiment uses the statistical features of the labels as an aid, so a feature statistical training vector of each label is also generated according to the multi-dimensional features of the various labels in the parsed web page data, and the analysis model is generated from the training data set and the feature statistical training vector.
Specifically, the forming of the training data set based on all the parsed information may include the following processes: and marking the text information corresponding to each node according to whether the required information is contained or not based on all the analyzed information, and generating a training data set based on the marked text information.
In this embodiment, the training of the analytic model is a process of supervised learning, and the training data set may be labeled first according to whether the corresponding node includes the required information, so as to determine whether an analytic result of the analytic model on the training data set is correct, thereby improving the analytic accuracy.
More specifically, the labeling of the text information corresponding to each node according to whether the required information is included may include the following processing: for each node, if the child node of the node contains the required information, performing first labeling on the text information of the child node, and performing second labeling on the text information of the node; if the node does not contain the required information, carrying out third labeling on the text information of the node; and the first label is used to stop the parsing model from traversing the nodes of the web page data.
In this embodiment, as shown in FIG. 4, when each node of the web page data is traversed level by level, the text information corresponding to the node where the required information is located (also referred to as the target node) may be labeled 1 (i.e., the first labeling, applied to the text information of the child node), the text information corresponding to a node that contains this child node (also referred to as the parent node) is labeled 2 (i.e., the second labeling), and the text information corresponding to a node that does not contain the required information is labeled 0 (i.e., the third labeling). Then, when the neural network model performs prediction, the layer-by-layer traversal can start from the nodes directly below the root node (the body tag): if the prediction result is 0, the search of that node stops; if the prediction result is 2, the search continues downward from that node; and if the prediction result is 1, the search stops and that node is returned. The following special-case rules guard against misjudgments caused by occasional errors:
1) If the prediction results of an entire layer are all 0, the search continues to the next layer.
2) If more than one prediction result in the same layer is 2, only the 2 with the highest probability is kept, and all the other prediction results are set to 0.
3) If more than one prediction result in the same layer is 1, the node of the previous layer is returned.
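The layer-by-layer search and its three special cases can be sketched as follows. The model is replaced by a stub `predict` callable that returns a `(label, probability)` pair per node; `Node` and `locate` are hypothetical names. Two details are assumptions of this sketch rather than statements of the application: when a whole layer is predicted 0, the next layer is taken to be the children of every node in that layer, and when several nodes are predicted 1, the parent of the most probable one is returned.

```python
class Node:
    """Minimal tree node used only for this illustration."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def locate(root, predict):
    """Search level by level below root and return the node holding the required data."""
    layer = [(child, root) for child in root.children]  # (node, parent) pairs
    while layer:
        results = [(node, parent) + tuple(predict(node)) for node, parent in layer]
        ones = [r for r in results if r[2] == 1]
        twos = [r for r in results if r[2] == 2]
        if len(ones) == 1:
            return ones[0][0]                           # target node found
        if len(ones) > 1:
            # Special case 3: several 1s, fall back to the previous layer.
            return max(ones, key=lambda r: r[3])[1]
        if twos:
            # Special case 2: keep only the most probable 2 and descend into it.
            best = max(twos, key=lambda r: r[3])[0]
            layer = [(child, best) for child in best.children]
        else:
            # Special case 1: the whole layer is 0, continue with the next layer.
            layer = [(child, node) for node, _ in layer for child in node.children]
    return None


tree = Node("body", [
    Node("nav"),
    Node("main", [Node("header"), Node("article")]),
    Node("aside", [Node("ads")]),
])
scores = {
    "nav": (0, 0.9), "main": (2, 0.9), "aside": (2, 0.4),
    "header": (0, 0.8), "article": (1, 0.8), "ads": (1, 0.9),
}
found = locate(tree, lambda node: scores[node.name])
print(found.name)
```

In this toy tree both `main` and `aside` are predicted 2, so only the more probable `main` branch is followed and the misleading `ads` node is never visited.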
In the model training process, cross entropy can be adopted as the loss function to train and update the model. If the web page structure of the site changes, the model can simply be run again to identify the new XPath and generate a new acquisition script.
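For reference, the cross-entropy loss for one node is the negative log of the probability the model assigns to the node's true label. A minimal sketch, with hypothetical function names and a small clamp added only to avoid taking the logarithm of zero:

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability that the model assigns to the true label."""
    return -math.log(max(probs[label], 1e-12))


def mean_cross_entropy(batch_probs, labels):
    """Average loss over a batch of softmax outputs and their true labels."""
    losses = [cross_entropy(probs, label) for probs, label in zip(batch_probs, labels)]
    return sum(losses) / len(losses)


# Two nodes with true labels 1 and 0; the label set {0, 1, 2} gives three classes.
batch = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]]
print(mean_cross_entropy(batch, [1, 0]))
```

The loss shrinks toward zero as the probability of the true label approaches 1, which is what gradient-based training drives the model toward.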
Note that, the above labels 0, 1, and 2 are only one implementation of the present embodiment, and the present embodiment is not limited thereto, as long as the text information of three nodes can be identified.
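The labeling scheme described above can be sketched as a small recursive function, assuming the manually parsed target node is already known. `Node`, `contains` and `label_tree` are hypothetical names. The sketch labels every ancestor of the target as 2 and does not emit samples below the target, since the search stops once label 1 is reached; both choices are assumptions made for illustration.

```python
class Node:
    """Minimal tree node carrying the text of one tag (illustration only)."""

    def __init__(self, text, children=()):
        self.text = text
        self.children = list(children)


def contains(node, target):
    """True if target is the node itself or one of its descendants."""
    return node is target or any(contains(child, target) for child in node.children)


def label_tree(node, target):
    """Return (text, label) training samples for the subtree rooted at node."""
    if node is target:
        return [(node.text, 1)]          # first labeling: the target node
    label = 2 if contains(node, target) else 0
    samples = [(node.text, label)]       # second or third labeling
    for child in node.children:
        samples += label_tree(child, target)
    return samples


article = Node("story", [Node("p1")])
root = Node("all", [
    Node("menu", [Node("home")]),
    Node("main", [Node("head"), article]),
])
for text, label in label_tree(root, article):
    print(label, text)
```

Calling `contains` at every node makes this quadratic in the worst case, which is acceptable for a sketch but would be replaced by a single bottom-up pass on large pages.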
In addition, in this embodiment, in order to verify the validity and accuracy of the analysis model, an experiment was performed on the task of extracting bidding text. Specifically, using manually parsed bidding sites (the specific number may be set as required, for example, tens or hundreds), a training data set of about 200,000 samples was generated from about 2,000 detail-page HTML documents for model training; the experimental data is 2,000 HTML pages, and the parameters included in the feature statistical training vectors applied in the experiment are shown in Table 1 below. The model is not evaluated by the accuracy of the nodes it identifies; instead, the predicted XPath path is compared with the corresponding actual XPath path to calculate the accuracy.
TABLE 1 Parameters included in the feature statistical training vectors used in the experiments
| Parameter | Value |
| embedding_size (word dimension) | 100 |
| seq_length (text length) | 200 |
| num_filters (number of convolution kernels) | 128 |
| filter_sizes (convolution kernel sizes) | [2,3,4,5] |
| drop_prob (dropout parameter) | 0.5 |
| feature_size (dimension of the label statistical vector) | 39 |
| learning_rate (learning rate) | 1e-3 |
| batch_size (batch size) | 128 |
| num_epochs (number of training rounds) | 4 |
The results of the above experiments are shown in table 2 below.
TABLE 2 results of the experiment
| Data set for use in training a model | Rate of accuracy |
| Using only text features | 78% |
| Text feature + tag statistical features (this example) | 93% |
As can be seen from Table 2, which compares the recognition effect with and without the label statistical features on this task, the analysis model trained with text features alone reaches an analysis accuracy of 78%, whereas the analysis model trained jointly on text features and label statistical features reaches an analysis accuracy of 93%. The label statistical features are therefore very effective: using them improves the accuracy of the analysis model by 15 percentage points over the baseline without label statistical features, and the overall recognition effect reaches 93%, which is close to that of manual recognition.
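The evaluation described for this experiment, comparing each predicted XPath with the actual XPath of the same page, amounts to an exact-match accuracy. A minimal sketch with a hypothetical function name and made-up example paths (not data from the experiment):

```python
def xpath_accuracy(predicted, actual):
    """Fraction of pages whose predicted XPath equals the actual XPath."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and of equal length")
    correct = sum(p.strip() == a.strip() for p, a in zip(predicted, actual))
    return correct / len(actual)


# Illustrative paths only.
predicted = ["/html/body/div[2]", "/html/body/div[1]/p", "/html/body/table", "/html/body/div[3]"]
actual = ["/html/body/div[2]", "/html/body/div[1]/p", "/html/body/div[4]", "/html/body/div[3]"]
print(xpath_accuracy(predicted, actual))
```

Exact string matching is a strict criterion: a prediction that points at a neighbouring node counts as wrong even if most of the path agrees.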
With the method for generating a data acquisition script described above, the webpage data are analyzed by the trained analysis model on the basis of the text information contained in each node of the webpage data and the feature statistical information of each tag, so that the path of the data to be acquired is identified and a data acquisition script for the target site is generated. Computer equipment can then execute the data acquisition script to acquire data automatically. This saves considerable labor and time, effectively improves the efficiency and accuracy of webpage analysis and data acquisition, and reduces the cost of maintaining acquisition tasks at a later stage.
Based on the same concept as the above method for generating a data acquisition script, this embodiment further provides an apparatus for generating a data acquisition script. As shown in FIG. 5, the apparatus includes:
a text module, configured to acquire, for the webpage data of a target site, the text information contained in each node of the webpage data;
a tag module, configured to generate feature statistical information for each tag according to the multi-dimensional features of the various tags in the webpage data; and
a script module, configured to analyze the webpage data with the trained analysis model based on the text information and the feature statistical information, identify the path of the data to be acquired, and generate a data acquisition script for the target site.
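The division of work among the three modules listed above can be sketched in Python using only the standard library. Everything named here is hypothetical: `NodeCollector`, the per-tag occurrence count used as the only statistic, the `keyword_model` stand-in for the trained analysis model, and the JSON layout of the generated script are assumptions for illustration, and the sketch expects well-formed HTML in which every element is explicitly closed.

```python
import json
from collections import Counter
from html.parser import HTMLParser


class NodeCollector(HTMLParser):
    """Records an XPath-like path and the text of each node, plus tag counts."""

    def __init__(self):
        super().__init__()
        self.stack = []              # path steps of the currently open elements
        self.siblings = [Counter()]  # per-level counters for positional indexes
        self.texts = {}              # node path -> text contained in the node
        self.tag_counts = Counter()  # tag name -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        self.siblings[-1][tag] += 1
        self.stack.append(f"{tag}[{self.siblings[-1][tag]}]")
        self.siblings.append(Counter())

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.siblings.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            path = "/" + "/".join(self.stack)
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def text_module(page):
    """Text module: text contained in each node, keyed by node path."""
    return dict(page.texts)


def tag_module(page):
    """Tag module: one toy statistic per tag (its occurrence count)."""
    return dict(page.tag_counts)


def keyword_model(texts, tag_stats):
    """Stand-in for the trained analysis model; it ignores tag_stats and
    simply picks the node whose text starts with a known keyword."""
    keywords = {"budget": "Budget", "deadline": "Deadline"}
    return {field: path
            for field, word in keywords.items()
            for path, text in texts.items()
            if text.startswith(word)}


def script_module(html, model, site):
    """Script module: run the model and emit the identified paths as JSON."""
    page = NodeCollector()
    page.feed(html)
    page.close()
    fields = model(text_module(page), tag_module(page))
    return json.dumps({"site": site, "fields": fields}, indent=2)


html = ("<html><body><h1>Bid notice</h1><div><p>Budget: 100</p>"
        "<p>Deadline: 2021-07-07</p></div></body></html>")
print(script_module(html, keyword_model, "example.invalid"))
```

The model is passed in as a plain callable so that the text and tag modules can be exercised without a trained network; a real analysis model would consume both inputs rather than the text alone.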
The apparatus for generating a data acquisition script provided in this embodiment can execute the above method for generating a data acquisition script and therefore achieves at least the beneficial effects of that method, which are not repeated here.
Based on the same concept of the method for generating the data acquisition script, the present embodiment further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the method for generating the data acquisition script as described above.
The computer device provided in this embodiment can execute the above method for generating a data acquisition script and therefore achieves at least the beneficial effects of that method, which are not repeated here.
Based on the same concept as the above-described method of generating a data acquisition script, the present embodiment also provides a computer-readable storage medium having a computer program stored thereon, the program being executed by a processor to implement the method of generating a data acquisition script as described above.
The computer-readable storage medium provided in this embodiment stores a program that, when executed, carries out the above method for generating a data acquisition script, and therefore achieves at least the beneficial effects of that method, which are not repeated here.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110770812.4A CN113687831A (en) | 2021-07-07 | 2021-07-07 | Method, device, computer equipment and storage medium for generating data acquisition script |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110770812.4A CN113687831A (en) | 2021-07-07 | 2021-07-07 | Method, device, computer equipment and storage medium for generating data acquisition script |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN113687831A true CN113687831A (en) | 2021-11-23 |
Family
ID=78576776
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110770812.4A Pending CN113687831A (en) | 2021-07-07 | 2021-07-07 | Method, device, computer equipment and storage medium for generating data acquisition script |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113687831A (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117373225A (en) * | 2023-11-14 | 2024-01-09 | 南京新联电子股份有限公司 | Energy data acquisition method |
| CN120783363A (en) * | 2025-09-11 | 2025-10-14 | 中国铁塔股份有限公司安徽省分公司 | Bid information screening method, system, equipment and storage medium |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103294781A (en) * | 2013-05-14 | 2013-09-11 | 百度在线网络技术(北京)有限公司 | Method and equipment used for processing page data |
| CN109471937A (en) * | 2018-10-11 | 2019-03-15 | 平安科技(深圳)有限公司 | A text classification method and terminal device based on machine learning |
| US20200160177A1 (en) * | 2018-11-16 | 2020-05-21 | Royal Bank Of Canada | System and method for a convolutional neural network for multi-label classification with partial annotations |
| CN111581476A (en) * | 2020-04-28 | 2020-08-25 | 深圳合纵数据科技有限公司 | Intelligent webpage information extraction method based on BERT and LSTM |
| CN111625702A (en) * | 2020-05-26 | 2020-09-04 | 北京墨云科技有限公司 | Page structure recognition and extraction method based on deep learning |
| CN112287272A (en) * | 2020-10-27 | 2021-01-29 | 中国科学院计算技术研究所 | Method, system and storage medium for classifying website list pages |
| CN112732994A (en) * | 2021-01-07 | 2021-04-30 | 上海携宁计算机科技股份有限公司 | Method, device and equipment for extracting webpage information and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20211123 |