CN114385893A - Webpage category judgment method and device based on node extraction and terminal equipment

Info

Publication number
CN114385893A
Authority
CN
China
Prior art keywords
content
value
webpage
web page
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111570549.0A
Other languages
Chinese (zh)
Other versions
CN114385893B (en)
Inventor
黄治军
谢铨
柯家宁
梁秀霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Southern New Media Technology Co ltd
Original Assignee
Guangdong Southern New Media Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Southern New Media Technology Co ltd
Priority to CN202111570549.0A
Publication of CN114385893A
Application granted
Publication of CN114385893B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides a web page category determination method, device and terminal device based on node extraction. Web page information is extracted at a preset interval. From the acquired information, the PR values of the two extractions, the first parameter of the two extractions, and the similarity between the two extractions are calculated and then combined by a weighted calculation to obtain a score for each web page, from which the web page categories are distinguished. Distinguishing the categories reduces the number of times the web crawler extracts list pages and therefore reduces resource consumption. It also leaves the system more memory for analyzing and extracting the content of text pages, which improves the accuracy of text extraction.

Description

Webpage category judgment method and device based on node extraction and terminal equipment
Technical Field
The invention relates to the field of information technology service, in particular to a webpage category judgment method and device based on node extraction and a terminal device.
Background
In a highly informatized era, each person's social network, consumption records and movement trails generate a large amount of data. By collecting, integrating and analyzing this data, people can understand things more clearly and make more accurate decisions. Data can generally be organized according to a user's requirements, and valuable information and viewpoints can be derived from it, which improves the efficiency of problem solving. For web page data, content is acquired, screened, sorted and analyzed around a selected theme to obtain more accurate content, so web crawlers play an important role in acquiring, collecting and analyzing web page data.
In the prior art, developers build their own media collection system for storing network media data during daily data acquisition and analysis, so that data mining can be carried out over a wider range and at greater depth. Such a system is divided into data acquisition, document analysis, data streaming and data retrieval. However, the document analysis function cannot distinguish the types of web pages, so each analysis contains repeated parts and resource consumption increases; and because list pages and text pages cannot be distinguished, the accuracy of web page content analysis is low.
Therefore, a directory node extraction method is needed in the data processing system to solve the prior-art problems of unnecessary resource consumption and low content analysis accuracy.
Disclosure of Invention
The embodiment of the invention provides a method and a device for judging webpage categories based on node extraction and a terminal device, which can improve the accuracy of webpage category distinguishing.
In order to solve the above problem, an embodiment of the present invention provides a method for determining a webpage category based on node extraction, comprising:
extracting a plurality of webpage information, and acquiring first content and second content of each webpage according to the plurality of webpage information; the acquisition time nodes of the first content and the second content are different;
respectively calculating the similarity between the first content and the second content in each webpage;
respectively calculating the PageRank value of each first content and the PageRank value of each second content to obtain a first PR value and a second PR value corresponding to each webpage, and calculating a first parameter of an external link node in each webpage; wherein the first parameter indicates whether the external link points back to the webpage itself;
according to the first PR value, the second PR value, the first parameter and the similarity corresponding to each webpage, combining a preset weighting algorithm to obtain the score of each webpage, and according to the scores of all the webpages, distinguishing the webpage categories of each webpage; the web page category comprises a list page and a text page.
As an improvement of the above scheme, the extracting the information of the multiple web pages and acquiring the first content and the second content of each web page according to the information of the multiple web pages specifically includes:
extracting width data, height data and a plurality of node data of each webpage according to the plurality of webpage information; each node data comprises position data, label name data and text content data of one node;
calculating the position data of the central point according to the width data and the height data acquired from each webpage;
calculating distance data from the node to the central point according to the position data of the node acquired in each webpage;
determining a first node according to the distance data in each webpage, and acquiring text content of each webpage through the first node as the first content of each webpage; after a first preset time interval, re-extracting the width data, the height data and the node data of the multiple webpages to obtain the second content of each webpage; wherein the text content comprises: a first text content and a second text content.
As an improvement of the above scheme, the determining a first node according to the distance data in each web page and acquiring the text content of each web page through the first node specifically includes:
selecting the node with the minimum distance data in each webpage as a first node;
if the tag name data of the first node is a paragraph element, merging the text content data of all nodes with tag names of the paragraph elements to obtain a first text content of each webpage;
if the tag name of the first node is not a paragraph element, determining a central area according to the position data of the central point, and combining text content data of the nodes in the central area to obtain second text content of each webpage.
As an improvement of the above scheme, the determining a center region according to the position data of the center point, and combining text contents of nodes in the center region specifically include:
determining a rectangular central area at the position of the central point; wherein the rectangular central area is delimited according to the golden ratio, with r = 0.382, and the formula is:
[Formula image not reproduced in the text: bounds of the rectangular central area in terms of centerX, centerY, width, height and r]
where (centerX, centerY) is the position data of the central point, width is the width data of the page, height is the height data of the page, and $X_i$ and $Y_i$ are respectively the width data and height data of the central area.
As an improvement of the above scheme, the calculating the similarity between the first content and the second content in each web page respectively specifically includes:
vectorizing the first content and the second content by TFIDF;
performing similarity calculation on the vectorized first content and second content to obtain a first similarity, wherein the calculation formula is:
[Formula image not reproduced in the text: the first similarity $Sim(p_i)$ computed from $CT_0(p_i)$ and $CT(p_i)$]
where $Sim(p_i)$ is the first similarity, $CT_0(p_i)$ is the vectorized first content, and $CT(p_i)$ is the vectorized second content;
processing the first similarity to obtain a second similarity, wherein the calculation formula is:
$$Sim'(p_i) = 1 - Sim(p_i)$$
where $Sim'(p_i)$ is the second similarity and $Sim(p_i)$ is the first similarity; the second similarity is used as the similarity between the first content and the second content.
As an improvement of the above scheme, the step of calculating the PageRank value of each first content and the PageRank value of each second content respectively to obtain a first PR value and a second PR value corresponding to each web page specifically includes:
calculating the PageRank value of the first content to obtain a third PR value, and calculating the PageRank value of the second content to obtain a fourth PR value;
respectively normalizing the third PR value and the fourth PR value to obtain the first PR value and the second PR value, wherein the normalization formulas are:
$$PR'_o(p_i) = \frac{PR_o(p_i) - \min(PR_o)}{\max(PR_o) - \min(PR_o)}$$
$$PR'(p_i) = \frac{PR(p_i) - \min(PR)}{\max(PR) - \min(PR)}$$
where $PR'_o(p_i)$ is the first PR value, $PR'(p_i)$ is the second PR value, $PR_o(p_i)$ is the third PR value, and $PR(p_i)$ is the fourth PR value; $\max(PR)$ and $\min(PR)$ are the maximum and minimum PR values of the second content over all web pages; $\max(PR_o)$ and $\min(PR_o)$ are the maximum and minimum PR values of the first content over all web pages.
As an improvement of the above scheme, the obtaining of the score of each web page according to the first PR value, the second PR value, the first parameter and the similarity corresponding to each web page in combination with a preset weighting algorithm and the distinguishing of the web page categories according to the scores of all the web pages specifically include:
carrying out weighted calculation on the first PR value, the second PR value, the first parameter and the similarity of each webpage, and obtaining the fixed weight value of each dimension according to a grid search method, thereby weighting each dimension and obtaining the score of each page;
sorting the scores of all the pages from high to low, wherein the top N% of the pages are judged as list pages, and the rest are judged as text pages; wherein N is a positive number.
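The weighting and ranking step above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `score` and `classify`, the label strings, and the use of `math.ceil` to turn N% into a page count are all assumptions made for the example.

```python
import math

def score(pr_second, pr_first, sim_change, ml, weights):
    """Weighted sum of the four dimensions; weights = (a, b, c, d)."""
    a, b, c, d = weights
    return a * pr_second + b * pr_first + c * sim_change + d * ml

def classify(scores, n_percent):
    """Label the top n_percent of pages (by score) as list pages.

    scores: {page_id: score}. Rounding the cutoff up with ceil is an
    assumption; the text only says "the top N%".
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    cutoff = math.ceil(len(ranked) * n_percent / 100.0)
    list_pages = set(ranked[:cutoff])
    return {p: ("list" if p in list_pages else "text") for p in ranked}
```

With five pages and N = 20, exactly one page (the highest-scoring one) is labeled as a list page.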
Correspondingly, the invention also provides a device for judging the webpage category based on node extraction, comprising: an information extraction module, a similarity module, a PR value calculation module and a distinguishing module;
the information extraction module is used for extracting a plurality of webpage information and acquiring first content and second content of each webpage according to the plurality of webpage information; the acquisition time nodes of the first content and the second content are different;
as an improvement of the above scheme, the information extraction module includes: a webpage information extraction unit, a first position calculation unit, a second position calculation unit and a text content unit;
the webpage information extraction unit is used for extracting width data, height data and a plurality of node data of each webpage according to the plurality of webpage information; each node data comprises position data, label name data and text content data of one node;
the first position calculation unit is used for calculating the position data of the central point according to the width data and the height data acquired in each webpage;
the second position calculation unit is used for calculating distance data from the node to the central point according to the position data of the node acquired in each webpage;
the text content unit is used for determining a first node according to the distance data in each webpage and acquiring the text content of each webpage through the first node as the first content of each webpage; after a first preset time interval, re-extracting the width data, the height data and the node data of the multiple webpages to obtain the second content of each webpage; wherein the text content comprises: a first text content and a second text content.
As an improvement of the above scheme, the determining a first node according to the distance data in each web page and acquiring the text content of each web page through the first node specifically includes:
selecting the node with the minimum distance data in each webpage as a first node;
if the tag name data of the first node is a paragraph element, merging the text content data of all nodes with tag names of the paragraph elements to obtain a first text content of each webpage;
if the tag name of the first node is not a paragraph element, determining a central area according to the position data of the central point, and combining text content data of the nodes in the central area to obtain second text content of each webpage.
As an improvement of the above scheme, the determining a center region according to the position data of the center point, and combining text contents of nodes in the center region specifically include:
determining a rectangular central area at the position of the central point; wherein the rectangular central area is delimited according to the golden ratio, with r = 0.382, and the formula is:
[Formula image not reproduced in the text: bounds of the rectangular central area in terms of centerX, centerY, width, height and r]
where (centerX, centerY) is the position data of the central point, width is the width data of the page, height is the height data of the page, and $X_i$ and $Y_i$ are respectively the width data and height data of the central area.
The similarity module is used for respectively calculating the similarity between the first content and the second content in each webpage;
as an improvement of the above scheme, the similarity module includes: a preprocessing unit, a first similarity unit and a second similarity unit.
The pre-processing unit is configured to vectorize the first content and the second content by TFIDF;
the first similarity unit is used for performing similarity calculation on the vectorized first content and second content to obtain a first similarity, and the calculation formula is:
[Formula image not reproduced in the text: the first similarity $Sim(p_i)$ computed from $CT_0(p_i)$ and $CT(p_i)$]
where $Sim(p_i)$ is the first similarity, $CT_0(p_i)$ is the vectorized first content, and $CT(p_i)$ is the vectorized second content;
the second similarity unit is configured to process the first similarity to obtain a second similarity, and the calculation formula is:
$$Sim'(p_i) = 1 - Sim(p_i)$$
where $Sim'(p_i)$ is the second similarity and $Sim(p_i)$ is the first similarity; the second similarity is used as the similarity between the first content and the second content.
The PR value calculation module is used for calculating the PageRank value of each first content and the PageRank value of each second content respectively, obtaining a first PR value and a second PR value corresponding to each webpage, and calculating a first parameter of an external link node in each webpage;
as an improvement of the above scheme, the PR value calculation module includes: an initial value calculation unit and a normalization unit;
the initial value calculating unit is used for calculating the PageRank value of the first content to obtain a third PR value, calculating the PageRank value of the second content to obtain a fourth PR value;
the normalization unit is used for respectively normalizing the third PR value and the fourth PR value to obtain the first PR value and the second PR value, wherein the normalization formulas are:
$$PR'_o(p_i) = \frac{PR_o(p_i) - \min(PR_o)}{\max(PR_o) - \min(PR_o)}$$
$$PR'(p_i) = \frac{PR(p_i) - \min(PR)}{\max(PR) - \min(PR)}$$
where $PR'_o(p_i)$ is the first PR value, $PR'(p_i)$ is the second PR value, $PR_o(p_i)$ is the third PR value, and $PR(p_i)$ is the fourth PR value; $\max(PR)$ and $\min(PR)$ are the maximum and minimum PR values of the second content over all web pages; $\max(PR_o)$ and $\min(PR_o)$ are the maximum and minimum PR values of the first content over all web pages.
The distinguishing module is used for obtaining the score of each webpage according to the first PR value, the second PR value, the first parameter and the similarity corresponding to each webpage by combining a preset weighting algorithm and distinguishing the webpage category of each webpage; the web page category comprises a list page and a text page.
As an improvement of the above scheme, the distinguishing module includes: a score calculating unit and a sorting unit;
the score calculating unit is used for carrying out weighting calculation on the first PR value, the second PR value, the first parameter and the similarity of each webpage, and obtaining the fixed weight value of each dimension according to a grid searching method, so that each dimension is weighted and the score of each page is obtained;
the sorting unit is used for sorting the scores of all the pages from high to low, wherein the top N% of the pages are judged as list pages, and the rest are judged as text pages; wherein N is a positive number.
Accordingly, the present invention further provides a computer terminal device, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, and when the processor executes the computer program, the processor implements a method for determining a web page category based on node extraction according to any one of the present invention.
Correspondingly, the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, where when the computer program runs, a device in which the computer-readable storage medium is located is controlled to execute the method for determining a web page category based on node extraction according to any one of the present invention.
Therefore, the invention has the following beneficial effects:
The invention provides a webpage category judgment method and device based on node extraction, and a terminal device. By distinguishing web page categories, the number of times the web crawler extracts list pages is reduced, which reduces resource consumption. At the same time, the system has more memory available to analyze and extract the content of text pages, which improves the accuracy of text extraction.
Drawings
Fig. 1 is a schematic flowchart of a method for determining a webpage category based on node extraction according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a device for determining a category of a web page based on node extraction according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1, fig. 1 is a schematic flowchart of a method for determining a webpage category based on node extraction according to an embodiment of the present invention, as shown in fig. 1, the present embodiment includes steps 101 to 104, and each step specifically includes the following steps:
step 101: extracting a plurality of webpage information, and acquiring first content and second content of each webpage according to the plurality of webpage information; wherein the acquisition time nodes of the first content and the second content are different.
As a preferred scheme of this embodiment, extracting a plurality of pieces of web page information, and acquiring a first content and a second content of each web page according to the plurality of pieces of web page information specifically includes: extracting width data, height data and a plurality of node data of each webpage according to the information of the plurality of webpages; each node data comprises position data, label name data and text content data of one node; calculating the position data of the central point according to the width data and the height data acquired from each webpage; calculating distance data from the node to the central point according to the position data of the node acquired in each webpage; determining a first node according to the distance data in each webpage, and acquiring text content of each webpage through the first node as the first content of each webpage; after a first preset time interval, re-extracting the width data, the height data and the node data of the multiple webpages to obtain the second content of each webpage; wherein the text content comprises: a first text content and a second text content.
As a preferred scheme of this embodiment, determining a first node according to the distance data in each web page, and acquiring text content of each web page through the first node specifically includes: selecting the node with the minimum distance data in each webpage as the first node; if the tag name data of the first node is a paragraph element, merging the text content data of all nodes whose tag name is a paragraph element to obtain a first text content of each webpage; if the tag name of the first node is not a paragraph element, determining a central area according to the position data of the central point, and combining text content data of the nodes in the central area to obtain second text content of each webpage.
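The two branches described above can be sketched as follows. This is only an illustration under stated assumptions: the `Node` record, its field names, the precomputed `in_center` flag (standing in for the central-area test) and the newline used to join fragments are hypothetical and are not taken from the patent text.

```python
from dataclasses import dataclass

@dataclass
class Node:
    tag: str         # tag name of the node, e.g. "p" or "div"
    text: str        # text content of the node
    distance: float  # distance from the node to the page center point
    in_center: bool  # assumed precomputed: node lies inside the central area

def extract_main_text(nodes):
    """Pick the node nearest the center and merge text per the two branches."""
    if not nodes:
        return ""
    first = min(nodes, key=lambda n: n.distance)
    if first.tag.lower() == "p":
        # Branch 1: first node is a paragraph, so merge all paragraph nodes.
        parts = [n.text for n in nodes if n.tag.lower() == "p"]
    else:
        # Branch 2: otherwise merge the text of nodes in the central area.
        parts = [n.text for n in nodes if n.in_center]
    return "\n".join(t.strip() for t in parts if t.strip())
```

Running the same function on two crawls of a page taken at different times would yield the first content and the second content.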
As a preferred scheme of this embodiment, determining a central area according to the position data of the central point, and combining text contents of nodes in the central area specifically includes: determining a rectangular central area at the position of the central point; wherein the rectangular central area is delimited according to the golden ratio, with r = 0.382, and the formula is:
[Formula image not reproduced in the text: bounds of the rectangular central area in terms of centerX, centerY, width, height and r]
where (centerX, centerY) is the position data of the central point, width is the width data of the page, height is the height data of the page, and $X_i$ and $Y_i$ are respectively the width data and height data of the central area.
As a preferred scheme of this embodiment, extracting the width data, height data and node data of each web page according to the information of the plurality of web pages specifically comprises: determining the width and height of the page or screen, and calculating the central point (centerX, centerY): centerX = width/2; centerY = height/2;
acquiring all nodes containing content in the web page, $D = \{d_i \mid i \in 1, 2, 3, \ldots, N\}$, and the vertex coordinates $(top_i, bottom_i, left_i, right_i)$ of each node $d_i$, and calculating the distance $distance_i$ between the node and the central point, with:
$$X_i = (right_i - left_i)/2.0 + left_i$$
$$Y_i = (bottom_i - top_i)/2.0 + top_i$$
If $top_i \le centerX$ and $bottom_i \ge centerX$, and at the same time $left_i \le centerY$ and $right_i \ge centerY$, then $distance_i = 0$;
If $top_i \le centerX$ and $bottom_i \ge centerX$, then $distance_i = |Y_i - centerY|$;
If $left_i \le centerY$ and $right_i \ge centerY$, then $distance_i = |X_i - centerX|$;
If none of the three conditions is satisfied, then
[Formula image not reproduced in the text: $distance_i$ for the remaining case]
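A rough sketch of the node-to-center distance follows. Two points are assumptions rather than the patent's wording: the sketch pairs left/right with centerX and top/bottom with centerY (a geometric interpretation; the text as written pairs them the other way round), and the fallback case uses the Euclidean distance between the node center and the page center because that formula is not reproduced in the text. The function names are hypothetical.

```python
import math

def center_point(width, height):
    """centerX = width / 2, centerY = height / 2."""
    return width / 2.0, height / 2.0

def node_distance(top, bottom, left, right, center_x, center_y):
    """Piecewise distance from a node's bounding box to the page center."""
    x = (right - left) / 2.0 + left   # X_i: horizontal center of the node
    y = (bottom - top) / 2.0 + top    # Y_i: vertical center of the node
    spans_x = left <= center_x <= right   # assumed pairing, see lead-in
    spans_y = top <= center_y <= bottom   # assumed pairing, see lead-in
    if spans_x and spans_y:
        return 0.0                    # the node covers the center point
    if spans_x:
        return abs(y - center_y)
    if spans_y:
        return abs(x - center_x)
    # Assumption: Euclidean distance for the case whose formula is missing.
    return math.hypot(x - center_x, y - center_y)

def nearest_node(boxes, width, height):
    """Index of the (top, bottom, left, right) box closest to the center."""
    cx, cy = center_point(width, height)
    return min(range(len(boxes)), key=lambda i: node_distance(*boxes[i], cx, cy))
```

The node returned by `nearest_node` plays the role of the first node in the text-extraction step.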
As a preferred scheme of this embodiment, a web crawler is used to capture web page content, extract all links of a web page, and distinguish nodes into directory nodes and text nodes by whether the links exist in a directory set.
Step 102: and respectively calculating the similarity between the first content and the second content in each webpage.
As a preferable solution of this embodiment, the first content and the second content are vectorized by TF-IDF; similarity calculation is performed on the vectorized first content and second content to obtain a first similarity, wherein the calculation formula is:
[Formula image not reproduced in the text: the first similarity $Sim(p_i)$ computed from $CT_0(p_i)$ and $CT(p_i)$]
where $Sim(p_i)$ is the first similarity, $CT_0(p_i)$ is the vectorized first content, and $CT(p_i)$ is the vectorized second content; the first similarity is processed to obtain a second similarity, wherein the calculation formula is:
$$Sim'(p_i) = 1 - Sim(p_i)$$
where $Sim'(p_i)$ is the second similarity and $Sim(p_i)$ is the first similarity; the second similarity is used as the similarity between the first content and the second content.
As a preferable solution of this embodiment, the first content and the second content are vectorized by TF-IDF, and the calculation formulas are:
[Two formula images not reproduced in the text: the TF-IDF weighting and the resulting vector $CT(p_i)$]
where $W_i$ is the word set of the main content of a web page, $w_{ik}$ is a word in that word set, $T$ is the set of the main contents of all web pages, $T_{w_k}$ is the set of main contents that contain the word $w_k$, and $CT(p_i)$ is the vectorized main content of the web page.
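As an illustration of Step 102, the sketch below vectorizes two snapshots of a page with TF-IDF and derives Sim'(p_i) = 1 - Sim(p_i). Several choices are assumptions, since the formula images are not reproduced in the text: cosine similarity is used for Sim, the IDF is the smoothed form log((1 + N) / (1 + df)) + 1, and the function names and whitespace tokenization are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (assumed variant) so shared terms keep a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1
        vectors.append({t: (c / total) * idf[t] for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors (assumed form of Sim)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def change_score(first_tokens, second_tokens, corpus):
    """Sim'(p_i) = 1 - Sim(p_i): higher means the content changed more."""
    vecs = tfidf_vectors(corpus + [first_tokens, second_tokens])
    return 1.0 - cosine(vecs[-2], vecs[-1])
```

A page whose central content changes between the two crawls, as a list page typically does, receives a value closer to 1, while an unchanged page receives a value close to 0.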
Step 103: respectively calculating the PageRank value of each first content and the PageRank value of each second content to obtain a first PR value and a second PR value corresponding to each webpage, and calculating a first parameter of an external link node in each webpage; wherein the first parameter indicates whether the external link points back to the webpage itself.
As a preferred scheme of this embodiment, calculating the PageRank value of each first content and the PageRank value of each second content specifically comprises: assume that web site W has N web pages $P = \{p_i \mid i \in 1, 2, 3, \ldots, N\}$, where $M(p_i)$ is the set of all web pages that have an out-link to $p_i$, and $L(p_j)$ is the set of all out-link web pages of $p_j$; set the initial value $PR_0(p_i)$ at time t = 0 to x and the damping coefficient to $\alpha$, and calculate the PageRank value $PR_t(p_i)$ at iteration t:
[Formula image not reproduced in the text: iterative update of $PR_t(p_i)$ from $PR_{t-1}(p_j)$, $M(p_i)$, $L(p_j)$ and $\alpha$]
Given a small threshold $\epsilon$, the iteration stops if the difference between the PageRank value at iteration t and the value at iteration t-1 is smaller than $\epsilon$; otherwise the iteration continues:
$$PR(p_i) = PR_t(p_i), \quad \text{if } |PR_t(p_i) - PR_{t-1}(p_i)| < \epsilon$$
The PageRank value of the web page content is obtained from the iteration result, and the PageRank values of the first content and the second content are calculated in this way.
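The iteration can be sketched as below. Because the update formula is not reproduced in the text, the sketch assumes the widely used form PR_t(p_i) = (1 - alpha)/N + alpha * sum over p_j in M(p_i) of PR_{t-1}(p_j)/|L(p_j)|; the function name, the default values of `alpha`, `eps` and `max_iter`, and the default initial value 1/N are likewise assumptions.

```python
def pagerank(out_links, alpha=0.85, x=None, eps=1e-8, max_iter=200):
    """out_links: {page: set of pages it links to}. Returns {page: PR value}."""
    pages = sorted(set(out_links) | {q for ts in out_links.values() for q in ts})
    n = len(pages)
    if n == 0:
        return {}
    init = 1.0 / n if x is None else x       # PR_0(p_i) = x
    pr = {p: init for p in pages}
    # M(p_i): the pages that link to p_i.
    incoming = {p: [] for p in pages}
    for src, targets in out_links.items():
        for dst in targets:
            incoming[dst].append(src)
    for _ in range(max_iter):
        new = {}
        for p in pages:
            s = sum(pr[q] / len(out_links[q]) for q in incoming[p])
            new[p] = (1.0 - alpha) / n + alpha * s
        # Stop when every page changed by less than eps since the last step.
        converged = max(abs(new[p] - pr[p]) for p in pages) < eps
        pr = new
        if converged:
            break
    return pr
```

Pages without out-links are not redistributed in this sketch, so their rank mass is simply lost; a production crawler would likely handle such dangling pages explicitly.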
As a preferred scheme of this embodiment, the PageRank value of the first content is calculated to obtain a third PR value, and the PageRank value of the second content is calculated to obtain a fourth PR value; the third PR value and the fourth PR value are respectively normalized to obtain the first PR value and the second PR value, wherein the normalization formulas are:
$$PR'_o(p_i) = \frac{PR_o(p_i) - \min(PR_o)}{\max(PR_o) - \min(PR_o)}$$
$$PR'(p_i) = \frac{PR(p_i) - \min(PR)}{\max(PR) - \min(PR)}$$
where $PR'_o(p_i)$ is the first PR value, $PR'(p_i)$ is the second PR value, $PR_o(p_i)$ is the third PR value, and $PR(p_i)$ is the fourth PR value; $\max(PR)$ and $\min(PR)$ are the maximum and minimum PR values of the second content over all web pages; $\max(PR_o)$ and $\min(PR_o)$ are the maximum and minimum PR values of the first content over all web pages.
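The normalization can be illustrated with a small min-max helper, assuming the missing formulas are the usual (value - min) / (max - min) scaling implied by the max and min terms. The function name is hypothetical, and returning 0.0 when all values are equal is an assumption added to avoid division by zero.

```python
def min_max_normalize(values):
    """values: {page: raw PR value}. Returns {page: value scaled to [0, 1]}."""
    if not values:
        return {}
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # Assumption: with no spread, every page gets 0.0.
        return {k: 0.0 for k in values}
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}
```

Applying it once to the PR values of the first content and once to those of the second content gives the first and second PR values for every page.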
As a preferred scheme of this embodiment, calculating a first parameter of an external link node in each web page specifically includes: judging whether the node of an external link points back to the web page itself, wherein the calculation formula is:
$$ML(p_i) = \begin{cases} 1, & \text{if an external link node of } p_i \text{ points back to } p_i \\ 0, & \text{otherwise} \end{cases}$$
That is, if the external link points back to the page itself, the first parameter $ML(p_i)$ is 1; otherwise the first parameter $ML(p_i)$ is 0.
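A minimal sketch of the first parameter follows. It assumes that ML(p_i) is 1 when at least one page linked from p_i links back to p_i; whether the patent intends "any" or "all" external links is not stated, and the function name and the dictionary representation of the link graph are hypothetical.

```python
def back_link_flag(page, out_links):
    """ML(p_i): 1 if any page that `page` links to links back to `page`.

    out_links: {page: set of pages it links to}. Using "any" rather than
    "all" is an assumption about the intended rule.
    """
    targets = out_links.get(page, ())
    return int(any(page in out_links.get(t, ()) for t in targets))
```

For example, a list page whose article pages carry a "back to list" link would receive ML = 1 under this reading.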
Step 104: according to the first PR value, the second PR value, the first parameter and the similarity corresponding to each webpage, combining a preset weighting algorithm to obtain the score of each webpage, and according to the scores of all the webpages, distinguishing the webpage categories of each webpage; the web page category comprises a list page and a text page.
As a preferred scheme of this embodiment, a weighting calculation is performed on the first PR value, the second PR value, the first parameter, and the similarity of each web page, and a weighting value of each dimension is obtained according to a web search method, so that each dimension is weighted and a score of each page is obtained; sorting the scores of all the pages from high to low, wherein the top N% of the pages are judged as list pages, and the rest are judged as content pages; wherein N is a positive number.
As a preferred scheme of this embodiment, a grid search method is used to fit the optimal coefficients on a manually labeled web page result set and verification set, yielding the fixed weights a, b, c, and d; the weighted calculation formula is:
$$Score(p_i) = a\,PR'(p_i) + b\,PR'_o(p_i) + c\,Sim'(p_i) + d\,ML(p_i)$$
thereby obtaining the score $Score(p_i)$ of each web page, where $PR'_o(p_i)$ is the first PR value, $PR'(p_i)$ is the second PR value, $Sim'(p_i)$ is the similarity, and $ML(p_i)$ is the first parameter.
As a preferable mode of this embodiment, N may be any number between 1 and 20.
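The weighting and ranking of step 104 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the weights (0.25 each) are placeholders instead of values fitted by grid search, the feature values are invented, and rounding the top-N% cutoff up with `math.ceil` is an assumption the text does not specify.

```python
import math
from typing import Dict, Tuple

Features = Dict[str, float]  # keys: "pr" (second PR), "pr_o" (first PR), "sim", "ml"


def score_page(f: Features, weights: Tuple[float, float, float, float]) -> float:
    """Weighted sum a*PR' + b*PR'_o + c*Sim' + d*ML for one page."""
    a, b, c, d = weights
    return a * f["pr"] + b * f["pr_o"] + c * f["sim"] + d * f["ml"]


def classify_pages(pages: Dict[str, Features],
                   weights: Tuple[float, float, float, float],
                   n_percent: float) -> Dict[str, str]:
    """Sort pages by score (high to low); the top n_percent are labeled list pages."""
    scores = {url: score_page(f, weights) for url, f in pages.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    cutoff = math.ceil(len(ranked) * n_percent / 100)  # assumed rounding rule
    return {url: ("list" if i < cutoff else "content") for i, url in enumerate(ranked)}


# Hypothetical, already-normalized features for five pages.
pages = {
    "index": {"pr": 1.0, "pr_o": 1.0, "sim": 0.80, "ml": 1.0},
    "p1": {"pr": 0.2, "pr_o": 0.2, "sim": 0.00, "ml": 0.0},
    "p2": {"pr": 0.1, "pr_o": 0.1, "sim": 0.10, "ml": 0.0},
    "p3": {"pr": 0.0, "pr_o": 0.0, "sim": 0.00, "ml": 0.0},
    "p4": {"pr": 0.3, "pr_o": 0.3, "sim": 0.05, "ml": 0.0},
}
labels = classify_pages(pages, weights=(0.25, 0.25, 0.25, 0.25), n_percent=20)
```

With N = 20 and five pages, exactly one page (the highest-scoring one) is labeled a list page in this sketch.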
The embodiment of the invention has the following effects:
The invention discloses a web page category determination method based on node extraction. The method performs multi-dimensional calculations on the main content extracted from each web page at two different times, obtains a fixed weight for each dimension, weights each dimension by its fixed weight to obtain the score of the web page, and judges whether the web page is a list page or a text page. Determining the web page categories prevents the web crawler from repeatedly crawling all the web pages and reduces the resources consumed in acquiring text-page content; it also allows analysis to concentrate on text-page content, which improves the accuracy of that analysis.
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of a device for determining a web page category based on node extraction according to an embodiment of the present invention, which comprises: an information extraction module 201, a similarity module 202, a PR value calculation module 203, and a distinguishing module 204;
the information extraction module 201 is configured to extract a plurality of pieces of web page information, and obtain a first content and a second content of each web page according to the plurality of pieces of web page information; the acquisition time nodes of the first content and the second content are different;
as an improvement of the above solution, the information extraction module 201 includes: the system comprises a webpage information extraction unit, a first position calculation unit, a second position calculation unit and a text content unit;
the webpage information extraction unit is used for extracting width data, height data and a plurality of node data of each webpage according to the plurality of webpage information; each node data comprises position data, tag name data and text content data of one node;
the first position calculation unit is used for calculating the position data of the central point according to the width data and the height data acquired in each webpage;
the second position calculation unit is used for calculating distance data from the node to the central point according to the position data of the node acquired in each webpage;
the text content unit is used for determining a first node according to the distance data in each webpage and acquiring the text content of each webpage through the first node as the first content of each webpage; after a first preset time interval, re-extracting the width data, the height data and the node data of the multiple webpages to obtain the second content of each webpage; wherein the text content comprises: a first text content and a second text content.
As an improvement of the above scheme, the determining a first node according to the distance data in each web page and acquiring the text content of each web page through the first node specifically includes:
selecting the node with the minimum distance data in each webpage as a first node;
if the tag name data of the first node is a paragraph element, merging the text content data of all nodes with tag names of the paragraph elements to obtain a first text content of each webpage;
if the tag name data of the first node is not a paragraph element, determining a central area according to the position data of the central point, and merging the text content data of the nodes in the central area to obtain the second text content of each web page.
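The selection of the first node and the paragraph branch can be sketched as follows. This is an illustration under stated assumptions: the central point is taken as (width/2, height/2), the distance is Euclidean, and a paragraph element is identified by the tag name `p`; none of these details is fixed by the text. The `Node` class and function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    x: float   # horizontal position of the node
    y: float   # vertical position of the node
    tag: str   # tag name, e.g. "p" or "div"
    text: str  # text content of the node


def nearest_node(nodes: List[Node], width: float, height: float) -> Node:
    """Pick the node closest to the assumed page centre (width/2, height/2)."""
    cx, cy = width / 2, height / 2
    return min(nodes, key=lambda n: math.hypot(n.x - cx, n.y - cy))


def paragraph_text(nodes: List[Node]) -> str:
    """Merge the text of every paragraph node, in document order."""
    return "\n".join(n.text for n in nodes if n.tag == "p")


nodes = [
    Node(10, 10, "nav", "Home"),
    Node(480, 400, "p", "First paragraph."),
    Node(500, 700, "p", "Second paragraph."),
]
first = nearest_node(nodes, width=1000, height=800)
# Paragraph branch only; the central-area branch is a separate step.
content = paragraph_text(nodes) if first.tag == "p" else None
```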
As an improvement of the above scheme, the determining a center region according to the position data of the center point, and combining text contents of nodes in the center region specifically include:
determining a rectangular central area at the position of the central point, wherein the rectangular central area is delimited according to the mathematical golden ratio, with r% = 0.382, and the formula is:
Figure BDA0003423234820000151
where centerX and centerY are the position data of the central point, width is the width data of the page, height is the height data of the page, and $X_i$ and $Y_i$ are the width data and height data of the central area, respectively.
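Because the formula image is not reproduced here, the exact bounds of the central area are not recoverable from this text. The sketch below assumes one plausible form: a rectangle centred on (centerX, centerY) whose width and height are r times the page width and height, with r = 0.382. The function names and sample nodes are hypothetical.

```python
from typing import List, Tuple

R = 0.382  # ratio stated in the text

Region = Tuple[float, float, float, float]  # left, top, right, bottom


def central_region(width: float, height: float, r: float = R) -> Region:
    """Assumed central rectangle: centred on the page, r*width wide and r*height tall."""
    center_x, center_y = width / 2, height / 2
    half_w, half_h = r * width / 2, r * height / 2
    return (center_x - half_w, center_y - half_h, center_x + half_w, center_y + half_h)


def text_in_region(nodes: List[Tuple[float, float, str]], region: Region) -> str:
    """Merge the text of nodes whose position falls inside the region."""
    left, top, right, bottom = region
    return " ".join(text for x, y, text in nodes
                    if left <= x <= right and top <= y <= bottom)


# Hypothetical nodes given as (x, y, text) on a 1000 x 800 page.
sample = [(500, 400, "Headline"), (320, 260, "Lead"), (50, 50, "Menu"), (500, 700, "Footer")]
region = central_region(1000, 800)
merged = text_in_region(sample, region)
```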
The similarity module 202 is configured to calculate a similarity between the first content and the second content in each web page respectively;
as an improvement of the above scheme, the similarity module 202 includes: the device comprises a preprocessing unit, a first similarity unit and a second similarity unit.
The pre-processing unit is configured to vectorize the first content and the second content by TFIDF;
the first similarity unit is used for performing similarity calculation on the first content and the second content which are subjected to vectorization processing to obtain a first similarity, and the calculation formula is as follows:
Figure BDA0003423234820000161
where $Sim(p_i)$ is the first similarity, $C_{T_0(p_i)}$ is the vectorized first content, and $C_{T(p_i)}$ is the vectorized second content;
the second similarity unit is configured to process the first similarity to obtain a second similarity, and a calculation formula is as follows:
$$Sim'(p_i) = 1 - Sim(p_i)$$
where $Sim'(p_i)$ is the second similarity and $Sim(p_i)$ is the first similarity; the second similarity is taken as the similarity between the first content and the second content.
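The two similarity units can be sketched in plain Python as follows. Several points are assumptions: the first-similarity formula image is missing, so cosine similarity is used as a common choice for TF-IDF vectors; the smoothed IDF form is one conventional variant; and whitespace tokenization stands in for a proper word segmenter, which Chinese text would require. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors with a smoothed IDF (an assumed variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (cnt / len(doc)) * idf[t] for t, cnt in tf.items()})
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def second_similarity(first_content: str, second_content: str) -> float:
    """Sim' = 1 - Sim, computed from the two snapshots of one page."""
    u, v = tfidf_vectors([first_content.split(), second_content.split()])
    return 1.0 - cosine(u, v)
```

Under this sketch, a page whose extracted content is unchanged between the two acquisitions yields a value near 0, while a page whose content has been fully replaced yields 1.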
The PR value calculating module 203 is configured to calculate a PageRank value of each first content and a PageRank value of each second content, obtain a first PR value and a second PR value corresponding to each web page, and calculate a first parameter of an external link node in each web page;
as an improvement of the above scheme, the PR value calculation module 203 includes: an initial value calculation unit and a normalization unit;
the initial value calculating unit is used for calculating the PageRank value of the first content to obtain a third PR value, calculating the PageRank value of the second content to obtain a fourth PR value;
the normalization unit is used for respectively performing normalization processing on the third PR value and the fourth PR value to obtain a first PR value and a second PR value, wherein the formula of the normalization processing is as follows:
Figure BDA0003423234820000171
Figure BDA0003423234820000172
In the formulas, $PR'_o(p_i)$ is the first PR value, $PR'(p_i)$ is the second PR value, $PR_o(p_i)$ is the third PR value, and $PR(p_i)$ is the fourth PR value; $\max(PR)$ and $\min(PR)$ are the maximum and minimum of the second PR values corresponding to all the web pages; $\max(PR_o)$ and $\min(PR_o)$ are the maximum and minimum of the first PR values corresponding to all the web pages.
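The text does not state how the initial value calculation unit computes PageRank. As a hedged illustration only, the sketch below uses the conventional power-iteration form over a link graph of the crawled pages; the damping factor 0.85, the iteration count, the treatment of pages without out-links, and the sample graph are all assumptions.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over a dict mapping each page to its out-links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            known = [t for t in targets if t in rank]  # ignore links leaving the graph
            if known:
                share = damping * rank[page] / len(known)
                for t in known:
                    new_rank[t] += share
            else:
                # Page without usable out-links: spread its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical snapshot: an index page linking to two articles that link back.
snapshot = {"index": ["a", "b"], "a": ["index"], "b": ["index"]}
raw_pr = pagerank(snapshot)
```

Running this once per acquisition time would give the raw (third and fourth) PR values that the normalization unit then rescales.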
The distinguishing module 204 is configured to obtain a score of each web page according to the first PR value, the second PR value, the first parameter, and the similarity corresponding to each web page, in combination with a preset weighting algorithm, and distinguish a category of each web page; the web page category comprises a list page and a text page.
As an improvement of the above solution, the distinguishing module 204 includes: a score calculating unit and a sorting unit;
the score calculating unit is used for carrying out weighting calculation on the first PR value, the second PR value, the first parameter and the similarity of each webpage, and obtaining the fixed weight value of each dimension according to a grid searching method, so that each dimension is weighted and the score of each page is obtained;
the sorting unit is used for sorting the scores of all the pages from high to low, the top N% of the pages are judged as list pages, and the rest are judged as content pages; wherein N is a positive number.
By implementing this embodiment of the invention, web page categories can be determined well: the information extraction module extracts the web page content and obtains the text content of each web page by judging its visual center, the similarity module and the PR value calculation module obtain the parameters, and the distinguishing module obtains the weights and performs the weighted calculation on the parameters to produce a judgment score for each web page. This helps improve the efficiency of web page content analysis and saves resources.
Example three
Referring to fig. 3, fig. 3 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
The terminal device of this embodiment includes: a processor 301, a memory 302, and a computer program stored in the memory 302 and executable on the processor 301. When executing the computer program, the processor 301 implements the steps of the method embodiments for determining a web page category based on node extraction described above, for example all the steps of the method shown in fig. 1. Alternatively, when executing the computer program, the processor implements the functions of the modules in the device embodiment, for example all the modules of the device for determining a web page category based on node extraction shown in fig. 2.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, where when the computer program runs, a device in which the computer-readable storage medium is located is controlled to execute the method for determining a webpage category based on node extraction according to any of the above embodiments.
It will be appreciated by those skilled in the art that the schematic diagram is merely an example of a terminal device and does not constitute a limitation of a terminal device, and may include more or less components than those shown, or combine certain components, or different components, for example, the terminal device may also include input output devices, network access devices, buses, etc.
The processor 301 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor. The processor 301 is the control center of the terminal device and connects the various parts of the whole terminal device through various interfaces and lines.
The memory 302 can be used for storing the computer programs and/or modules, and the processor 301 implements various functions of the terminal device by running or executing the computer programs and/or modules stored in the memory and calling the data stored in the memory 302. The memory 302 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
The modules/units integrated in the terminal device may be stored in a computer-readable storage medium if they are implemented as software functional units and sold or used as stand-alone products. Based on this understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A method for determining a web page category based on node extraction, characterized by comprising:
extracting a plurality of pieces of web page information, and obtaining a first content and a second content of each web page according to the plurality of pieces of web page information, wherein the first content and the second content are acquired at different time nodes;
calculating the similarity between the first content and the second content in each web page respectively;
calculating the PageRank value of each first content and the PageRank value of each second content respectively to obtain a first PR value and a second PR value corresponding to each web page, and calculating a first parameter of the out-link nodes in each web page, wherein the first parameter is a parameter indicating that an out-link points back to the page itself;
obtaining the score of each web page according to the first PR value, the second PR value, the first parameter, and the similarity corresponding to each web page, in combination with a preset weighting algorithm, and distinguishing the web page category of each web page according to the scores of all the web pages, wherein the web page categories comprise list pages and text pages.

2. The method for determining a web page category based on node extraction according to claim 1, characterized in that extracting the plurality of pieces of web page information and obtaining the first content and the second content of each web page according to the plurality of pieces of web page information specifically comprises:
extracting width data, height data, and a plurality of node data of each web page according to the plurality of pieces of web page information, wherein each node data comprises the position data, tag name data, and text content data of one node;
calculating the position data of a central point according to the width data and height data acquired in each web page;
calculating distance data from each node to the central point according to the position data of the node acquired in each web page;
determining a first node according to the distance data in each web page, and acquiring the text content of each web page through the first node as the first content of each web page; after a first preset time interval, re-extracting the width data, height data, and node data of the plurality of web pages to obtain the second content of each web page; wherein the text content comprises a first text content and a second text content.

3. The method for determining a web page category based on node extraction according to claim 2, characterized in that determining the first node according to the distance data in each web page and acquiring the text content of each web page through the first node specifically comprises:
selecting the node with the smallest distance data in each web page as the first node;
if the tag name data of the first node is a paragraph element, merging the text content data of all nodes whose tag name is the paragraph element to obtain the first text content of each web page;
if the tag name data of the first node is not a paragraph element, determining a central area according to the position data of the central point, and merging the text content data of the nodes in the central area to obtain the second text content of each web page.

4. The method for determining a web page category based on node extraction according to claim 3, characterized in that determining the central area according to the position data of the central point and merging the text content of the nodes in the central area specifically comprises:
determining a rectangular central area at the position of the central point, wherein the rectangular central area is delimited according to the mathematical golden ratio, with r% = 0.382, and the formula is:
Figure FDA0003423234810000021
where centerX and centerY are the position data of the central point, width is the width data of the page, height is the height data of the page, and $X_i$ and $Y_i$ are the width data and height data of the central area, respectively.

5. The method for determining a web page category based on node extraction according to claim 1, characterized in that calculating the similarity between the first content and the second content in each web page respectively specifically comprises:
vectorizing the first content and the second content by TFIDF;
performing similarity calculation on the vectorized first content and second content to obtain a first similarity, where the calculation formula is:
Figure FDA0003423234810000031
where $Sim(p_i)$ is said similarity, $C_{T_0(p_i)}$ is the vectorized first content, and $C_{T(p_i)}$ is the vectorized second content;
processing the first similarity to obtain a second similarity, where the calculation formula is:
$$Sim'(p_i) = 1 - Sim(p_i)$$
where $Sim'(p_i)$ is the second similarity and $Sim(p_i)$ is the first similarity; the second similarity is the similarity between the first content and the second content.

6. The method for determining a web page category based on node extraction according to claim 1, characterized in that calculating the PageRank value of each first content and the PageRank value of each second content respectively to obtain the first PR value and the second PR value corresponding to each web page specifically comprises:
calculating the PageRank value of the first content to obtain a third PR value, and calculating the PageRank value of the second content to obtain a fourth PR value;
normalizing the third PR value and the fourth PR value respectively to obtain the first PR value and the second PR value, where the normalization formulas are:
Figure FDA0003423234810000032
Figure FDA0003423234810000033
where $PR'_o(p_i)$ is the first PR value, $PR'(p_i)$ is the second PR value, $PR_o(p_i)$ is the third PR value, and $PR(p_i)$ is the fourth PR value; $\max(PR)$ and $\min(PR)$ are the maximum and minimum of the second PR values corresponding to all the web pages; $\max(PR_o)$ and $\min(PR_o)$ are the maximum and minimum of the first PR values corresponding to all the web pages.

7. The method for determining a web page category based on node extraction according to claim 1, characterized in that obtaining the score of each web page according to the first PR value, the second PR value, the first parameter, and the similarity corresponding to each web page, in combination with the preset weighting algorithm, and distinguishing the web page categories according to the scores of all the web pages specifically comprises:
performing a weighted calculation on the first PR value, the second PR value, the first parameter, and the similarity of each web page, and obtaining the fixed weight of each dimension by a grid search method, so that each dimension is weighted and the score of each page is obtained;
sorting the scores of all the pages from high to low, judging the top N% of the web pages to be list pages, and judging the rest to be content pages, where N is a positive number.

8. A device for determining a web page category based on node extraction, characterized by comprising: an information extraction module, a similarity module, a PR value calculation module, and a distinguishing module;
the information extraction module is configured to extract a plurality of pieces of web page information and obtain a first content and a second content of each web page according to the plurality of pieces of web page information, wherein the first content and the second content are acquired at different time nodes;
the similarity module is configured to calculate the similarity between the first content and the second content in each web page respectively;
the PR value calculation module is configured to calculate the PageRank value of each first content and the PageRank value of each second content respectively to obtain a first PR value and a second PR value corresponding to each web page, and to calculate a first parameter of the out-link nodes in each web page;
the distinguishing module is configured to obtain the score of each web page according to the first PR value, the second PR value, the first parameter, and the similarity corresponding to each web page, in combination with a preset weighting algorithm, and to distinguish the web page category of each web page, wherein the web page categories comprise list pages and text pages.

9. A computer terminal device, characterized by comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the method for determining a web page category based on node extraction according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program, wherein, when the computer program runs, a device in which the computer-readable storage medium is located is controlled to execute the method for determining a web page category based on node extraction according to any one of claims 1 to 7.
CN202111570549.0A 2021-12-21 2021-12-21 A method, device and terminal device for determining web page category based on node extraction Active CN114385893B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111570549.0A CN114385893B (en) 2021-12-21 2021-12-21 A method, device and terminal device for determining web page category based on node extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111570549.0A CN114385893B (en) 2021-12-21 2021-12-21 A method, device and terminal device for determining web page category based on node extraction

Publications (2)

Publication Number Publication Date
CN114385893A (en) 2022-04-22
CN114385893B (en) 2024-11-12

Family

ID=81198422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111570549.0A Active CN114385893B (en) 2021-12-21 2021-12-21 A method, device and terminal device for determining web page category based on node extraction

Country Status (1)

Country Link
CN (1) CN114385893B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100023630A (en) * 2008-08-22 2010-03-04 고려대학교 산학협력단 Method and system of classifying web page using categogory tag information and recording medium using by the same
CN104699817A (en) * 2015-03-24 2015-06-10 中国人民解放军国防科学技术大学 Search engine ordering method and search engine ordering system based on improved spectral clusters
CN108763591A (en) * 2018-06-21 2018-11-06 湖南星汉数智科技有限公司 A kind of webpage context extraction method, device, computer installation and computer readable storage medium
CN109933739A (en) * 2019-03-01 2019-06-25 重庆邮电大学移通学院 A kind of Web page sequencing method and system based on transition probability

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
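Chen Guangsheng; Li Siyang; Zhang Fan; Li Dan: "Research on optimization of the PageRank algorithm based on forestry topics", Journal of Natural Science of Heilongjiang University, no. 04, 25 August 2016 (2016-08-25), pages 117 - 122 *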

Also Published As

Publication number Publication date
CN114385893B (en) 2024-11-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 26/F, Building A, News Center, No. 289, Guangzhou Avenue, Yuexiu District, Guangzhou, Guangdong 510000

Applicant after: Guangdong Southern Intelligent Media Technology Co.,Ltd.

Address before: 510000 room 306, 3 / F, news center, 289 Guangzhou Avenue central, Yuexiu District, Guangzhou City, Guangdong Province

Applicant before: Guangdong Southern New Media Technology Co.,Ltd.

Country or region before: China

GR01 Patent grant