CN105243062B - Method and device for detecting webpage feature area - Google Patents
- Publication number
- CN105243062B; CN201410245946.4A
- Authority
- CN
- China
- Prior art keywords
- page
- result
- filtering
- area
- screenshot
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a method and a device for detecting a webpage feature area. First, a first page result of a page is generated while filtering is working normally, and a second page result of the page is generated after a set threshold time has elapsed. The second page result is then compared with the first page result, and any area that differs is determined to be a feature area that indicates a problem. In the web page advertisement filtering scenario, these feature areas are advertisement areas; they may arise because an advertisement filtering rule has become invalid, so that an advertisement that should have been filtered appears, or because the filtering rules do not yet cover a new advertisement. By comparing the web page against a reference page captured while filtering was working normally, the method and the device can quickly detect the feature areas (advertisement areas) in the web page, surface problems quickly, and provide a reference for subsequent web page filtering, so that the filtering rules can be adjusted and a better filtering effect obtained.
Description
Technical Field
The present invention relates to the field of mobile communications technologies, and in particular, to a method and an apparatus for detecting a feature area of a web page.
Background
Current web pages contain various advertisements. On the one hand, the advertisements degrade the user experience; on the other hand, they consume extra traffic during access. A browser or browser plug-in that can intelligently filter the advertisements in a web page can therefore greatly improve the user experience.
Existing browsers generally set advertisement filtering rules. The rules are made by checking whether web pages on the Internet contain new advertisements in two ways, user feedback and manual checking. User feedback is not timely enough, and manual checking is not efficient enough.
Existing systems for automatically detecting advertisements on web pages detect them by comparing differences in the DOM tree and the Render tree generated while the page is parsed and laid out. Specifically, the DOM tree and Render tree of an advertisement-free page are obtained after the advertisements are filtered, and the page to be checked is then compared with the advertisement-free page to detect advertisements.
However, this method is generally directed to a test page whose content does not change, and for an internet page whose content changes, it is impossible to distinguish whether the change is caused by an advertisement or caused by the content of the web page itself, and thus, it may be impossible to detect an advertisement. In addition, in the prior art, advertisement filtering is to filter advertisements through the DOM structure of the web page, and if the same mechanism is adopted by the system for automatically detecting advertisements, the purpose of detecting advertisements is difficult to achieve.
Disclosure of Invention
In view of the foregoing problems, an object of the present invention is to provide a method and an apparatus for detecting a feature area of a web page, which can quickly detect the feature area in the web page, facilitate quick problem finding during filtering of web page advertisements, provide a reference for subsequent filtering of web page advertisements, and adjust a filtering rule, thereby obtaining a better filtering effect.
According to one aspect of the present invention, there is provided a method for detecting a characteristic region of a web page, including:
generating a first page result of the page while filtering is working normally;
obtaining a second page result of the page after a set threshold time has elapsed;
and comparing the second page result with the first page result, and if differing areas exist, determining the differing areas to be feature areas that indicate a problem.
Wherein generating the first page result of the page while filtering is working normally comprises: generating a first page result in which a content logic area of the page is marked off, the content logic area being generated by performing multiple page loads, comparing the differences between the pages loaded each time, and merging the differing parts;
comparing the second page result with the first page result comprises:
comparing the second page result with the first page result in the areas other than the content logic area.
Performing multiple page loads and comparing the differences between the pages loaded each time to generate the content logic area comprises:
taking a screenshot of each loaded page, comparing the differences among the screenshots, and recording the differing pixel points;
generating, from the differing pixel points, a plurality of rectangular areas enclosing them;
merging adjacent rectangular areas into a content logic area.
Wherein comparing the second page result to the first page result comprises,
judging whether the page has offset or not;
calculating a page offset value if a page offset exists;
and performing page alignment according to the page offset value and then comparing.
Wherein, judging whether the page has the offset comprises:
looping from the first row of the page, comparing whether any other row has the same color feature value (over red, green and blue) as the current row; if so, continuing to compare, one by one, whether the color feature values of the rows within a set threshold range are all equal; if they are all equal, determining that the currently compared page has an offset; otherwise, determining that no page offset has occurred;
wherein calculating the page offset value comprises:
and calculating the position difference of the two offset rows, wherein the position difference is the page offset value.
Wherein the content logic area is configured to display a first color and the determined problematic feature area is configured to display a second color.
In another aspect, the present invention further provides an apparatus for detecting a feature area of a web page, including:
a reference result generating unit, used for generating a first page result of the page while filtering is working normally;
a comparison page generating unit, used for obtaining a second page result of the page after a set threshold time has elapsed;
and a feature area determining unit, used for comparing the second page result with the first page result and, if differing areas exist, determining the differing areas to be feature areas that indicate a problem.
Wherein the reference result generating unit includes:
the loading module is used for executing page loading for multiple times;
the difference searching module is used for comparing the differences between the pages obtained from the multiple page loads;
and the content area generating module is used for generating a content logic area from the differences between the pages.
Wherein, the reference result generating unit further comprises:
the screenshot module is used for screenshot for each loaded page;
and the rectangular area generating module is used for generating a plurality of rectangular areas according to the different pixel points for the content area generating module to combine the plurality of rectangular areas into a content logic area.
Wherein, the characteristic region determining unit includes:
a comparison module for comparing the second page result with the first page result;
the offset judgment module is used for judging whether the page has an offset when, during comparison of the first page screenshot and the second page screenshot, the currently compared rows differ;
the offset value calculating module is used for calculating the page offset value when the offset judgment module judges that the page has an offset;
the alignment module is used for aligning the page according to the page offset value;
and the characteristic region determining module is used for determining the finally determined difference region after the page alignment as the characteristic region of the webpage.
The method and the device for detecting a webpage feature area first generate a first page result of the web page while filtering is working normally, and generate a second page result of the web page after a set threshold time has elapsed. The second page result is then compared with the first page result, and any area that differs is determined to be a feature area that indicates a problem. In the web page advertisement filtering scenario, these feature areas are advertisement areas; they may arise because an advertisement filtering rule has become invalid, so that an advertisement that should have been filtered appears, or because the filtering rules do not yet cover a new advertisement. By comparing the web page against a reference page captured while filtering was working normally, the method and the device can quickly detect the feature areas (advertisement areas) in the web page, surface problems quickly, and provide a reference for subsequent web page filtering, so that the filtering rules can be adjusted and a better filtering effect obtained.
To the accomplishment of the foregoing and related ends, one or more aspects of the invention comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the annexed drawings set forth in detail certain illustrative aspects of the invention. These aspects are indicative, however, of but a few of the various ways in which the principles of the invention may be employed. Further, the present invention is intended to include all such aspects and their equivalents.
Drawings
Other objects and results of the present invention will become more apparent and more readily appreciated as the same becomes better understood by reference to the following description and appended claims, taken in conjunction with the accompanying drawings. In the drawings:
FIG. 1 is a flowchart of a method for detecting a characteristic region of a web page according to an embodiment of the present invention;
FIG. 2 is a detailed flowchart of an embodiment of a method for detecting a characteristic region of a web page according to the present invention;
FIG. 3 is a block diagram of an apparatus for detecting characteristic regions of a web page according to the present invention;
FIG. 4 is a block diagram of a reference result generating unit of an embodiment of an apparatus for detecting a characteristic region of a web page according to the present invention;
FIG. 5 is a block diagram illustrating a feature area determination unit according to an embodiment of an apparatus for detecting a feature area of a web page.
The same reference numbers in all figures indicate similar or corresponding features or functions.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The method and the device for detecting the webpage feature area can quickly detect the feature area in the webpage, are convenient for quickly finding problems during filtering the webpage advertisements, provide reference basis for subsequent filtering processing of the webpage advertisements, and can adjust the filtering rules so as to obtain better filtering effect.
FIG. 1 is a flow chart illustrating a method of detecting a characteristic region of a web page according to an embodiment of the present invention.
As shown in FIG. 1, the method for detecting a characteristic region of a web page according to the present invention includes:
S110: generating a first page result of the page while filtering is working normally;
"Filtering working normally" means that advertisement filtering is performed when the page is loaded, so that an advertisement-free page is obtained; this page is the first page result.
Generating the first page result of the page while filtering is working normally in this step comprises: generating a first page result in which a content logic area of the page is marked off, the content logic area being generated by performing multiple page loads, comparing the differences between the pages loaded each time, and merging the differing parts. Specifically, the page is loaded multiple times with advertisement filtering performed during each load; a screenshot is taken of each loaded page; the screenshots are compared and the differing pixel points are recorded; a plurality of rectangular areas are generated from the differing pixel points; and adjacent rectangular areas are merged into a content logic area. A screenshot of the page containing the content logic area is then taken as the first page result.
Here, the first page result and the information of the content logical area need to be saved.
For example, after two page loads are performed, the differences between the two loaded pages are compared, and the page with the differing points recorded is saved as the first page result. The two page loads must be separated by a certain time interval; for example, the same URL is loaded twice about 3 to 7 days apart, a screenshot is taken each time to form two screenshots, and the two screenshots are compared to record the differing pixel points. A plurality of rectangular areas are generated from the recorded differing pixel points. Adjacent rectangular areas are then merged into a content logic area, and the page screenshot containing the content logic area is determined to be the first page result. The information of the content logic area also needs to be saved.
S120: obtaining a second page result of the page after a set threshold time has elapsed;
This page load must be performed a certain time interval after the first page result is formed; that is, the second page result of the page is generated after a set threshold time, for example 3 to 7 days. Here the same URL is loaded and advertisement filtering is again performed during loading; the resulting page is the second page result.
S130: comparing the second page result with the first page result, and if differing areas exist, determining the differing areas to be feature areas that indicate a problem.
A full-page comparison may be performed at this step, for example by comparing the entire page screenshots, with differences inside the main content area of the web page ignored.
Another, preferred way is to compare only the areas other than the content logic area. The content logic area formed in S110 is taken to be the main content area of the web page, and in S130 the areas other than the main content area are compared. If these areas differ, new content has appeared in them, and they are determined to be feature areas. These feature areas may be new advertisement areas. In this embodiment only the areas outside the content logic area, that is, outside the main content area of the web page, are compared; relative to comparing entire page screenshots, this reduces the image comparison workload, saves time, and makes the comparison faster.
The method for detecting a webpage feature area first generates a first page result of the web page while filtering is working normally, and generates a second page result of the web page after a set threshold time has elapsed. The second page result is then compared with the first page result, and any area that differs is determined to be a feature area that indicates a problem. In the web page advertisement filtering scenario, these feature areas are advertisement areas; they may arise because an advertisement filtering rule has become invalid, so that an advertisement that should have been filtered appears, or because the filtering rules do not yet cover a new advertisement. By comparing the web page against a reference page captured while filtering was working normally, the method can quickly detect the feature areas (advertisement areas) in the web page, surface problems quickly, and provide a reference for subsequent web page filtering, so that the filtering rules can be adjusted and a better filtering effect obtained.
Fig. 2 is a detailed flowchart of an embodiment of a method for detecting a feature area of a web page according to an embodiment of the present invention.
As shown in fig. 2, the method for detecting a web page feature area according to the embodiment of the present invention includes:
First, S200 is executed: the page is loaded and a screenshot is taken of it to generate a base page screenshot. Advertisement filtering must be performed while the page loads, and the screenshot is taken of the filtered page. The page is then loaded once more and a screenshot is taken to generate a secondary page screenshot (S210); advertisement filtering must also be performed during this load. The two load-and-screenshot operations must be separated by a certain time interval; that is, S210 is executed after a set threshold time, for example 3 to 7 days. The threshold time is set to ensure that the main content of the web page changes during that time. Within a relatively short interval the page content is likely not to change, in which case the main content of the web page cannot be identified.
S220 is then executed: the base page screenshot is compared with the secondary page screenshot, and the differing pixel points are recorded. The screenshots are compared using a conventional image comparison method, which is not described further here.
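The comparison in S220 can be sketched as follows. This is a minimal illustration, assuming each screenshot is already decoded into rows of (R, G, B) tuples; the `Image` type and the function name `diff_pixels` are hypothetical and do not come from the patent, which leaves the image comparison method open.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B)
Image = List[List[Pixel]]      # image[y][x], rows of pixels (assumed representation)


def diff_pixels(base: Image, secondary: Image) -> List[Tuple[int, int]]:
    """Return the (x, y) coordinates at which the two screenshots differ."""
    diffs = []
    height = min(len(base), len(secondary))
    for y in range(height):
        width = min(len(base[y]), len(secondary[y]))
        for x in range(width):
            if base[y][x] != secondary[y][x]:
                diffs.append((x, y))
    return diffs


WHITE, RED = (255, 255, 255), (255, 0, 0)
base = [[WHITE] * 4 for _ in range(3)]
secondary = [row[:] for row in base]
secondary[1][2] = RED          # simulate changed main content at x=2, y=1
print(diff_pixels(base, secondary))
```

Only the overlapping part of the two images is compared here; handling screenshots of different sizes is outside the scope of the sketch.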
After the page screenshot recording the differing pixel points is obtained, S230 is executed to generate a plurality of rectangular areas from the differing pixel points. In S230, the differing pixel points are scanned row by row, and any two differing pixel points whose distance is within a certain range are placed in the same rectangular area. In this embodiment, "within a certain range" means that two pixel points satisfying dx² + dy² < 1000 are considered adjacent and are placed in one rectangular area, where dx is the horizontal distance and dy is the vertical distance between them.
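One way to read S230 is as a greedy grouping pass over the differing pixel points. The sketch below is an assumption about the details: a point joins the first existing group that already contains a point within the dx² + dy² < 1000 threshold, and each group is then reduced to its bounding rectangle. The names `group_into_rects` and `ADJACENT_LIMIT` are illustrative only.

```python
from typing import List, Tuple

Point = Tuple[int, int]            # (x, y)
Rect = Tuple[int, int, int, int]   # (left, top, right, bottom), inclusive

ADJACENT_LIMIT = 1000              # threshold from the text: dx^2 + dy^2 < 1000


def group_into_rects(points: List[Point]) -> List[Rect]:
    """Group differing pixels row by row and return one bounding box per group."""
    groups: List[List[Point]] = []
    for x, y in sorted(points, key=lambda p: (p[1], p[0])):   # scan row by row
        for group in groups:
            if any((x - gx) ** 2 + (y - gy) ** 2 < ADJACENT_LIMIT for gx, gy in group):
                group.append((x, y))
                break
        else:
            groups.append([(x, y)])                            # start a new rectangle
    return [
        (min(px for px, _ in g), min(py for _, py in g),
         max(px for px, _ in g), max(py for _, py in g))
        for g in groups
    ]


rects = group_into_rects([(0, 0), (10, 5), (200, 200), (210, 205)])
print(rects)
```

The pairwise check is quadratic in the number of differing pixels, which is acceptable for a sketch but would need a spatial index on large screenshots. Groups that end up close to each other are left for the later merging step.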
After the rectangular areas have been formed, S240 is executed to merge adjacent rectangular areas into a content logic area. The content logic area is recorded as the content area of the page. Merging adjacent rectangular areas in S240 means that two rectangular areas are merged into one content logic area when the distance between them is within a certain range; that is, two rectangular areas whose distance satisfies dx² + dy² < 1000 are considered adjacent and are merged into one content logic area, where dx is the horizontal distance and dy is the vertical distance between them. After S240 is finished, S250 is executed: the page screenshot carrying the content logic area is determined to be the first page screenshot, and the first page screenshot and the information of the content logic area are stored. The first page screenshot here corresponds to the first page result in S110 of the previous embodiment. The information of the content logic area includes at least the position of the content logic area in the page screenshot.
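The merging in S240 can be sketched as repeated pairwise merging until no two rectangles are adjacent. The text does not say how the distance between two rectangles is measured; the sketch assumes dx and dy are the gaps between the nearest edges (zero when the rectangles overlap on that axis). The function names are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]   # (left, top, right, bottom), inclusive

ADJACENT_LIMIT = 1000              # threshold from the text: dx^2 + dy^2 < 1000


def axis_gap(a_lo: int, a_hi: int, b_lo: int, b_hi: int) -> int:
    """Gap between two intervals on one axis; 0 if they overlap (assumed metric)."""
    return max(0, b_lo - a_hi, a_lo - b_hi)


def rects_adjacent(a: Rect, b: Rect) -> bool:
    dx = axis_gap(a[0], a[2], b[0], b[2])
    dy = axis_gap(a[1], a[3], b[1], b[3])
    return dx * dx + dy * dy < ADJACENT_LIMIT


def merge_rects(rects: List[Rect]) -> List[Rect]:
    """Merge adjacent rectangles until none remain adjacent."""
    regions = list(rects)
    changed = True
    while changed:
        changed = False
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                a, b = regions[i], regions[j]
                if rects_adjacent(a, b):
                    regions[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                  max(a[2], b[2]), max(a[3], b[3]))
                    del regions[j]
                    changed = True
                    break
            if changed:
                break
    return regions


content_regions = merge_rects([(0, 0, 10, 10), (20, 0, 30, 10), (500, 500, 510, 510)])
print(content_regions)
```

The first two rectangles are 10 pixels apart horizontally (10² < 1000) and are merged; the third is far from both and stays separate.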
S260 is then executed: the page is loaded and a screenshot is taken of it to generate the second page screenshot. The second page screenshot here corresponds to the second page result in S120 of the previous embodiment. Page loading and screenshotting in this step are the same as in S200 and S210, and advertisement filtering must be performed during loading. S260 is executed once a time threshold has elapsed after S250 completes, for example after 3 to 7 days.
S270 is then executed: the first page screenshot is compared with the second page screenshot, that is, the page screenshot carrying the content logic area is compared with the screenshot from the third page load. In this embodiment, the screenshots are compared in the areas other than the content logic area determined in S250, using the position information of the content logic area stored in S250. The screenshots are compared by progressive (row-by-row) scanning: when the color feature values of two rows are the same, the contents of the two rows are considered the same. Note that the comparison in this step can also be performed over the whole page screenshot, with changes in the content logic area ignored. Comparing only the areas outside the content logic area, that is, outside the main content area of the web page, reduces the image comparison workload relative to comparing the whole screenshot, saves time, and makes the comparison faster.
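The row-by-row comparison in S270 can be illustrated with a simplified sketch. It assumes screenshots are rows of (R, G, B) tuples and that content logic areas are stored as inclusive rectangles; for clarity it compares pixels directly rather than per-row color feature values. All names are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]       # (R, G, B)
Image = List[List[Pixel]]          # image[y][x]
Rect = Tuple[int, int, int, int]   # (left, top, right, bottom), inclusive


def inside_any(x: int, y: int, regions: List[Rect]) -> bool:
    """True if (x, y) lies inside one of the stored content logic areas."""
    return any(l <= x <= r and t <= y <= b for l, t, r, b in regions)


def differing_rows(first: Image, second: Image, content_regions: List[Rect]) -> List[int]:
    """Rows that differ outside the content logic areas (candidate feature rows)."""
    rows = []
    for y in range(min(len(first), len(second))):
        width = min(len(first[y]), len(second[y]))
        if any(first[y][x] != second[y][x] and not inside_any(x, y, content_regions)
               for x in range(width)):
            rows.append(y)
    return rows


WHITE, RED = (255, 255, 255), (255, 0, 0)
first = [[WHITE] * 4 for _ in range(4)]
second = [row[:] for row in first]
second[0][1] = RED                 # change inside the content area: ignored
second[3][2] = RED                 # change outside the content area: reported
regions = [(0, 0, 3, 1)]           # content logic area covering rows 0 and 1
print(differing_rows(first, second, regions))
```

Changes inside the stored content logic area are skipped, so only row 3 is reported as differing.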
Because the network is unstable and the running time of scripts in the page is uncertain, a page screenshot, or some area of it, may be shifted down or up as a whole relative to the first page result. A direct comparison then reports a difference even though the overall frame of the page has not actually changed. Therefore, in a preferred embodiment, when the first page screenshot and the second page screenshot are compared and the currently compared rows differ, the method first proceeds to S280 to judge whether the page has an offset.
The method for judging whether the page has the offset in the embodiment is as follows:
firstly, calculating a color characteristic value for each line of a region picture to be compared, wherein the color characteristic value calculation mode of the jth line is as follows:
Here i denotes a column; j denotes a row; jrowColor denotes the color feature value of the entire row; color(i, j) is the value of the three colors R, G, B of the current pixel; and Width denotes the maximum width of the current row.
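The formula itself is not reproduced in this text, so the following is only one plausible reading of the definitions above: the row feature value is taken as the sum, over all columns i of row j, of the pixel's R, G, B values packed into a single integer. Both the packing and the summation are assumptions, and the function names are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B)
Image = List[List[Pixel]]      # image[j][i]: row j, column i


def color(image: Image, i: int, j: int) -> int:
    """Packed 24-bit RGB value of the pixel at column i, row j (assumed encoding)."""
    r, g, b = image[j][i]
    return (r << 16) | (g << 8) | b


def row_color(image: Image, j: int) -> int:
    """Assumed row feature: sum of packed pixel values across the whole row."""
    return sum(color(image, i, j) for i in range(len(image[j])))


image = [[(255, 0, 0), (0, 0, 1)],
         [(0, 0, 0), (0, 0, 0)]]
print(row_color(image, 0), row_color(image, 1))
```

A sum is cheap to compute but is not collision-free: two different rows can share the same value, which is one reason the offset check compares several consecutive rows.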
Then, looping from the first row of the page, the method compares whether any other row has the same color feature value (over red, green and blue) as the current row. If so, it continues to compare, one by one, whether the color feature values of the rows within a set threshold range are all equal. If they are all equal, a page offset is determined to have occurred in the currently compared page; otherwise, no page offset has occurred.
If there is a page offset, S281 is performed to calculate a page offset value. And calculating the position difference of the two offset rows, wherein the position difference is the page offset value.
After the page offset value is calculated, page alignment is performed based on the page offset value (S282). After the page alignment is completed, the process returns to S270, and at this time, S270 starts comparison from the aligned region to the back.
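Steps S280 to S282 can be sketched on top of per-row feature values. The sketch assumes each screenshot has already been reduced to a list of row feature values, and that a run of `window` consecutive matching rows (the "set threshold range", whose size the text does not give) is enough to declare an offset. The function name and the default window are hypothetical.

```python
from typing import List, Optional


def find_page_offset(first_rows: List[int], second_rows: List[int],
                     start: int, window: int = 5) -> Optional[int]:
    """Return the row shift that re-aligns second_rows with first_rows at `start`.

    A shift s is returned when first_rows[start + n] == second_rows[start + s + n]
    for every n in range(window); None means no page offset was found.
    """
    reference = first_rows[start:start + window]
    if len(reference) < window:
        return None
    for k in range(len(second_rows) - window + 1):
        if k == start:
            continue                      # same position: not an offset
        if second_rows[k:k + window] == reference:
            return k - start              # position difference = page offset value
    return None


first_rows = [1, 2, 3, 4, 5, 6, 7, 8]
shifted = [9, 9, 1, 2, 3, 4, 5, 6]        # same page pushed down by two rows
changed = [1, 2, 0, 4, 5, 6, 7, 8]        # row 2 really changed, nothing shifted
print(find_page_offset(first_rows, shifted, start=0, window=3))
print(find_page_offset(first_rows, changed, start=2, window=3))
```

With an offset of 2, comparison would resume with row j of the first screenshot aligned to row j + 2 of the second. Large uniform areas (for example blank margins) produce many identical row values, so a longer window reduces false matches.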
If it is determined in S280 that the page does not have a page offset, it is finally determined that there is a difference in the currently compared page areas.
S290 is then executed: the finally determined difference area is determined to be a feature area of the web page.
In a preferred embodiment, the content logic area may be configured to display a first color and the determined problematic feature area may be configured to display a second color. This makes the feature area easy to identify.
Compared with comparing entire page screenshots, comparing only the areas outside the content logic area reduces the comparison workload, saves time, and makes the comparison faster.
The method for detecting a webpage feature area ignores changes in the main content of the web page. If areas other than the main content change, those areas are judged to be feature areas, that is, they are likely to be new advertisement areas; the cause may be that an advertisement filtering rule has become invalid, so that an advertisement that should have been filtered appears, or that the filtering rules do not yet cover a new advertisement. By comparing the web page against a reference page captured while filtering was working normally, the method can quickly detect the feature areas (advertisement areas) in the web page, surface problems quickly, and provide a reference for subsequent web page filtering, so that the filtering rules can be adjusted and a better filtering effect obtained.
The invention also provides a device for detecting the webpage characteristic area.
Fig. 3 is a block diagram of an apparatus for detecting a characteristic region of a web page according to the present invention.
As shown in fig. 3, an apparatus for detecting a characteristic region of a web page according to the present invention includes: a reference result generating unit 300, a comparison page generating unit 310, and a characteristic region determining unit 320.
The reference result generating unit 300 is configured to generate a first page result of the page while filtering is working normally.
"Filtering working normally" means that advertisement filtering is performed when the page is loaded, so that an advertisement-free page is obtained; this page is the first page result.
FIG. 4 is a block diagram illustrating a reference result generating unit of a preferred embodiment of an apparatus for detecting a characteristic region of a web page according to the present invention. The reference result generating unit 300 shown in FIG. 4 includes:
a loading module 301, configured to perform multiple page loads. The loading module 301 performs advertisement filtering each time a page is loaded.
The difference searching module 302 is configured to compare differences of web pages that are subjected to multiple page loads.
A content area generating module 303, configured to generate a content logic area from the difference of the web page.
In a preferred embodiment, the reference result generating unit 300 further includes:
a screenshot module 304, configured to take a screenshot of each loaded page. Specifically, the loading module 301 performs multiple page loads with advertisement filtering performed during loading, and the screenshot module 304 then takes a screenshot of each loaded page.
In this case, the difference searching module 302 is configured to obtain the differing pixel points by progressive (row-by-row) scanning. A rectangular region generating module 305 is configured to generate a plurality of rectangular regions from the differing pixel points, so that the content area generating module 303 can merge the rectangular regions into a content logic area. Any two differing pixel points whose distance is within a certain range are placed in the same rectangular region. In this embodiment, "within a certain range" means that two pixel points satisfying dx² + dy² < 1000 are considered adjacent and are placed in one rectangular region by the rectangular region generating module 305, where dx is the horizontal distance and dy is the vertical distance between them.
The content area generating module 303 is then configured to merge adjacent rectangular regions generated by the rectangular region generating module 305; that is, two rectangular regions are merged into one content logic area when the distance between them is within a certain range. For example, two rectangular regions whose distance satisfies dx² + dy² < 1000 are considered adjacent and are merged by the content area generating module 303 into one content logic area, where dx is the horizontal distance and dy is the vertical distance between them.
A first page result determining module 306, configured to determine a page screenshot of the content logical area as a first page result.
A saving module (not shown in the figure), configured to save the first page result and the information of the content logical area.
At an interval of 3 to 7 days, the loading module 301 performs successive page loads of the same URL, and the screenshot module 304 captures each loaded page, yielding two screenshots. The difference searching module 302 compares the two screenshots and records the difference pixel points. The rectangular region generating module 305 then generates a plurality of rectangular regions from the recorded difference pixel points, and the content region generation module 303 merges adjacent rectangular regions into a content logical region. The first page result determination module 306 determines the page screenshot containing the content logical area as the first page result. Finally, the saving module saves the first page result and the information of the content logical area.
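The comparison of the two screenshots can be illustrated with a short sketch. It assumes each screenshot is already decoded into a list of rows of `(R, G, B)` tuples and that both screenshots have the same dimensions; the function name `diff_pixels` is hypothetical and how the screenshots are captured is outside the sketch.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B)
Image = List[List[Pixel]]      # rows of pixels, top to bottom


def diff_pixels(shot_a: Image, shot_b: Image) -> List[Tuple[int, int]]:
    """Scan two equally sized screenshots row by row and
    return the (x, y) positions where the pixels differ."""
    points = []
    for y, (row_a, row_b) in enumerate(zip(shot_a, shot_b)):
        if row_a == row_b:
            continue  # identical rows need no per-pixel check
        for x, (pa, pb) in enumerate(zip(row_a, row_b)):
            if pa != pb:
                points.append((x, y))
    return points
```

The recorded positions are the input for building rectangular regions and, after merging, the content logical area.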
The comparison page generating unit 310 of the present invention shown in fig. 3 is configured to obtain a second page result of the page.
After the set threshold time has elapsed, the page at the same URL is loaded again, with advertisement filtering executed during loading, and the second page result is formed. This page load must be performed a certain time, for example 3 to 7 days, after the reference result generating unit 300 has generated the first page result.
The characteristic region determining unit 320 shown in fig. 3 is configured to compare the second page result with the first page result, and if a different region is found, determine that the different region is a characteristic region causing a problem.
In a preferred embodiment, the characteristic region determining unit 320 compares regions other than the content logical region. The content logical area generated by the content area generation module 303 is treated as the main content area of the web page, and the characteristic area determination unit 320 compares the areas outside this main content area. If these areas differ, this indicates that new content has appeared in them. Such feature areas may be new advertisement areas.
In a preferred embodiment, the content logic area may be configured to display a first color and the determined problematic feature area may be configured to display a second color. This makes the characteristic region easy to identify.
It is noted that the characteristic region determination unit 320 may also compare the entire pages and simply ignore changes within the content logical region. However, comparing only the areas outside the content logic area, that is, outside the main content area of the web page, reduces the comparison workload relative to comparing the whole page screenshot, which saves time and makes the comparison faster.
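One way to ignore changes inside the content logical area during comparison is sketched below. It is an illustration under stated assumptions: screenshots are lists of rows of `(R, G, B)` tuples with equal dimensions, content logical areas are given as inclusive `(left, top, right, bottom)` rectangles, and the names `in_any_rect` and `diff_outside_content` are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Rect = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def in_any_rect(x: int, y: int, rects: List[Rect]) -> bool:
    """True when (x, y) lies inside at least one rectangle."""
    return any(l <= x <= r and t <= y <= b for (l, t, r, b) in rects)


def diff_outside_content(first: Image, second: Image,
                         content_rects: List[Rect]) -> List[Tuple[int, int]]:
    """Return differing pixel positions, skipping the content logical areas."""
    diffs = []
    for y, (row_a, row_b) in enumerate(zip(first, second)):
        for x, (pa, pb) in enumerate(zip(row_a, row_b)):
            if pa != pb and not in_any_rect(x, y, content_rects):
                diffs.append((x, y))
    return diffs
```

Any position returned here lies outside the main content, so it is a candidate for a feature area such as a newly appearing advertisement.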
FIG. 5 is a block diagram of the feature region determining unit in an embodiment of the apparatus for detecting a webpage feature area according to the present invention.
As shown in FIG. 5, the feature region determining unit 320 includes a comparing module 321, an offset determining module 322, an offset value calculating module 323, an aligning module 324, and a feature region determining module 325.
A comparing module 321, configured to compare the second page result with the first page result.
The comparison module 321 compares the page areas outside the content logical area generated by the content area generation module 303. The content logical area is treated as the main content area of the web page, and the comparison module 321 compares the areas outside it, such as advertisement areas. The comparing module 321 compares the screenshots by progressive (row-by-row) scanning: when the color feature values of two corresponding rows are the same, the two rows are considered identical. If these areas differ, new content has appeared in them, and they are determined to be feature areas. These feature areas may be new advertisement areas.
Because the network is unstable and the running time of scripts in the page is uncertain, the screenshot of a page, or of some area within it, may be shifted downward or upward, as a whole or in part, relative to the first page result. A direct comparison then certainly reports a difference, although the frame of the whole page has not actually changed. Therefore, in a preferred embodiment, when the comparing module 321 finds a difference in the region outside the content logic region, specifically when the currently compared row differs, the offset determination module 322 determines whether the page has an offset.
In this embodiment, the method for determining whether the page has an offset by the offset determining module 322 is as follows:
firstly, calculating a color characteristic value for each line of a region picture to be compared, wherein the color characteristic value calculation mode of the jth line is as follows:
Here i denotes the column; j denotes the row; jrowColor denotes the color feature value of the entire row; color(i, j) is the value of the three colors R, G and B of the current pixel; and Width denotes the maximum width of the current row.
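The formula itself does not appear in this text. Purely as an illustration, the sketch below assumes that the color feature value of row j is the per-channel sum of color(i, j) over all columns i of the row; this is a guess consistent with the symbol definitions, not a confirmed reconstruction, and the name `row_color` is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B)


def row_color(row: List[Pixel]) -> Tuple[int, int, int]:
    """Color feature value of one row.

    Assumption: the feature is the sum of each channel over the row,
    i.e. jrowColor = sum of color(i, j) for i in 0 .. Width - 1.
    """
    r = sum(p[0] for p in row)
    g = sum(p[1] for p in row)
    b = sum(p[2] for p in row)
    return (r, g, b)
```

Under this assumption two rows with the same pixels in a different order share a feature value, so equal feature values suggest, but do not prove, that two rows are identical.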
Then, starting from the first row of the page, a loop checks whether the color feature value of any other row equals the R, G and B color feature values of the current row. If such a row is found, the color feature values of the following rows within a set threshold range are compared one by one; if all of them are equal, a page offset is determined to have occurred in the currently compared page; otherwise, no page offset has occurred.
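The search described above can be sketched as follows. The sketch works on precomputed row feature values (any comparable values, such as integers or tuples). The parameter `window`, standing in for the "set threshold range", the function name `find_offset`, and the convention of returning the shift in rows are assumptions of this illustration.

```python
from typing import List, Optional, Sequence


def find_offset(ref_rows: Sequence, cur_rows: Sequence,
                start: int, window: int = 3) -> Optional[int]:
    """Look for a vertical shift of the current page relative to the reference.

    ref_rows / cur_rows hold one feature value per row. `start` is the row
    at which the two pages were found to differ. Returns the shift k such
    that cur_rows[start + k + n] == ref_rows[start + n] for `window`
    consecutive rows, or None when no such shift exists.
    """
    if start + window > len(ref_rows):
        return None
    expected = list(ref_rows[start:start + window])
    for k in range(len(cur_rows) - window + 1):
        if k == start:
            continue  # same position: already known to differ
        if list(cur_rows[k:k + window]) == expected:
            return k - start
    return None
```

For example, `find_offset([1, 2, 3, 4, 5, 6], [9, 9, 1, 2, 3, 4], start=0)` returns `2`, meaning the current page is shifted down by two rows. Pages with long runs of identical rows (for instance blank bands) can produce false matches with a small `window`, so a larger value may be safer in practice.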
The offset value calculating module 323 is configured to calculate a page offset value when the offset determining module 322 determines that the page has an offset. That is, it calculates the difference between the positions of the two matching rows, and this position difference is the page offset value.
An alignment module 324, configured to perform page alignment according to the page offset value.
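Alignment followed by comparison can be illustrated as below. The sketch assumes one feature value per row and an offset expressed as a number of rows (positive when the current page is shifted downward); the function name `align_and_diff` and the choice to skip rows that have no counterpart after shifting are assumptions.

```python
from typing import List, Sequence


def align_and_diff(ref_rows: Sequence, cur_rows: Sequence,
                   offset: int) -> List[int]:
    """Compare rows after compensating for a vertical page offset.

    Row j of the reference is compared with row j + offset of the
    current page. Returns the reference row indices that still differ.
    Rows without a counterpart after shifting are skipped.
    """
    diffs = []
    for j, ref in enumerate(ref_rows):
        k = j + offset
        if 0 <= k < len(cur_rows) and cur_rows[k] != ref:
            diffs.append(j)
    return diffs
```

Rows that still differ after alignment represent genuine changes rather than a mere shift of the page.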
The feature region determining module 325 is configured to determine the difference region that remains after page alignment as a feature region of the web page. The apparatus for detecting a webpage feature area ignores changes in the main content and compares only the areas outside it. If these areas have changed, they are determined to be feature areas, that is, they are likely to be new advertisement areas. The likely cause is either that an advertisement filtering rule has become invalid, so that a previously filtered advertisement reappears, or that a new advertisement not covered by the filtering rules has appeared. Therefore, by comparing the web page with a reference web page generated while filtering was working normally, the method and apparatus can quickly detect the feature area (advertisement area) in the web page, find problems promptly, and provide a reference for subsequent web page filtering, so that the filtering rules can be adjusted and a better filtering effect obtained.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (4)
1. A method for detecting a webpage feature area comprises the following steps:
executing page loading for multiple times, executing advertisement filtering during loading, performing screenshot on each loaded page, comparing differences of the screenshots, and recording pixels with the differences;
generating a plurality of rectangular areas surrounding the differential pixel points according to the differential pixel points;
combining adjacent rectangular areas into a content logic area, and generating a first page result of a page under the condition of normal effectiveness of filtering, wherein the first page result is a page screenshot containing the content logic area;
after a set threshold time has elapsed, executing page loading under the condition of normal effectiveness of filtering, and generating, according to a screenshot of the loaded page, a second page result for dividing a content logic area;
and comparing the second page result with the area except the content logic area in the first page result, and if different areas exist, determining the different areas as characteristic areas causing problems.
2. The method of detecting a characteristic region of a web page of claim 1, wherein comparing the second page result with the first page result includes,
judging whether the page has offset or not;
calculating a page offset value if a page offset exists;
and performing page alignment according to the page offset value and then comparing.
3. The method of detecting characteristic regions of a web page of claim 1, wherein the content logic region is configured to display a first color and the determined problematic characteristic region is configured to display a second color.
4. An apparatus for detecting a feature area of a web page, comprising:
a reference page generating unit, the reference page generating unit including: the loading module is used for executing page loading for multiple times and executing advertisement filtering during loading; the screenshot module is used for screenshot for each loaded page; the difference searching module is used for comparing the differences of all screenshots and recording the pixels with the differences; a rectangular region generation module for generating a plurality of rectangular regions surrounding the differential pixel points according to the differential pixel points; the content area generation module is used for combining adjacent rectangular areas into a content logic area and generating a first page result of a page under the condition of normal effectiveness of filtering, wherein the first page result is a page screenshot containing the content logic area;
the comparison page generating unit is used for executing page loading under the condition of normal effectiveness of filtering after a set threshold time has elapsed, and generating a second page result for dividing a content logic area according to the screenshot of the loaded page;
and the characteristic region determining unit is used for comparing the second page result with the region except the content logic region in the first page result, and if different regions exist, determining the different regions as characteristic regions causing problems.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410245946.4A CN105243062B (en) | 2014-06-04 | 2014-06-04 | Method and device for detecting webpage feature area |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN105243062A (en) | 2016-01-13 |
| CN105243062B (en) | 2020-10-30 |
Family
ID=55040714
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410245946.4A Active CN105243062B (en) | 2014-06-04 | 2014-06-04 | Method and device for detecting webpage feature area |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN105243062B (en) |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | TA01 | Transfer of patent application right | Effective date of registration: 20200526. Address after: 310051 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province. Applicant after: Alibaba (China) Co.,Ltd. Address before: 510627 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping B radio 14 floor tower square. Applicant before: GUANGZHOU UCWEB COMPUTER TECHNOLOGY Co.,Ltd. |
| | GR01 | Patent grant | |

