CN114528457B - Web fingerprint detection method and related equipment - Google Patents

Web fingerprint detection method and related equipment

Publication number
CN114528457B
Authority
CN
China
Prior art keywords
web
information
target site
fingerprint
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111681406.7A
Other languages
Chinese (zh)
Other versions
CN114528457A (en)
Inventor
金正平
刘冰
张承宇
秦素娟
时忆杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
National Computer Network and Information Security Management Center
Original Assignee
Beijing University of Posts and Telecommunications
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications, National Computer Network and Information Security Management Center filed Critical Beijing University of Posts and Telecommunications
Priority to CN202111681406.7A priority Critical patent/CN114528457B/en
Publication of CN114528457A publication Critical patent/CN114528457A/en
Application granted granted Critical
Publication of CN114528457B publication Critical patent/CN114528457B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application provides a Web fingerprint detection method and related equipment. The method comprises: crawling the source code of a plurality of web pages from a target site using a web crawler, and acquiring key information of static file paths from the source code; sending, by the web crawler, a predefined HTTP request to a host server of the target site, and acquiring header information of the host server's response message; identifying a content management system (CMS) type by matching the key information against a Web fingerprint library; predicting a Web server type from the header information using a machine learning model; and scanning the open ports of the host server and the services corresponding to those ports with a network port scanning tool to detect the host port fingerprint. The method achieves comprehensive, accurate and efficient detection of the Web component information of the target site.

Description

Web fingerprint detection method and related equipment
Technical Field
The application relates to Web (World Wide Web) security technology, and in particular to a Web fingerprint detection method and related equipment.
Background
With the rapid development of Internet technology, the number of Web sites has grown rapidly and their application scenarios have diversified. Web applications are now widely used in social networking, banking, online shopping, Web mail, blogs and other fields closely tied to daily life. While Web applications bring great convenience, the security problems of Web sites constantly threaten people's personal and property safety. Open source components are widely used to build Web sites, but these components may themselves contain vulnerabilities and defects that attackers can readily exploit.
Most current Web attacks exploit known vulnerabilities in Web components to obtain elevated privileges on the server and access to important data, so the security of a Web site depends critically on whether the components it uses are vulnerable. Web fingerprint identification is therefore an important step in the information-gathering phase that precedes security testing of a target site: accurately identifying the site's Web component information improves the efficiency of security testing and greatly helps in formulating a penetration testing strategy. However, related Web component detection systems suffer from technical problems such as low accuracy, poor extensibility, incomplete detection and low detection efficiency.
Disclosure of Invention
In view of the above, the present application aims to provide a Web fingerprint detection method and related devices.
Based on the above object, a first aspect of the present application provides a Web fingerprint detection method, including:
crawling the web page source code of a plurality of web pages from a target site using a web crawler, and acquiring key information of static file paths based on the web page source code;
sending, by the web crawler, a predefined HTTP request to a host server of the target site, and acquiring header information of a response message of the host server;
identifying a content management system (CMS) type of the target site by matching the key information against a Web fingerprint library;
predicting a Web server type of the target site from the header information using a trained machine learning model; and
scanning the open ports of the host server and the services corresponding to the open ports using a network port scanning tool, and detecting the host port fingerprint of the target site.
A second aspect of the present application proposes a Web fingerprint detection device comprising:
a crawling module configured to: crawl the source code of a plurality of web pages from a target site using a web crawler, and acquire key information of static file paths based on the source code; and send, by the web crawler, a predefined HTTP request to a host server of the target site to acquire header information of a response message of the host server;
a Web fingerprint detection module configured to: identify the CMS type of the target site by matching the key information against a Web fingerprint library; predict a Web server type of the target site from the header information using a trained machine learning model; and scan the open ports of the host server and the services corresponding to the open ports using a network port scanning tool to detect the host port fingerprint of the target site.
A third aspect of the application proposes an electronic device comprising a memory, a processor and a computer program stored on the memory and executable by the processor, wherein the processor implements the method provided by the first aspect of the application when executing the computer program.
As can be seen from the above, the Web fingerprint detection method and related equipment provided by the application use a web crawler to crawl the source code of a plurality of web pages from the target site and acquire key information of static file paths from that source code; crawling the source code of multiple pages yields higher identification accuracy than crawling the source code of a single page. The web crawler sends a predefined HTTP request to the host server of the target site and acquires the header information of the response message. Because identification relies on this header information as a whole, the Web server can still be identified accurately when the Server field of the response has been modified or deleted, which gives higher accuracy and better extensibility than identification based on fixed rules. The CMS type of the target site is identified by matching the key information against the Web fingerprint library; the Web server type of the target site is predicted from the header information using a trained machine learning model; and the host port fingerprint of the target site is detected by scanning the open ports of the host server and the services corresponding to those ports with a network port scanning tool. The method thus integrates Web server identification, CMS identification and host information detection, solves the problems of low accuracy, poor extensibility, incomplete detection and low detection efficiency in the related art, and achieves comprehensive, accurate and efficient detection of the component information of the target site. Based on the detection results, attacks targeting vulnerabilities in Web components and host services can be effectively prevented, and the security of the Web site maintained.
Drawings
In order to more clearly illustrate the technical solutions of the present application or related art, the drawings that are required to be used in the description of the embodiments or related art will be briefly described below, and it is apparent that the drawings in the following description are only embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort to those of ordinary skill in the art.
FIG. 1 is a flowchart of a Web fingerprint detection method according to an embodiment of the present application;
FIG. 2 is a flowchart of step 400 of a Web fingerprint detection method according to an embodiment of the present application;
FIG. 3 is a flow chart of Web server type detection in an embodiment of the application;
FIG. 4 is a flowchart of step 500 of a Web fingerprint detection method according to an embodiment of the present application;
FIG. 5 is a flowchart of a host port fingerprint detection method according to an embodiment of the present application;
FIG. 6 is a flow chart of information crawling of a crawling module according to an embodiment of the present application;
FIG. 7 is a diagram illustrating a web page relationship according to an embodiment of the present application;
FIG. 8 is a flowchart of step 100 of a Web fingerprint detection method according to an embodiment of the present application;
FIG. 9 is a flow chart of CMS system type detection in accordance with an embodiment of the present application;
FIG. 10 is a flowchart of step 200 of a Web fingerprint detection method according to an embodiment of the present application;
FIG. 11 is a block diagram of a Web fingerprint detection apparatus provided by an embodiment of the present application;
FIG. 12 is a flowchart of an execution process of the Web fingerprint detection module according to an embodiment of the present application;
FIG. 13 is a flowchart of a detection process according to an embodiment of the present application;
FIG. 14 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present application more apparent.
It should be noted that, unless otherwise defined, technical or scientific terms used in the embodiments of the present application should be given the ordinary meaning understood by one of ordinary skill in the art to which the present application belongs. The word "comprising" or "comprises", and the like, means that the element or item preceding the word encompasses the elements or items listed after the word and their equivalents, without excluding other elements or items. The terms "connected" or "coupled", and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
In some embodiments, as shown in fig. 1, a Web fingerprint detection method includes:
Step 100, crawling web source codes of a plurality of web pages from a target site by utilizing a web crawler, and acquiring key information of a static file path based on the web source codes.
In this step, crawling the source code of multiple web pages yields higher identification accuracy than crawling the source code of a single web page.
Step 200, sending, by the web crawler, a predefined HTTP request to the host server of the target site, and acquiring the header information of the response message of the host server.
In this step, the header information of the response message comprises the relative position information of important fields and the content information of important fields. After these features are normalized, the Web server can still be identified accurately when the Server field has been modified or deleted.
Step 300, identifying the content management system (CMS) type of the target site by matching the key information against the Web fingerprint library.
In this step, identification of the CMS is based mainly on keyword information in the web page source code and on the path information of static resource files. In the related art, the CMS is identified with common detection tools such as Wappalyzer and WhatWeb, which mainly fetch the source code of a single page, analyze the comments and keyword information in it, and match them against a Web fingerprint library using regular expressions; if a developer hides or modifies the keyword information, this approach fails. The related art may alternatively collect the static file paths from several tags in the source code of a single response page and match them against the Web fingerprint library to identify the CMS among the Web application components, but this approach places high demands on the completeness of the Web fingerprint library and generalizes poorly. Existing CMS identification therefore relies mainly on the keywords of a single page's source code or on the full paths of static resource files, and both approaches have low identification accuracy and require a highly complete Web fingerprint library. In the method provided by the application, a crawler acquires the source code of a plurality of web pages of the target site, the key static paths of static files are extracted from that source code, and the result is matched against the Web fingerprint library, so that the CMS can be identified accurately. Because the crawler collects source code from multiple pages, the identification accuracy is higher than that of conventional identification based on the source code of a single page.
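The matching in step 300 can be pictured with a small sketch. The specification does not disclose the layout of the Web fingerprint library, so the dictionary below, its entries, and the "most matching paths wins" rule are all assumptions made for illustration; `identify_cms` is a hypothetical helper name.

```python
from collections import Counter

# Illustrative fingerprint library: CMS name -> path keywords.
# These entries are examples only, not the library used by the application.
FINGERPRINT_LIBRARY = {
    "WordPress": {"wp-content", "wp-includes"},
    "Drupal": {"sites/default/files", "core/misc"},
    "Joomla": {"media/jui", "media/system"},
}


def identify_cms(key_info, library=FINGERPRINT_LIBRARY):
    """Return the CMS whose keywords match the most static file paths, or None."""
    hits = Counter()
    for path in key_info:
        for cms, keywords in library.items():
            if any(keyword in path for keyword in keywords):
                hits[cms] += 1
    if not hits:
        return None
    return hits.most_common(1)[0][0]


# Key information gathered from several crawled pages (made-up sample).
paths = [
    "/wp-content/themes/demo/style.css",
    "/wp-includes/js/jquery/jquery.min.js",
    "/static/img/logo.png",
]
print(identify_cms(paths))
```

Counting hits over paths from many pages is one simple way to benefit from multi-page crawling: a single hidden or renamed path on one page does not prevent a match.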
Step 400, predicting the Web server type of the target site by using the trained machine learning model based on the header information.
In this step, identification of the Web server component is based mainly on the server's response information, including error pages, response codes and the contents of the message header. One approach in the related art constructs an abnormal HTTP request for a resource that does not exist on the server and judges the Web server type from the response code, the contents of the response header and the body of the error page; this approach fails when the Server field is hidden in the response header and the error page is redirected. A second approach constructs several malformed HTTP requests and classifies the Web server type with a naive Bayes model that uses the response status codes as features; because multiple HTTP requests must be sent to the server, identification is relatively inefficient. A third approach analyzes the order of header fields in a large number of response messages and derives fixed rules for identifying the Web server type; it has a relatively high false alarm rate, identifies only the mainstream Web servers Apache, IIS and Nginx, is complex to implement, and extends poorly. The method provided by the application extracts information from the header of the response message, including the relative position information of important fields and the content information of important fields, normalizes these features, and trains a prediction model with a random forest algorithm. The trained model can accurately identify the Web server type, including when the Server field has been modified or deleted, and offers higher accuracy and better extensibility than conventional identification based on fixed rules.
Step 500, scanning the open ports of the host server and the services corresponding to the open ports using a network port scanning tool, so as to detect the host port fingerprint of the target site.
In this step, optionally, scanning of host services is based mainly on the Nmap and ZMap scanning tools. Both acquire basic information about the target host by probing its IP address; ZMap has a clear advantage in scanning speed, whereas Nmap has the advantage in scanning accuracy and comprehensiveness.
The method integrates Web server identification, CMS identification and host information detection, solves the problems of low accuracy, poor extensibility, incomplete detection and low detection efficiency in the related art, and achieves comprehensive, accurate and efficient detection of the component information of the target site. Based on the detection results, attacks targeting vulnerabilities in Web components and host services can be effectively prevented, and the security of the Web site maintained.
In some embodiments, as shown in FIG. 2 and FIG. 3, predicting the Web server type of the target site from the header information using the trained machine learning model includes:
Step 410, preprocessing the header information.
In this step, the header information of the response message is feature-normalized according to the response rule, and the text vector in the header information is converted into a numeric vector, which provides the input from which the machine learning model predicts the Web server type with the random forest algorithm.
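A minimal sketch of such a preprocessing step follows. The specification does not give the exact normalization, so everything here is an assumption: positions are scaled into [0, 1], a missing field is encoded as -1, and field contents are mapped to integer codes through a hypothetical vocabulary list. `encode_header_features` is an illustrative name.

```python
def encode_header_features(positions, contents, vocabulary):
    """Turn text-based header features into a fixed-length numeric vector.

    positions:  field name -> 0-based index in the response header, or None if absent
    contents:   field name -> raw string value, or None if absent
    vocabulary: list of known content strings; list order defines the integer codes
    """
    present = [p for p in positions.values() if p is not None]
    span = max(present) if present else 0
    vector = []
    for name in sorted(positions):  # fixed field order keeps vectors comparable
        index = positions[name]
        if index is None:
            vector.append(-1.0)  # sentinel for a missing field
        elif span == 0:
            vector.append(0.0)
        else:
            vector.append(index / span)  # scale positions into [0, 1]
    for name in sorted(contents):
        value = contents[name]
        if value is None:
            vector.append(-1.0)
        elif value in vocabulary:
            vector.append(float(vocabulary.index(value)))
        else:
            vector.append(float(len(vocabulary)))  # shared code for unseen values
    return vector


positions = {"Date": 1, "Server": 0, "Content-Type": 2, "Expires": None}
contents = {"X-Powered-By": "PHP/7.4.3"}
vocabulary = ["ASP.NET", "PHP/7.4.3"]
print(encode_header_features(positions, contents, vocabulary))
```

Encoding absent fields with a sentinel rather than dropping them keeps every vector the same length, which a tree-based classifier needs.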
Step 420, predicting the Web server type from the preprocessed header information using the machine learning model with a random forest algorithm.
In this step, the trained machine learning model applies the random forest algorithm to the numeric vector obtained from preprocessing the header information, produces a classification result, and thereby predicts the Web server type. The random forest algorithm has the following advantages:
1. It can produce high-accuracy classification results for many kinds of input variables, which helps improve the accuracy of Web server type identification.
2. It can handle a large number of input variables, so it places few constraints on the input and extends well.
3. It can evaluate the importance of each variable when determining the class, which can be used to improve the accuracy of the prediction.
4. While the forest is being built, it can produce an internal unbiased estimate of the generalization error, which helps improve the accuracy of Web server type identification.
5. It can estimate missing data and maintain accuracy even when a significant portion of the data is missing. This mitigates incomplete input variables caused by other steps and improves the accuracy of Web server type identification.
6. Its learning process is fast, which helps improve the efficiency of Web server type identification.
In some embodiments, as shown in FIG. 4 and FIG. 5, scanning the open ports of the host server and the services corresponding to the open ports using the network port scanning tool to detect the host port fingerprint of the target site includes:
Step 510, scanning the open ports and the services corresponding to the open ports using the scanning tool Nmap to generate a probe report.
In this step, Nmap is integrated into the detection system by presetting its parameters and invoking it as a separate process. When the open ports and their corresponding services need to be scanned, an execution command is generated and Nmap is called to perform the scan and produce the probe report.
Optionally, to meet the automatic scanning requirements of the method of this embodiment for the open ports, running services and operating system of the host server, the Nmap parameters are set to: nmap -sS -sV -O -T5 ip, and Nmap is integrated into the detection system by means of inter-process calls.
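As a sketch of the inter-process call, the command can be assembled as an argument list and handed to a child process. The helper names, the IPv4-only validation and the 600-second timeout are assumptions for illustration; the flags are those given above. Running the scan requires Nmap to be installed, and `-sS` and `-O` normally need elevated privileges.

```python
import ipaddress
import subprocess


def build_nmap_command(ip):
    """Build the argument list for the scan parameters given in the text."""
    ipaddress.IPv4Address(ip)  # raises ValueError on a malformed IPv4 address
    return ["nmap", "-sS", "-sV", "-O", "-T5", ip]


def run_nmap(ip, timeout=600):
    """Run Nmap as a child process and return its text report.

    The timeout value is a hypothetical default, not taken from the source.
    """
    result = subprocess.run(
        build_nmap_command(ip),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout


# Only builds the command; nothing is scanned here.
print(build_nmap_command("192.0.2.10"))
```

Passing a list instead of a shell string avoids shell interpretation of the target address, which matters when the address comes from a task queue.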
Step 520, parsing the probe report to obtain the host port fingerprint.
In this step, the probe report is parsed, and its key information is filtered and formatted to obtain the host port fingerprint.
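The parsing step might look like the following sketch, which assumes the report is Nmap's normal text output and keeps only rows for open ports. The record layout and the sample report are made up for illustration.

```python
import re

# Matches port table rows of Nmap's normal output, e.g. "80/tcp open http nginx 1.18.0".
PORT_LINE = re.compile(
    r"^(?P<port>\d+)/(?P<proto>tcp|udp)\s+(?P<state>\S+)\s+(?P<service>\S+)\s*(?P<version>.*)$"
)


def parse_probe_report(report):
    """Filter the report down to open ports and format them as dictionaries."""
    fingerprint = []
    for line in report.splitlines():
        match = PORT_LINE.match(line.strip())
        if match and match.group("state") == "open":
            fingerprint.append({
                "port": int(match.group("port")),
                "protocol": match.group("proto"),
                "service": match.group("service"),
                "version": match.group("version").strip(),
            })
    return fingerprint


# Hand-written sample in the style of an Nmap port table.
SAMPLE_REPORT = """\
PORT     STATE  SERVICE VERSION
22/tcp   open   ssh     OpenSSH 8.2p1
80/tcp   open   http    nginx 1.18.0
443/tcp  closed https
"""

for entry in parse_probe_report(SAMPLE_REPORT):
    print(entry)
```

Nmap can also emit XML with `-oX`, which is generally more robust to parse than the text table; the text form is used here only to keep the sketch short.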
In some embodiments, as shown in FIG. 6, crawling the web page source code of a plurality of web pages from the target site using the web crawler includes: crawling the web page source code from the target site with the web crawler using a breadth-first strategy.
The thread count, crawl depth and execution time are set according to the information of the task under test in the crawler task queue. A connection pool is generated from the proxy pool and the User-Agent pool, and a connection, which may be a socket connection, is selected from the connection pool. The web page source code is then crawled from the target site using a breadth-first strategy. Breadth-first (also called width-first) means that, during crawling, the search of the next level begins only after the search of the current level is complete. The algorithm is relatively simple to design and implement. Breadth-first search covers as many web pages as possible, which helps the method of this embodiment acquire the source code of a plurality of web pages. In crawling, it is generally assumed that a web page within a certain link distance of the initial URL is likely to be topically relevant. During crawling, the links found in each newly downloaded web page are appended to the end of the queue of URLs to be crawled; that is, the web crawler first crawls all web pages linked from the initial page, then selects one of those linked pages and crawls all web pages linked from it. Referring to FIG. 7, taking the web page relationship in the figure as an example, the breadth-first crawling order is: A-B-C-D-E-F-G-H-I. Optionally, breadth-first search may be combined with web page filtering: web pages are first crawled using the breadth-first strategy, and irrelevant pages are then filtered out.
Optionally, the breadth-first traversal strategy may be used to collect pages whose response code is 2xx.
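The queue discipline described above can be sketched on an in-memory link graph, so the example runs without network access. The graph `SITE` is hypothetical and is not the page relationship of FIG. 7; `max_depth` stands in for the crawl depth parameter.

```python
from collections import deque


def breadth_first_order(links, seed, max_depth=2):
    """Return pages in breadth-first order, up to max_depth links from the seed."""
    visited = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for target in links.get(page, []):
            if target not in visited:
                visited.add(target)
                queue.append((target, depth + 1))  # new links go to the end of the queue
    return order


# Hypothetical site: page -> pages it links to.
SITE = {
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "C": ["G"],
    "D": ["A", "H"],
    "E": ["I"],
}
print(breadth_first_order(SITE, "A", max_depth=3))
```

In a real crawler the dictionary lookup would be replaced by downloading the page and extracting its links; the visited set is what prevents the back-link from D to A from causing a loop.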
In some embodiments, as shown in FIG. 8 and FIG. 9, acquiring the key information of the static file paths based on the web page source code includes:
Step 110, parsing the predetermined tags out of the web page source code.
In this step, optionally, the predetermined tags may be HTML tags.
Step 120, extracting static file path information from the predetermined tags using regular expressions.
In this step, the elements are screened and static file path information is extracted from the predetermined tags using regular expressions.
Step 130, saving the static file path information as a target text, and denoising the target text.
In this step, the screened static file path information is saved as a target text, and irrelevant character strings in the target text are removed by a filter, thereby denoising the target text.
Step 140, performing text slicing on the denoised target text to extract the key information.
In this step, the denoised target text is sliced into a plurality of slices, and the key information of the static file paths is located in and extracted from those slices, providing the basis for the detection in step 300.
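Steps 110 to 140 can be sketched end to end as follows. The choice of tags (`link`, `script`, `img`), the file extensions, the denoising rule (strip scheme, host and query string) and the slicing rule (keep the directory part of each path) are all assumptions, since the specification does not fix them. The sketch uses the standard library's HTML parser to locate tags and a regular expression for the paths.

```python
import re
from html.parser import HTMLParser

# A path ending in a static file extension, optionally followed by a query or fragment.
STATIC_PATH = re.compile(
    r"[^\s\"'?#]+\.(?:css|js|png|jpg|gif|ico)(?=$|[?#])", re.IGNORECASE
)


class StaticPathParser(HTMLParser):
    """Collect static file paths from the src/href attributes of selected tags."""

    TAGS = {"link", "script", "img"}  # assumed set of predetermined tags

    def __init__(self):
        super().__init__()
        self.paths = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.TAGS:
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                match = STATIC_PATH.match(value)
                if match:
                    self.paths.append(match.group(0))


def extract_key_info(html):
    parser = StaticPathParser()
    parser.feed(html)
    key_info = []
    for path in parser.paths:
        # Denoise: drop scheme and host so only the site-relative path remains.
        path = re.sub(r"^(?:https?:)?//[^/]+", "", path)
        # Slice: split into segments and keep the directory part as key information.
        segments = [s for s in path.split("/") if s][:-1]
        if segments:
            key_info.append("/".join(segments))
    return key_info


SAMPLE_HTML = """
<link rel="stylesheet" href="/wp-content/themes/demo/style.css?ver=5.8">
<script src="https://example.com/wp-includes/js/jquery.min.js"></script>
<a href="/about.html">About</a>
<img src="logo.png">
"""
print(extract_key_info(SAMPLE_HTML))
```

Dropping the file name and query string removes the parts of a path that vary most between sites, leaving the directory layout that tends to be characteristic of a given CMS.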
In some embodiments, as shown in FIG. 10 and FIG. 6, sending, by the web crawler, the predefined HTTP request to the host server of the target site and acquiring the header information of the response message of the host server includes:
Step 210, sending the HTTP request to the host server by the web crawler.
In this step, a custom HTTP request is constructed according to the parameter information of the task under test in the crawler task queue, and the web crawler then sends the custom HTTP request to the host server.
Step 220, acquiring the response message of the host server to the HTTP request.
In this step, the host server receives the HTTP request and returns a response message.
Step 230, extracting the relative position information of the first predetermined fields and the content information of the second predetermined fields from the header of the response message as the header information.
The HTTP request includes "GET /404page.html HTTP/1.1\r\n\r\n".
The first predetermined fields include the "Date" field, the "Server" field, the "Content-Type" field, the "Content-Length" field, the "Connection" field, and the "Expires" field.
The second predetermined fields include the "Content-Length" field and the "X-Powered-By" field.
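Step 230 might be sketched as below, working on a raw response header that has already been received. The code assumes the conventional header name `X-Powered-By` for the second field list, treats "relative position" as the 0-based index of a field among the header lines, and matches field names case-insensitively; `extract_header_info` and the sample header are illustrative.

```python
FIRST_FIELDS = ["Date", "Server", "Content-Type", "Content-Length", "Connection", "Expires"]
SECOND_FIELDS = ["Content-Length", "X-Powered-By"]


def extract_header_info(raw_header):
    """Return (positions, contents) for the predetermined fields of a raw response header."""
    lines = raw_header.split("\r\n")[1:]  # skip the status line
    names, values = [], {}
    for line in lines:
        if ":" not in line:
            continue  # blank line that terminates the header block
        name, value = line.split(":", 1)
        key = name.strip().lower()
        names.append(key)
        values[key] = value.strip()
    positions = {
        field: names.index(field.lower()) if field.lower() in names else None
        for field in FIRST_FIELDS
    }
    contents = {field: values.get(field.lower()) for field in SECOND_FIELDS}
    return positions, contents


# Made-up response to the 404 probe request.
SAMPLE_HEADER = (
    "HTTP/1.1 404 Not Found\r\n"
    "Server: nginx\r\n"
    "Date: Mon, 01 Jan 2024 00:00:00 GMT\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 153\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
)
positions, contents = extract_header_info(SAMPLE_HEADER)
print(positions)
print(contents)
```

Because the field order is recorded separately from the field values, the positions of the remaining fields still carry a signal when the `Server` value is rewritten or the field is removed.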
In some embodiments, the Web fingerprint detection method provided by the embodiment of the present application further includes: writing the detection results for the CMS type, the Web server type and the host port fingerprint, which together form the Web fingerprint of the target site, into a Remote Dictionary Server (Redis) queue.
The List data structure of Redis is used as a message queue to carry out data transmission and reduce coupling within the system. Because Redis is an in-memory database, data is read and written in memory, which improves the efficiency of data transmission between modules.
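The list-as-queue pattern can be sketched as follows. So that the example runs without a Redis server, it uses a small in-memory stand-in that mimics only the `lpush` and `rpop` calls; a redis-py client offers methods with the same names and could be passed in instead. The queue key and the result layout are hypothetical.

```python
import json
from collections import deque


class InMemoryListStore:
    """Stand-in exposing the lpush/rpop subset of a Redis client, for local runs only."""

    def __init__(self):
        self._lists = {}

    def lpush(self, name, *values):
        queue = self._lists.setdefault(name, deque())
        for value in values:
            queue.appendleft(value)
        return len(queue)

    def rpop(self, name):
        queue = self._lists.get(name)
        return queue.pop() if queue else None


QUEUE_NAME = "web_fingerprint_results"  # hypothetical key name


def publish_result(client, result):
    """Producer side: push a detection result onto the head of the list."""
    client.lpush(QUEUE_NAME, json.dumps(result))


def consume_result(client):
    """Consumer side: pop the oldest result from the tail, or None if empty."""
    payload = client.rpop(QUEUE_NAME)
    return json.loads(payload) if payload is not None else None


client = InMemoryListStore()
publish_result(client, {"site": "example.com", "cms": "WordPress"})
publish_result(client, {"site": "example.com", "web_server": "nginx"})
print(consume_result(client))
```

Pushing at the head and popping at the tail gives first-in, first-out delivery, so results are consumed in the order the detection sub-modules produced them.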
It should be noted that, the method of the embodiment of the present application may be performed by a single device, for example, a computer or a server. The method of the embodiment can also be applied to a distributed scene, and is completed by mutually matching a plurality of devices. In the case of such a distributed scenario, one of the devices may perform only one or more steps of the method of an embodiment of the present application, the devices interacting with each other to accomplish the method.
It should be noted that the foregoing describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Based on the same inventive concept, the application also provides a Web fingerprint detection device corresponding to the method of any embodiment.
Referring to fig. 11, the Web fingerprint detection device may include:
The crawling module 10 is configured to: crawl the web page source code of a plurality of web pages from a target site using a web crawler, and acquire key information of static file paths based on the web page source code; and send, by the web crawler, a predefined HTTP request to the host server of the target site to acquire the header information of the response message of the host server.
The crawling process of the web crawler is as follows: a breadth-first strategy is adopted to collect as much static file path information as possible; the page source code is downloaded starting from the seed URL; static file paths and URL links are extracted; off-site links are filtered out by regular expression matching; URLs belonging to the site are added to the seed URL queue; and the process is repeated until a preset threshold number of static file paths has been collected.
Further, to improve crawling efficiency, three parameters are exposed as external settings: the thread count, the crawl depth and the maximum execution time. To cope with the anti-crawling policy of the target website, the access IP, access frequency and browser type are set according to the responses of the target host server, and pages that fail to download are discarded to improve the overall efficiency of the system.
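The off-site link filter in this process can be illustrated with a short sketch. The text describes regular expression matching; the sketch instead compares hosts with `urllib.parse`, a substitution made here for brevity, and `same_site_links` is a hypothetical helper name.

```python
from urllib.parse import urljoin, urlparse


def same_site_links(page_url, hrefs):
    """Resolve hrefs against the page URL and keep only links on the same host."""
    site = urlparse(page_url).netloc
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == site:
            kept.append(absolute)
    return kept


links = same_site_links(
    "https://example.com/news/",
    ["/about", "item.html", "https://other.example.org/x", "mailto:a@example.com"],
)
print(links)
```

Resolving relative links before comparing hosts matters because most in-site links are written without a scheme or host.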
The Web fingerprint detection module 20 is configured to: the CMS type of the target site is identified by matching the key information with the Web fingerprint library; based on the head information, predicting the Web server type of the target site by using a trained machine learning model; and scanning the open port of the host server and the service corresponding to the open port by using a network connection end scanning tool, and detecting the host port fingerprint of the target site.
A system memory module 30 configured to: storing user data, storing crawling information, and storing scan result data.
The system storage module 30 uses two storage management modes, namely a relational database MySQL and a memory database Redis. And a List data structure of Redis is used as a message queue to finish data transmission among modules, so that the coupling among modules in the system is reduced. In addition, because Redis is a memory type database, the access to the data is operated in the memory, the data transmission efficiency between modules is improved, and the data stored by Redis mainly comprises main task basic information, crawler intermediate data, detection subtask creation information and detection report information. And using MySQL to store the data in a lasting mode, wherein the stored data comprises user information, login information, web fingerprint information, task information and detection report information.
A task scheduling module 40 configured to: scheduling the execution of the crawling module 10, scheduling the execution of the Web fingerprint detection module 20, and scheduling the execution of the system storage module 30.
The task scheduling module 40 mainly performs unified management and scheduling of each module, decouples the whole system, realizes the design concept of high cohesion and low coupling, and improves the reusability and expandability of the program modules. The core functions of the system are completed through the coordination of the crawling module 10, the Web fingerprint detection module 20 and the system storage module 30, so the execution logic and data transmission among the modules need to be scheduled and managed. For the management of execution logic among the modules, the system uses the Quartz task scheduling framework, which not only schedules tasks but also reduces the complexity of the service logic and improves the fault tolerance of the system. In addition, Redis is used as a message queue to decouple the whole system in a producer-consumer mode: the modules do not call each other directly but communicate through message queues. The intermediate data exchanged among the modules mainly comprises basic information of detection tasks, intermediate data collected by the crawler and detection results.
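The producer-consumer communication described above can be illustrated with a short sketch. To keep it runnable without a Redis server, the class InMemoryListQueue is a hypothetical in-memory stand-in that only imitates the LPUSH and RPOP commands of a Redis List; the queue name intermediate_data and the message fields are likewise assumptions made for illustration.

```python
import json
from collections import deque


class InMemoryListQueue:
    """Stand-in for a Redis List used as a FIFO queue (LPUSH adds, RPOP takes)."""

    def __init__(self):
        self._lists = {}

    def lpush(self, name, value):
        self._lists.setdefault(name, deque()).appendleft(value)

    def rpop(self, name):
        items = self._lists.get(name)
        return items.pop() if items else None


def produce_crawl_result(queue, task_id, static_paths, headers):
    # Producer side: the crawling module publishes its intermediate data.
    message = {"task_id": task_id, "static_paths": static_paths, "headers": headers}
    queue.lpush("intermediate_data", json.dumps(message))


def consume_crawl_result(queue):
    # Consumer side: the detection module takes the oldest pending message.
    raw = queue.rpop("intermediate_data")
    return json.loads(raw) if raw is not None else None
```

Because the producer and the consumer only share the queue name and the message format, either side can be replaced or scaled without changing the other, which is the decoupling the paragraph above refers to.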
A user interaction module 50 configured to: user management, detection task management and detection result display.
The user interaction module 50 is configured to provide various interaction functions for a user at a front-end interface, and mainly includes a user management function, a detection task management function, and a detection result display function.
The Web fingerprint detection module 20 is a core module of the system, and includes a CMS system type detection sub-module, a host port information detection sub-module, and a Web server type detection sub-module. As shown in fig. 12, the Web fingerprint detection module 20 specifically executes the following:
(1) The CMS system type detection submodule extracts static path key information from the Redis intermediate data queue, and then matches the static path key information in the Web fingerprint library to identify the CMS system of the target site.
(2) The host port information detection sub-module extracts the seed URL and the IP address from the Redis intermediate data queue, calls Nmap to detect, and identifies the port fingerprint of the host server.
(3) The Web server type detection sub-module extracts message response header information from the Redis intermediate data queue, performs data preprocessing, and uses a random forest detection model to identify the type of the Web server.
(4) Integrating the detection results of (1) - (3) and writing into a Redis result queue.
(5) Writing the data in the Redis result queue that needs to be persisted into the database.
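Step (1) above, matching static path key information against the Web fingerprint library, can be sketched as follows. This is only an illustration under assumptions: the library contents in FINGERPRINTS, the token-overlap scoring rule and the helper names slice_path and identify_cms are hypothetical, since the source does not specify the library format or the matching rule.

```python
import re

# Hypothetical fingerprint library: CMS name -> characteristic path tokens.
FINGERPRINTS = {
    "WordPress": {"wp-content", "wp-includes"},
    "Drupal": {"drupal.js", "sites"},
}


def slice_path(path):
    """Denoise a static file path and slice it into lower-case key tokens."""
    path = path.split("?", 1)[0].split("#", 1)[0]  # drop query string and fragment
    path = re.sub(r"^[a-z]+://[^/]+", "", path, flags=re.I)  # drop scheme and host
    return [token.lower() for token in path.split("/") if token]


def identify_cms(static_paths, library=FINGERPRINTS):
    """Return the CMS whose fingerprint tokens overlap most, or None if none match."""
    tokens = set()
    for path in static_paths:
        tokens.update(slice_path(path))
    scores = {cms: len(keys & tokens) for cms, keys in library.items()}
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None
```

For example, a site whose pages reference /wp-content/themes/a/style.css?ver=5.8 would be reported as WordPress under this illustrative library, while a site with no matching tokens yields None.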
In some embodiments, based on the above modules, the Web fingerprint detection method is summarized in that the crawling module 10 is responsible for collecting information of a target site, including key information of a static file path and header information of a response message, and preprocessing the obtained information as input of the Web fingerprint detection module 20; the task scheduling module 40 is responsible for coordinating the execution sequence of other modules, and mainly comprises a crawler task execution, a fingerprint detection task execution and a result storage task; the system memory module 30 is responsible for storing user data and data or results generated during execution of other modules; the user interaction module 50 provides various interaction functions for the user at the front end interface, and mainly comprises the functions of user management, detection task management, detection result display and the like; the Web fingerprint detection module 20 is a specific implementation of a Web component detection technology, and mainly comprises Web server component type detection, CMS system type detection and host port fingerprint detection.
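The preprocessing of the response message header can be illustrated with a small feature-extraction sketch. It follows the fields named in claims 4 and 5 (relative position of a first group of fields, content of a second group), with the standard header spellings Expires and X-Powered-By assumed. The function name header_features, the feature key names and the use of -1 for an absent field are assumptions for illustration; the trained random forest model itself is not shown.

```python
ORDER_FIELDS = ["Date", "Server", "Content-Type", "Content-Length", "Connection", "Expires"]
CONTENT_FIELDS = ["Content-Length", "X-Powered-By"]


def header_features(raw_response):
    """Turn a raw HTTP response into position and content features."""
    head = raw_response.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")[1:]  # skip the status line
    names, values = [], {}
    for line in lines:
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        key = name.strip().lower()  # header names are case-insensitive
        names.append(key)
        values[key] = value.strip()
    features = {}
    for field in ORDER_FIELDS:
        key = field.lower()
        # Index among the header lines; -1 encodes "absent" (an assumption).
        features["pos_" + field] = names.index(key) if key in names else -1
    for field in CONTENT_FIELDS:
        features["val_" + field] = values.get(field.lower(), "")
    return features
```

The ordering of header fields tends to differ between server implementations, which is why position features of this kind can help a classifier even when the Server field is missing or altered.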
In order to better implement the present invention, further, the crawling module 10 aims to acquire the key information of the static file path of the target website and the text information of the response message header, and to provide analysis data for the Web fingerprint detection module 20. In order to collect as much static file path information as possible, the crawling module 10 employs a breadth-first strategy. To improve crawler efficiency and acquire as much static file path information as possible in a short time, the acquisition of the static file path key information exposes three parameters, namely thread number, crawler depth and maximum execution time, which a user can set according to their own requirements.
In order to better implement the present invention, further, the task scheduling module 40 is responsible for coordinating the execution of the three core modules, namely, the crawling module 10, the Web fingerprint detection module 20 and the system storage module 30, and performing scheduling management on execution logic and data transfer among the modules.
In order to better implement the present invention, further, the system storage module 30 is responsible for storing user data and data or results generated during the execution of other modules, including basic parameter information of tasks, path information of static files of websites and text information of the response message header acquired by the crawling module 10, and result information, user information and the like executed by the Web fingerprint detection module 20.
In order to better implement the present invention, further, the user interaction module 50 provides various interaction functions for the user at the front-end interface. The user management function divides the operation rights of users according to their roles; the task management function supports creating, deleting, suspending, starting and other operations on tasks; the Web fingerprint function allows a system administrator user to add Web fingerprints to the Web fingerprint library.
In some embodiments, as shown in fig. 13, a detection process is illustrated as follows:
(1) The user configures scan-related parameters at the front end of the Web fingerprint detection device via the user interaction module 50, including the URL or IP address of the target site, the number of crawler threads, task limits, the maximum scan time, etc. A task is created according to the parameters set by the user, the parameter information of the task is stored into the system storage module 30, and the task is marked as to-be-detected to obtain the task to be detected.
(2) The task scheduling module 40 monitors the tasks to be detected in the database, serializes the basic information of the tasks into a crawler task queue, and delivers the crawler task queue to the crawling module 10 for processing.
(3) The crawling module 10 takes task information out of the crawler task queue, collects target site information, preprocesses the collected information, stores the processed data into an intermediate data queue, and enqueues the intermediate data queue into Redis of the system storage module 30.
(4) After the task scheduling module 40 detects that the system storage module 30 has stored the intermediate data queue, it schedules the system storage module 30 to store the preprocessed data into the database of the system storage module 30, and issues a detection task queue to the Web fingerprint detection module 20.
(5) The Web fingerprint detection module 20 monitors the detection task queue at regular time, when there is a task to be detected, invokes the CMS system type detection submodule, the host port information detection submodule and the Web server type detection submodule respectively to detect the CMS system type, the host port fingerprint and the Web server type, and stores detection result data into the detection result queue.
(6) The task scheduling module 40 stores the detection result into the system storage module 30 after detecting the task information of the detection result queue, and sets the state of the task in the system storage module 30 to be completed.
(7) The user queries the detection result from the database of the system storage module 30 by using the user interaction module 50 through the front-end interface, and displays the detection result on the Web interface.
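The host port fingerprint detection invoked in step (5) relies on parsing the report produced by Nmap. A minimal sketch of that parsing step is given below, assuming the report was written in Nmap's XML output format (the -oX option) with service detection enabled (the -sV option). The function name parse_nmap_xml, the record fields and the trimmed SAMPLE_REPORT are illustrative assumptions; a real report contains many more elements.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative report; real Nmap XML output carries more detail.
SAMPLE_REPORT = """<nmaprun>
  <host>
    <ports>
      <port protocol="tcp" portid="80">
        <state state="open"/>
        <service name="http" product="nginx"/>
      </port>
      <port protocol="tcp" portid="23">
        <state state="closed"/>
      </port>
    </ports>
  </host>
</nmaprun>"""


def parse_nmap_xml(xml_text):
    """Return one record per open port found in an Nmap XML report."""
    root = ET.fromstring(xml_text)
    records = []
    for port in root.iter("port"):
        state = port.find("state")
        if state is None or state.get("state") != "open":
            continue  # only open ports contribute to the fingerprint
        service = port.find("service")
        records.append({
            "port": int(port.get("portid")),
            "protocol": port.get("protocol"),
            "service": service.get("name") if service is not None else None,
            "product": service.get("product") if service is not None else None,
        })
    return records
```

The resulting list of port, protocol and service records is the kind of structure that could be merged with the CMS type and Web server type before being written to the detection result queue.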
For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, the functions of each module may be implemented in the same piece or pieces of software and/or hardware when implementing the present application.
The device of the foregoing embodiment is configured to implement the corresponding Web fingerprint detection method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Based on the same inventive concept, the application also provides an electronic device corresponding to the method of any embodiment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the Web fingerprint detection method of any embodiment when executing the program.
Fig. 14 shows a more specific hardware architecture of an electronic device according to this embodiment, where the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 implement communication connections therebetween within the device via a bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, etc. for executing related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The memory 1020 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage, dynamic storage, etc. Memory 1020 may store an operating system and other application programs, and when the embodiments of the present specification are implemented in software or firmware, the associated program code is stored in memory 1020 and executed by processor 1010.
The input/output interface 1030 is used to connect with an input/output module for inputting and outputting information. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
Communication interface 1040 is used to connect communication modules (not shown) to enable communication interactions of the present device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
Bus 1050 includes a path for transferring information between components of the device (e.g., processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).
It should be noted that although the above-described device only shows processor 1010, memory 1020, input/output interface 1030, communication interface 1040, and bus 1050, in an implementation, the device may include other components necessary to achieve proper operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.
The electronic device of the foregoing embodiment is configured to implement the corresponding Web fingerprint detection method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Based on the same inventive concept, the present application also provides a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the Web fingerprint detection method according to any of the above embodiments, corresponding to the method according to any of the above embodiments.
The computer readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may be used to implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device.
The storage medium of the foregoing embodiment stores computer instructions for causing the computer to execute the Web fingerprint detection method according to any one of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein.
Those of ordinary skill in the art will appreciate that: the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the application (including the claims) is limited to these examples; the technical features of the above embodiments or in the different embodiments may also be combined within the idea of the application, the steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the application as described above, which are not provided in detail for the sake of brevity.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure the embodiments of the present application. Furthermore, the devices may be shown in block diagram form in order to avoid obscuring the embodiments of the present application, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the present application are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While the application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, and the like, which are within the spirit and principles of the embodiments of the application, are intended to be included within the scope of the application.

Claims (8)

1. A Web fingerprint detection method, comprising:
Crawling source codes of a plurality of webpages from a target site by utilizing a web crawler, and acquiring key information of a static file path based on the source codes;
the crawling, by the web crawler, the source codes of the plurality of web pages from the target site includes: crawling the source codes from the target site by using the web crawler and adopting a breadth-first strategy;
Wherein, the obtaining the key information of the static file path based on the source code includes: analyzing a preset label from the source code; extracting static file path information from the predetermined tag using a regular expression; storing the static file path information as a target text, and denoising the target text; performing text slicing processing on the target text subjected to denoising processing to extract the key information;
The web crawler sends a predefined HTTP request to a host server of the target site to acquire the head information of a response message of the host server;
Identifying the CMS type of the target site by matching the key information with a Web fingerprint library;
Predicting a Web server type of the target site using a trained machine learning model based on the header information;
And scanning an open port of the host server and a service corresponding to the open port by using a network connection end scanning tool, and detecting the host port fingerprint of the target site.
2. The method of claim 1, wherein the predicting the Web server type of the target site using a trained machine learning model based on the header information comprises:
Preprocessing the head information;
Based on the preprocessed header information, predicting the Web server type by using the machine learning model through a random forest algorithm.
3. The method of claim 1, wherein the detecting the host port fingerprint of the target site by scanning an open port of the host server and a service corresponding to the open port using a network connection end scanning tool comprises:
Generating a probe report by scanning the open port and the service corresponding to the open port by using a scanning tool Nmap;
and analyzing the detection report to obtain the fingerprint of the host port.
4. The method of claim 1, wherein the obtaining header information of the response message of the host server by sending a predefined HTTP request to the host server of the target site by the web crawler includes:
sending, by the web crawler, the HTTP request to the host server;
acquiring a response message of the host server to the HTTP request;
and extracting the relative position information of the first predetermined field and the content information of the second predetermined field from the head of the response message as the head information.
5. The method of claim 4, wherein,
The HTTP request includes "GET /404page.html HTTP/1.1\r\n\r\n";
the first predetermined field comprises a "Date" field, a "Server" field, a "Content-Type" field, a "Content-Length" field, a "Connection" field and an "Expires" field;
the second predetermined field includes a "Content-Length" field and an "X-Powered-By" field.
6. The method of any one of claims 1 to 5, further comprising:
and writing the CMS type, the Web server type and the host port fingerprint into a remote dictionary service Redis queue as detection results of the Web fingerprint of the target site.
7. A Web fingerprint detection device, comprising:
A crawling module configured to: crawling source codes of a plurality of webpages from a target site by utilizing a web crawler, and acquiring key information of a static file path based on the source codes; the web crawler sends a predefined HTTP request to a host server of the target site to acquire the head information of a response message of the host server;
the crawling, by the web crawler, the source codes of the plurality of web pages from the target site includes: crawling the source codes from the target site by using the web crawler and adopting a breadth-first strategy;
Wherein, the obtaining the key information of the static file path based on the source code includes: analyzing a preset label from the source code; extracting static file path information from the predetermined tag using a regular expression; storing the static file path information as a target text, and denoising the target text; performing text slicing processing on the target text subjected to denoising processing to extract the key information;
the Web fingerprint detection module is configured to: the CMS type of the target site is identified by matching the key information with a Web fingerprint library; predicting a Web server type of the target site using a trained machine learning model based on the header information; and scanning an open port of the host server and a service corresponding to the open port by using a network connection end scanning tool, and detecting the host port fingerprint of the target site.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executed by the processor, wherein the processor implements the method according to any one of claims 1 to 6 when the computer program is executed.
CN202111681406.7A 2021-12-31 2021-12-31 Web fingerprint detection method and related equipment Active CN114528457B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111681406.7A CN114528457B (en) 2021-12-31 2021-12-31 Web fingerprint detection method and related equipment

Publications (2)

Publication Number Publication Date
CN114528457A CN114528457A (en) 2022-05-24
CN114528457B true CN114528457B (en) 2024-09-17

Family

ID=81621061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111681406.7A Active CN114528457B (en) 2021-12-31 2021-12-31 Web fingerprint detection method and related equipment

Country Status (1)

Country Link
CN (1) CN114528457B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115941280B (en) * 2022-11-10 2024-01-26 北京源堡科技有限公司 Penetration method, device, equipment and medium based on web fingerprint information
CN116304901B (en) * 2023-02-01 2024-01-30 北京市燃气集团有限责任公司 Webpage server fingerprint identification method, device, equipment and storage medium
CN116127236B (en) * 2023-04-19 2023-07-21 远江盛邦(北京)网络安全科技股份有限公司 Web page web component identification method and device based on parallel structure
JP7344614B1 (en) * 2023-05-08 2023-09-14 株式会社エーアイセキュリティラボ Systems, methods, and programs for testing website vulnerabilities
CN116702027B (en) * 2023-05-26 2026-03-10 云盾智慧安全科技有限公司 Similar website classification method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107395651A (en) * 2017-09-07 2017-11-24 赛尔网络有限公司 Service system and information processing method
CN108429747A (en) * 2018-03-08 2018-08-21 国家计算机网络与信息安全管理中心 A kind of extensive Web server information collecting method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104766014B (en) * 2015-04-30 2017-12-01 安一恒通(北京)科技有限公司 Method and system for detecting malicious web addresses
CN108628722A (en) * 2018-05-11 2018-10-09 华中科技大学 A kind of distributed Web Component services detection system
CN112182587A (en) * 2020-09-30 2021-01-05 中南大学 Web vulnerability scanning method, system, device, storage medium and computer equipment

Also Published As

Publication number Publication date
CN114528457A (en) 2022-05-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant