CN118446297A

CN118446297A - Patent theme map generation method and system based on deep semantic hierarchical clustering

Info

Publication number: CN118446297A
Application number: CN202410621253.4A
Authority: CN
Inventors: 丁恒; 曹高辉
Original assignee: Central China Normal University
Current assignee: Central China Normal University
Priority date: 2024-05-20
Filing date: 2024-05-20
Publication date: 2024-08-06

Abstract

The present invention discloses a method and system for generating a patent topic map based on deep semantic hierarchical clustering, including: obtaining a patent document set PT for which a patent topic map is to be generated; using a patent deep semantic representation model to semantically encode each patent document in the patent document set PT to obtain a semantic representation vector matrix V = {v_1, …, v_N}; inputting the semantic representation vector matrix V into a hierarchical clustering algorithm to obtain a hierarchical clustering tree structure corresponding to the patent document set PT; generating a corresponding topic description for each non-leaf node on the hierarchical clustering tree structure; and merging the generated topic descriptions and the hierarchical clustering tree structure into a patent topic map with a hierarchical structure. The patent topic map generation method and system of the present application can help users quickly mine the hierarchical relationships between patent documents at the topic level, meet users' needs for analyzing large-scale patent information, and improve the efficiency of patent analysis.

Description

Patent theme map generation method and system based on deep semantic hierarchical clustering

Technical Field

The invention relates to the technical field of information, in particular to a method and a system for generating a patent theme map based on deep semantic hierarchical clustering.

Background

Patent literature is a highly valuable store of knowledge. Effective patent analysis makes it possible to anticipate technology trends, identify research hot spots in different fields, obtain competitive intelligence, and understand the technical capabilities and innovation directions of competitors. It can also be used to evaluate the commercial value and market potential of a technology, to discover new technical solutions and innovative ideas, and to help enterprises formulate intellectual property and risk management strategies. However, current patent topic analysis methods still rely heavily on manual interpretation of patents. Automatic, hierarchical means of patent topic analysis are lacking, in particular a patent topic map generation technology capable of revealing the upper-lower relationships among patent topics.

Disclosure of Invention

In view of the defects of the prior art or the need to improve it, some embodiments of the invention provide a method and a system for generating a patent topic map based on deep semantic hierarchical clustering, which can help a user quickly mine the upper-lower hierarchical relationships of patent documents at the topic level, meet the user's needs for analyzing large-scale patent information, and improve the efficiency of the user's patent analysis.

The technical scheme adopted by the invention for solving the technical problems is as follows:

in some embodiments, a patent topic map generation method based on deep semantic hierarchical clustering is provided, the method comprising:

Obtaining a patent document set PT = {P_1, P_2, …, P_N}, wherein P_i (i = 1, 2, …, N) represents one patent document in the patent document set PT, and N is a natural number;

Using a patent deep semantic representation model to semantically encode each patent document in the patent document set PT, encoding P_i into a deep semantic representation vector v_i to obtain a semantic representation vector matrix V = {v_1, …, v_N};

Inputting the semantic representation vector matrix V of the patent document set PT = {P_1, P_2, …, P_N} into a hierarchical clustering algorithm to obtain a hierarchical clustering tree structure corresponding to the patent document set PT, wherein the hierarchical clustering tree structure comprises non-leaf nodes and a root node, and each patent document in the patent document set PT is connected to a non-leaf node on the hierarchical clustering tree structure;

Generating a corresponding topic description for each non-leaf node on the hierarchical clustering tree structure;

Merging the generated topic descriptions and the hierarchical clustering tree structure into a patent topic map with an upper-lower hierarchical structure.

In some embodiments, the patent deep semantic representation model is trained using the following method:

Obtaining a seed patent document set PS = {P_1, P_2, …, P_M}, wherein P_x (x = 1, 2, …, M) represents one patent document in the seed patent document set, and M is a natural number;

Extracting, from each patent document in the seed patent document set PS, its application number and the citing and cited relations among patents, constructing a patent citation relation network, and storing the patent citation relation network in EdgeList format;

Performing word segmentation preprocessing on the title, abstract and claim text of each node patent in the patent citation relation network using a word segmentation algorithm;

Constructing a triplet margin loss function $\mathrm{Loss} = \max\big(d(P_t, P_m) - d(P_t, P_n) + k,\ 0\big)$ as the objective function based on the citing and cited relations between patents, wherein P_t represents any patent document in the seed patent document set PS, P_m represents a patent cited by or citing P_t, P_n is a patent that neither cites nor is cited by P_t, d(·,·) is an L2-norm distance function with $d(P_t, P_m) = \lVert v_t - v_m \rVert^2$ and $d(P_t, P_n) = \lVert v_t - v_n \rVert^2$, v_t, v_m and v_n are the semantic representation vectors of patents P_t, P_m and P_n output by the Transformer neural network, respectively, and k is a hyperparameter;

For any patent document P_t in the seed patent document set PS, extracting Q pairs (P_m, P_n) from the patent citation relation network and combining them with P_t to form Q triplets for training a Transformer neural network encoder, wherein Q is a natural number not less than 5;

Training the Transformer neural network based on the triplet data and the objective function to obtain the patent deep semantic representation model.

In some embodiments, the hierarchical clustering algorithm is the HDBSCAN clustering algorithm.

In some embodiments, the generating a corresponding topic description for each non-leaf node on the hierarchical cluster tree structure includes:

Calculating the distances from all non-leaf nodes to the root node, and sorting the nodes by distance in descending order; each non-leaf node C_j includes a patent document cluster PC_j, which is composed of a plurality of patent documents with similar topics, that is, PC_j = {P_i, …, P_j}, where P_i, …, P_j are patent documents in the patent document set PT;

For a non-leaf node C_j, if C_j has no descendant non-leaf node, the titles, abstracts and main claims of all patent documents in PC_j = {P_i, …, P_j} are concatenated into a text segment Text(PC_j), and the topic description corresponding to the non-leaf node C_j is automatically generated from Text(PC_j) using an encoder-decoder text generation model, where Text(PC_j) represents the title, abstract and main claim text of all patent documents in PC_j;

If C_j has descendant non-leaf nodes {C'_l, …, C'_m}, the titles, abstracts and main claims of all patent documents in PC_j = {P_i, …, P_j} are concatenated into a text segment Text({P_i, …, P_j}), the topic descriptions corresponding to the descendant non-leaf nodes {C'_l, …, C'_m} of C_j are concatenated into a second text segment, and the topic description corresponding to C_j is then generated from these text segments using the encoder-decoder text generation model.

In some embodiments, the seed patent document set PS = {P_1, P_2, …, P_M} is obtained from the USPTO database; the patent citation relation network is constructed by using the parse-uspto-xml tool to extract, from each seed patent document P_x, its application number and the citing and cited relations among patents, taking the patents as nodes and the citing and cited relations among the patents as edges.

In some embodiments, the word segmentation algorithm is a WordPiece word-level word segmentation algorithm.

In some embodiments, there is also provided a patent topic map generation system based on deep semantic hierarchical clustering, the patent topic map generation system comprising:

a target patent document set acquisition module, configured to acquire a patent document set PT = {P_1, P_2, …, P_N}, where P_i (i = 1, 2, …, N) represents one patent document in the patent document set PT, and N is a natural number;

a semantic representation vector matrix generation module, configured to semantically encode each patent document in the patent document set PT using the patent deep semantic representation model, encoding P_i into a deep semantic representation vector v_i to obtain a semantic representation vector matrix V = {v_1, …, v_N};

a hierarchical clustering tree structure generation module, configured to input the semantic representation vector matrix V of the patent document set PT = {P_1, P_2, …, P_N} into a hierarchical clustering algorithm to obtain a hierarchical clustering tree structure corresponding to the patent document set PT, wherein the hierarchical clustering tree structure comprises non-leaf nodes and a root node, and each patent document in the patent document set PT is connected to a non-leaf node on the hierarchical clustering tree structure;

a topic description generation module, configured to generate a corresponding topic description for each non-leaf node on the hierarchical clustering tree structure;

a patent topic map generation module, configured to merge the generated topic descriptions and the hierarchical clustering tree structure into a patent topic map with an upper-lower hierarchical structure.

In some embodiments, the patent topic map generating system further includes a display module, where the display module is configured to display the patent topic map and the corresponding patent document class clusters on a display interface.

In some embodiments, there is also provided an electronic device including:

A processor;

a memory storing processor-executable instructions, wherein:

the processor reads instructions from the memory to implement the steps of the method according to any of the embodiments described above.

In some embodiments, there is also provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method of any of the embodiments described above.

Compared with the prior art, the application at least comprises the following advantages: in the embodiment of the application, the corresponding topic description is generated through each non-leaf node on the hierarchical clustering tree structure, the patent topic map can be automatically generated, and the patent topic description has a hierarchical structure with upper and lower association. In some embodiments of the application, the semantic representation vector matrix is generated through the trained patent depth semantic representation model, and the training model is trained by adopting real data in a patent database and reference relations among patents, so that the accuracy is higher. In some embodiments of the application, the topic description with the offspring non-leaf nodes and the topic description without the offspring non-leaf nodes are generated in different modes, so that the layering property and the accuracy of the topic description are further improved. The display mode and the display module provided in some embodiments of the present application further enhance the interactive visual experience and the operation efficiency. Of course, the advantages of the present application are not limited to the above list. The patent topic map generation method and system based on deep semantic hierarchical clustering can meet the analysis requirement of users on large-scale patent information, and improve the efficiency of patent analysis of the users.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments disclosed in the present specification, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only examples of the embodiments disclosed in the present specification, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic flow chart of a patent topic map generation method based on deep semantic hierarchical clustering according to some embodiments of the present invention.

FIG. 2 is a flow chart of a training method of a patent deep semantic representation model according to some embodiments of the invention.

FIG. 3 is a flow chart of a method of generating a corresponding topic description for each non-leaf node on a hierarchical cluster tree structure in accordance with some embodiments of the invention.

Fig. 4 is a schematic structural diagram of a patent topic map generation system based on deep semantic hierarchical clustering according to some embodiments of the present invention.

Fig. 5 is a schematic diagram of a patent document collection stored in EdgeList format according to some embodiments of the present invention.

Fig. 6 is a schematic diagram of word segmentation preprocessing performed by the word segmentation algorithm according to some embodiments of the present invention.

FIG. 7 is a schematic diagram of a semantic representation vector matrix obtained according to some embodiments of the present invention.

FIG. 8 is a schematic diagram of a hierarchical cluster tree structure generated in accordance with some embodiments of the invention.

FIG. 9 is a schematic illustration of a patent topic map generated in accordance with some embodiments of the invention.

Fig. 10 is a schematic structural diagram of a patent topic map generation system based on deep semantic hierarchical clustering according to some embodiments of the present invention.

Fig. 11 is a schematic structural diagram of an electronic device according to some embodiments of the present invention.

Detailed Description

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Fig. 1 is a schematic flow chart of a patent topic map generation method based on deep semantic hierarchical clustering according to some embodiments of the present invention. Referring to fig. 1, the patent theme map generation method based on deep semantic hierarchical clustering of the present invention includes the following steps:

FIG. 2 is a flow chart of a training method of a patent deep semantic representation model according to some embodiments of the invention. Referring to FIG. 2, in some embodiments, the patent deep semantic representation model is trained using the following method:

Extracting, from each patent document in the seed patent document set PS, its application number and the citing and cited relations among patents, constructing a patent citation relation network, and storing the patent citation relation network in edge list format. An edge list is a data structure that represents the edges of a graph. Each item in the edge list typically contains two vertices, indicating that an edge exists between them. Such a data structure may be an array, a linked list, or any other collection that can store edges in order.
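As a minimal sketch of the edge list described above, the following Python snippet stores citation edges as (source, target) pairs and derives, for each patent, the set of patents it cites or is cited by. The patent numbers and the helper name build_neighbors are hypothetical and used only for illustration; the field names follow the source_patent_num and target_patent_num columns of FIG. 5.

```python
from collections import defaultdict

# Hypothetical citation edges: (source_patent_num, target_patent_num).
# The patent numbers below are made-up placeholders, not real records.
edge_list = [
    ("P001", "P002"),  # P001 cites P002
    ("P001", "P003"),  # P001 cites P003
    ("P004", "P001"),  # P004 cites P001
]


def build_neighbors(edges):
    """Map each patent to the set of patents it cites or is cited by."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)  # citing -> cited
        neighbors[target].add(source)  # cited -> citing
    return neighbors


neighbors = build_neighbors(edge_list)
print(sorted(neighbors["P001"]))  # ['P002', 'P003', 'P004']
```

Treating the relation as undirected here reflects that both cited and citing patents count as related when training triplets are formed; a directed representation would work equally well if the direction must be kept.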

Constructing a triplet margin loss function $\mathrm{Loss} = \max\big(d(P_t, P_m) - d(P_t, P_n) + k,\ 0\big)$ as the objective function based on the citing and cited relations between patents, wherein P_t represents any patent document in the seed patent document set PS, P_m represents a patent cited by or citing P_t, P_n is a patent that neither cites nor is cited by P_t, d(·,·) is an L2-norm distance function with $d(P_t, P_m) = \lVert v_t - v_m \rVert^2$ and $d(P_t, P_n) = \lVert v_t - v_n \rVert^2$, v_t, v_m and v_n are the semantic representation vectors of patents P_t, P_m and P_n output by the Transformer neural network, and k is a hyperparameter that is set to 1 by default and adjusted according to the actual training data. This objective function reduces, as far as possible, the distance between the semantic representation vector of patent document P_t and that of a cited or citing patent P_m, while enlarging the distance between P_t and a patent P_n that neither cites nor is cited by P_t. As a result, patents with similar topics are mapped to nearby positions in the semantic vector space, while patents with dissimilar topics are mapped to points that are farther apart.
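The behavior of this objective function can be illustrated with a small pure-Python sketch. It assumes that d is the squared L2 distance between representation vectors and uses toy two-dimensional vectors; the function names are hypothetical, and a real implementation would compute the loss on the Transformer outputs inside a deep learning framework.

```python
def squared_l2(u, v):
    """Squared L2 distance between two equal-length vectors (assumed form of d)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_margin_loss(v_t, v_m, v_n, k=1.0):
    """max(d(P_t, P_m) - d(P_t, P_n) + k, 0) for one triplet."""
    return max(squared_l2(v_t, v_m) - squared_l2(v_t, v_n) + k, 0.0)


# Toy vectors (illustrative values only).
v_t = [0.0, 0.0]  # anchor patent P_t
v_m = [1.0, 0.0]  # cited or citing patent P_m

# Negative is already far away: 1 - 9 + 1 < 0, so the loss is clipped to 0.
print(triplet_margin_loss(v_t, v_m, [3.0, 0.0]))  # 0.0

# Negative is closer than the positive: 1 - 0.25 + 1 = 1.75.
print(triplet_margin_loss(v_t, v_m, [0.5, 0.0]))  # 1.75
```

The loss is zero once the unrelated patent is at least k farther from the anchor than the related patent, so only triplets that violate the margin contribute to training.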

FIG. 3 is a flow chart of a method of generating a corresponding topic description for each non-leaf node on a hierarchical cluster tree structure in accordance with some embodiments of the invention. Referring to FIG. 3, in some embodiments, the generating a corresponding topic description for each non-leaf node on the hierarchical cluster tree structure includes:

For a non-leaf node C_j, if C_j has no descendant non-leaf node, the titles, abstracts and main claims of all patent documents in PC_j = {P_i, …, P_j} are concatenated into a text segment Text(PC_j), and the topic description corresponding to the non-leaf node C_j is automatically generated from Text(PC_j) using an encoder-decoder text generation model, where Text(PC_j) represents the title, abstract and main claim text of all patent documents in PC_j; in some examples, the main claim is the independent claim or the first claim.

If C_j has descendant non-leaf nodes {C'_l, …, C'_m}, the titles, abstracts and main claims of all patent documents in PC_j = {P_i, …, P_j} are concatenated into a text segment Text({P_i, …, P_j}), the topic descriptions corresponding to the descendant non-leaf nodes {C'_l, …, C'_m} of C_j are concatenated into a second text segment, and the topic description corresponding to C_j is then generated from these text segments using the encoder-decoder text generation model.

Fig. 4 is a schematic structural diagram of a patent topic map generation system based on deep semantic hierarchical clustering according to some embodiments of the present invention. Referring to fig. 4, in some embodiments, the patent topic map generation system includes:

A target patent document set obtaining module 100, configured to obtain a patent document set PT = {P_1, P_2, …, P_N}, where P_i (i = 1, 2, …, N) represents one patent document in the patent document set PT, and N is a natural number;

A semantic representation vector matrix generating module 200, configured to semantically encode each patent document in the patent document set PT using the patent deep semantic representation model, encoding P_i into a deep semantic representation vector v_i so as to obtain a semantic representation vector matrix V = {v_1, …, v_N};

A hierarchical clustering tree structure generating module 300, configured to input the semantic representation vector matrix V of the patent document set PT = {P_1, P_2, …, P_N} into a hierarchical clustering algorithm and obtain a hierarchical clustering tree structure corresponding to the patent document set PT, where the hierarchical clustering tree structure includes non-leaf nodes and a root node, and each patent document in the patent document set PT is connected to a non-leaf node on the hierarchical clustering tree structure;

A topic description generation module 400, configured to generate a corresponding topic description for each non-leaf node on the hierarchical clustering tree structure;

A patent topic map generation module 500, configured to merge the generated topic descriptions and the hierarchical clustering tree structure into a patent topic map with an upper-lower hierarchical structure.

FIG. 5 is a schematic diagram of a patent document collection stored in EdgeList format for training the patent deep semantic representation model according to some embodiments of the present application. Referring to fig. 5, in some embodiments of the present application, the training method for the patent deep semantic representation model may collect a seed patent document set PS = {P_1, P_2, …, P_M}, where P_i (i ∈ [1, M]) represents one patent document in the seed patent document set, also referred to as a seed patent. In some embodiments, the data of the seed patent documents is obtained from the USPTO (United States Patent and Trademark Office) patent database. Of course, in some embodiments, the data of the seed patent documents may be obtained from other commercial or official patent databases. The data of a seed patent document includes at least its patent number, its cited and citing patents, and its patent text data. In some embodiments, the patent number may be a patent application number, a patent publication number, or a patent announcement number. It is understood that the data of a seed patent document can be retrieved from the patent database by its patent number. Cited patents are patent documents cited by a seed patent in the seed patent document set, for example, patent documents cited by an examiner as reference documents or prior art during examination, patent documents cited as background art in the drafting of the patent application, and earlier patent documents cited as priority documents. Citing patents are patent documents that cite the seed patent. Like cited patents, citing patents include patent documents against which the seed patent was cited by an examiner as a reference document or prior art during examination, patent documents that cite the seed patent as background art, and later patent documents that cite the seed patent as priority. The patent text data includes at least the title, abstract and claims of the patent document. Using parse-uspto-xml, an open-source tool, the patent number of the seed patent P_i (source_patent_num in fig. 5), the citing and cited relations among patents, and the patent numbers of the cited and citing patents of P_i (target_patent_num in fig. 5) are extracted, and a patent citation relation network is constructed and stored in EdgeList format. The citing and cited relations among patents include the relations between the seed patent P_i and its cited and citing patents. In the embodiment shown in fig. 5, the patent numbers are patent publication numbers. The title of a patent document is its patent name.

Fig. 6 is a schematic diagram of word segmentation preprocessing performed by the word segmentation algorithm according to some embodiments of the present invention. Referring to fig. 6, the title, abstract and claim text of each node patent in the patent citation relation network is preprocessed with the WordPiece word segmentation algorithm, which uses a greedy longest-match search to split the original text into subwords. This preprocessing yields the segmented title, abstract and claim text of each patent in the form of word text.
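The greedy longest-match search mentioned above can be sketched as follows. This is a simplified illustration rather than the tokenizer actually used: the toy vocabulary is hypothetical, and the "##" prefix for word-internal pieces and the "[UNK]" fallback follow a common WordPiece convention that is assumed here.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subwords."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no vocabulary entry matches at this position
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary.
vocab = {"cluster", "##ing", "##s", "patent"}
print(wordpiece_tokenize("clustering", vocab))  # ['cluster', '##ing']
print(wordpiece_tokenize("patents", vocab))     # ['patent', '##s']
print(wordpiece_tokenize("graph", vocab))       # ['[UNK]']
```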

A triplet margin loss function $\mathrm{Loss} = \max\big(d(P_t, P_m) - d(P_t, P_n) + k,\ 0\big)$ is constructed based on the citation relations between patents and can be used as the objective function, where P_t represents any patent document in the seed patent document set PS, P_m represents a patent cited by or citing P_t, P_n is a patent that neither cites nor is cited by P_t, d(·,·) is the L2-norm distance function with $d(P_t, P_m) = \lVert v_t - v_m \rVert^2$, and v_t is the semantic representation vector of patent P_t output by the Transformer neural network.

For any patent document P_t in the seed patent document set PS, K pairs (P_m, P_n) are extracted from the patent citation relation network and combined with P_t to form K triplets for training a Transformer neural network encoder, where K is set to 5 (K denotes the number of pairs and is distinct from the margin hyperparameter k). Specifically, the Transformer neural network is trained on the triplet data with the objective function to obtain the patent deep semantic representation model M.
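One possible way to form the triplets for a given patent is sketched below. The source does not specify how the pairs are chosen, so uniform random sampling with replacement is an assumption made here, and the function name, parameter names and patent numbers are hypothetical.

```python
import random


def sample_triplets(anchor, neighbors, all_patents, k_pairs=5, seed=0):
    """Return k_pairs (anchor, related, unrelated) triplets, or [] if impossible."""
    rng = random.Random(seed)
    related = neighbors.get(anchor, set())
    # Positives: patents that cite or are cited by the anchor.
    positives = sorted(related)
    # Negatives: every other patent with no citation link to the anchor.
    negatives = sorted(p for p in all_patents if p != anchor and p not in related)
    if not positives or not negatives:
        return []
    return [(anchor, rng.choice(positives), rng.choice(negatives))
            for _ in range(k_pairs)]


# Hypothetical citation neighborhood and patent pool.
neighbors = {"P001": {"P002", "P003"}}
all_patents = ["P001", "P002", "P003", "P004", "P005"]
triplets = sample_triplets("P001", neighbors, all_patents)
print(len(triplets))  # 5
```

Each returned tuple corresponds to (P_t, P_m, P_n) and can be passed to the loss function once the three patents have been encoded.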

FIG. 7 is a schematic diagram of a semantic representation vector matrix obtained according to some embodiments of the present invention. Referring to fig. 7, in some embodiments of the present invention, a patent topic map generation method based on deep semantic hierarchical clustering includes: acquiring the patent document set PT = {P_1, P_2, …, P_N} for which the topic map is to be generated, where P_i (i ∈ [1, N]) represents one patent document in the patent document set PT. P_i consists of at least three parts of text: title (Title), abstract (Abstract) and claims (Claims). P_i is encoded into a deep semantic representation vector v_i = M(P_i) using the patent deep semantic representation model M constructed in the above embodiment. Each patent document in PT is semantically encoded in turn to obtain the semantic representation vector matrix V = {v_1, …, v_N}. In some embodiments, the patent document set for which the topic map is to be generated may be obtained from a patent database. For example, using the patent applicant as the search field, all patent applications of a specific applicant are retrieved from the patent database as the patent document set for which the topic map is to be generated. In some embodiments, the title (Title), abstract (Abstract) and claim (Claims) text of each patent in this patent document set may be obtained. In some embodiments, the text of the technical field and summary of the invention sections of each patent in this patent document set may also be obtained. Correspondingly, in the training process of the patent deep semantic representation model M, the text of the summary of the invention section of each patent in the seed patent document set can be acquired for training. Here, the summary section refers to the text of the summary of the invention (or summary of the utility model) section in the specification of the patent document.

FIG. 8 is a schematic diagram of a hierarchical cluster tree structure generated in accordance with some embodiments of the invention. FIG. 9 is a schematic illustration of a patent topic map generated in accordance with some embodiments of the invention. Referring to fig. 8 and 9, in some embodiments, the semantic representation vector matrix V of the patent document set PT = {P_1, P_2, …, P_N} is input into the hierarchical clustering algorithm HDBSCAN to obtain a hierarchical clustering tree structure CTree corresponding to the patent document set PT, in which any patent document P_i (i ∈ [1, N]) (shown as a small dot at the end of the hierarchical clustering tree structure in fig. 8) is connected to a non-leaf node on CTree (shown as a large dot in the middle of the hierarchical clustering tree structure in fig. 8). The HDBSCAN hierarchical clustering algorithm requires only the minimum cluster size to be chosen, and the algorithm then automatically recommends the optimal clustering result.

A corresponding topic description is generated for each non-leaf node on the hierarchical clustering tree structure CTree, the non-leaf nodes being shown as large dots in the middle of the hierarchical clustering tree structure in fig. 9. The procedure is as follows:

The distances from all non-leaf nodes to the root node are calculated, and the nodes are sorted by distance in descending order. The root node is shown as the large dot at the head of the hierarchical clustering tree structure in fig. 9, and the non-leaf nodes are the nodes between the root node and the nodes corresponding to patent documents (shown as small dots at the ends of the hierarchical clustering tree structure in fig. 9). The nodes corresponding to patent documents are referred to as patent nodes.

Each non-leaf node C_j includes a patent document cluster PC_j, which is composed of a plurality of patent documents with similar topics, that is, PC_j = {P_i, …, P_j}, where P_i, …, P_j are patent documents in the patent document set PT;

For a non-leaf node C_j (such as a second-level non-leaf node in FIG. 9), if C_j has no descendant non-leaf node, that is, C_j is directly connected to the patent nodes corresponding to the patent document cluster PC_j, the titles, abstracts and claims of all patent documents in PC_j = {P_i, …, P_j} are concatenated into a text segment Text(PC_j), and the topic description corresponding to the non-leaf node C_j is automatically generated from Text(PC_j) using an encoder-decoder text generation model EDM, where Text(PC_j) represents the title, abstract and claim text of all patent documents in PC_j. In some examples, the claims used may be only the independent claim or the first claim. Of course, in some embodiments, the titles, abstracts, claims and summary of the invention sections of all patent documents in PC_j = {P_i, …, P_j} may also be concatenated into the text segment for processing.

If the non-leaf node C_j (e.g., a first-level non-leaf node in FIG. 9) has descendant non-leaf nodes {C'_l, …, C'_m} (e.g., second-level non-leaf nodes in FIG. 9), that is, C_j is not directly connected to patent nodes, then the titles, abstracts and main claims of all patent documents in PC_j = {P_i, …, P_j} are concatenated into a text segment Text({P_i, …, P_j}), the topic descriptions corresponding to the descendant non-leaf nodes {C'_l, …, C'_m} of C_j are concatenated into a second text segment, and the topic description corresponding to C_j is then generated from these text segments using the encoder-decoder text generation model. In this way, the topic description of a non-leaf node that has descendant non-leaf nodes can be more accurate and more general, matching its level in the hierarchy.
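The bottom-up procedure described above can be sketched in Python as follows. Several simplifications are assumed: the cluster of an upper-level node is taken to be all patents in its subtree, only direct child nodes contribute their topic descriptions, the distance to the root is measured as tree depth, and stub_edm is a placeholder that stands in for the encoder-decoder text generation model rather than a real generator. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    patents: list = field(default_factory=list)   # patent texts attached directly
    children: list = field(default_factory=list)  # child non-leaf nodes
    topic: str = ""


def cluster_texts(node):
    """All patent texts in the node's cluster (its own plus its descendants')."""
    texts = list(node.patents)
    for child in node.children:
        texts.extend(cluster_texts(child))
    return texts


def nodes_by_depth(root):
    """(depth, node) pairs for every node below the root, deepest first."""
    out, stack = [], [(0, root)]
    while stack:
        depth, node = stack.pop()
        if depth > 0:
            out.append((depth, node))
        stack.extend((depth + 1, child) for child in node.children)
    return sorted(out, key=lambda pair: -pair[0])


def stub_edm(text):
    """Placeholder for the encoder-decoder model; NOT a real text generator."""
    return "topic<" + text[:30] + ">"


def describe(root, generate=stub_edm):
    """Fill in node.topic bottom-up so child topics exist before parents need them."""
    for _, node in nodes_by_depth(root):
        text = " ".join(cluster_texts(node))
        if node.children:  # append the topic descriptions of the child nodes
            text += " " + " ".join(child.topic for child in node.children)
        node.topic = generate(text)


lens = Node("C2", patents=["camera lens"])
sensor = Node("C3", patents=["image sensor"])
imaging = Node("C1", children=[lens, sensor])
root = Node("root", children=[imaging])
describe(root)
print(lens.topic)  # topic<camera lens>
```

Processing nodes in descending order of distance from the root is what guarantees that the descriptions of lower-level nodes are available when an upper-level node is described.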

The generated topic descriptions and the hierarchical clustering tree structure are merged into a patent topic map with an upper-lower hierarchical structure. In some embodiments, the patent nodes and the patent documents corresponding to the patent nodes may also be displayed on the patent topic map through the display module. The displayed information for each patent document includes at least its patent number and title. The patent number may be a patent application number, a patent publication number, or a patent announcement number. Displaying the patent numbers and titles of the patent documents improves the user's interactive experience, making it clear at a glance which patent documents correspond to each topic description. In some embodiments, the information displayed for a patent document may be customized by the user. For example, it may be set to display only the patent number. In some embodiments, the type of patent number may be customized by the user. For example, if the user sets the patent number to be the patent application number, the patent application number is displayed as the patent number. If the user sets the patent number to be the patent publication number, the patent publication number is displayed as the patent number.
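As a rough illustration of this merging step, the sketch below combines a tree (given as a parent-to-children mapping), the generated topic descriptions and the patents attached to each node into one nested dictionary. The node identifiers, topic strings and the nested-dictionary layout are all hypothetical; the source does not prescribe a storage format for the patent topic map.

```python
def build_topic_map(node_id, children, topics, patents):
    """Recursively attach topic descriptions and patents to the tree structure."""
    return {
        "id": node_id,
        "topic": topics.get(node_id, ""),
        "patents": patents.get(node_id, []),
        "children": [build_topic_map(child, children, topics, patents)
                     for child in children.get(node_id, [])],
    }


# Hypothetical tree, topic descriptions and patent assignments.
children = {"root": ["C1"], "C1": ["C2", "C3"]}
topics = {"C1": "imaging", "C2": "lenses", "C3": "sensors"}
patents = {"C2": ["P001"], "C3": ["P002", "P003"]}

topic_map = build_topic_map("root", children, topics, patents)
print(topic_map["children"][0]["topic"])  # imaging
```

A nested structure of this kind can be serialized to JSON and handed to a display module for rendering.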

In some embodiments, the topic information corresponding to the root node may also be displayed on the patent topic map by the display module. The topic information corresponding to the root node may be the search field information shared by all of the patent documents. For example, if the search was performed on the patent applicant field, the patent applicant information is used as the topic information corresponding to the root node. In some embodiments, keyword extraction may be performed on the search field. For example, if the patent applicant is the search field and the searched applicant is "Huawei Technologies Co., Ltd.", then after keyword extraction "Huawei" is displayed on the root node of the patent topic map as the corresponding topic information.

In some embodiments, at most 5 patent nodes are displayed for each non-leaf node that is directly connected to patent nodes, and the patent document corresponding to each displayed patent node is shown on the patent topic map (as shown in FIG. 9). In some embodiments, the patent nodes and corresponding patent documents to display may be determined by ranking the patents according to their degree of match with the topic description. In this way, the desired patent documents and the patent topic map can be presented well within the limited space of the interface. In some embodiments, the display mode may be customized; for example, the user may set the maximum number of patent nodes displayed for each non-leaf node directly connected to patent nodes, so that the number of patent documents displayed under a non-leaf node follows the user's own selection. If the user makes no custom selection, each such non-leaf node displays at most 5 patent nodes by default. If more than 5 patent nodes are connected to the non-leaf node, the remaining patent nodes are hidden, and an expansion control may be provided that the user triggers to display all of them. Setting the display mode in this way, with a default maximum of 5 displayed patent nodes, further improves the interactive visual experience and operating efficiency. Specifically, the expansion control includes control prompt information, for example the text "click to view more patents". In some embodiments, the control prompt information includes the topic description of the non-leaf node to which the patent nodes belong. For example, if the topic description of that non-leaf node (a second-level non-leaf node in FIG. 9) is "photographing", the control prompt information is the text "click to view more [photographing] patents". The control prompt information is connected to the non-leaf node and positioned below the displayed patent documents.
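The default display rule described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_visible` is hypothetical, each patent is represented as a pair of a patent number and a matching score, and how the matching score is computed is left open, as in the description.

```python
from typing import List, Optional, Tuple

DEFAULT_MAX_VISIBLE = 5  # default number of patent nodes shown per non-leaf node


def select_visible(patents: List[Tuple[str, float]],
                   topic: str,
                   max_visible: Optional[int] = None):
    """Split patents into shown and hidden lists and build the control prompt.

    patents: (patent_number, matching_score) pairs for one non-leaf node.
    max_visible: user-defined limit; None falls back to the default of 5.
    """
    limit = DEFAULT_MAX_VISIBLE if max_visible is None else max_visible
    # Rank by matching degree with the topic description, best match first.
    ranked = sorted(patents, key=lambda p: p[1], reverse=True)
    visible = [number for number, _ in ranked[:limit]]
    hidden = [number for number, _ in ranked[limit:]]
    # The expansion control is only needed when some patent nodes are hidden.
    prompt = f"click to view more [{topic}] patents" if hidden else None
    return visible, hidden, prompt
```

Triggering the expansion control would then amount to displaying `visible + hidden` for the node.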

In some embodiments, referring to FIG. 10, the patent topic map generation system 1000 further includes a display module 600 configured to display the patent topic map and the corresponding patent document clusters on a display interface. Through the display module, the patent nodes and the patent documents corresponding to them are displayed on the patent topic map. The display module displays patent documents in the same manner as in the above embodiments. The display module includes the expansion control. Specifically, the expansion control includes control prompt information, for example the text "click to view more patents". In some embodiments, the control prompt information includes the topic description of the non-leaf node to which the patent nodes belong. For example, if the topic description of that non-leaf node is "photographing", the control prompt information is the text "click to view more [photographing] patents". The control prompt information is connected to the non-leaf node and positioned below the displayed patent documents.

The patent topic map generation system provided by the embodiments of the invention can implement each process of the method embodiments described above and can be applied to the electronic device described below. Its effects are the same as those of the method and, to avoid repetition, are not described again here.

In some embodiments, an electronic device is also provided, as shown in FIG. 11, including: a processor 10; and a memory 20 storing processor-executable instructions, wherein the processor reads the instructions from the memory to implement the steps of the patent topic map generation method described in any of the above embodiments and achieves the same technical effects, which, to avoid repetition, are not described again here.

In some embodiments, the electronic device includes a display or a display screen, where the display or the display screen is configured to perform a function of the display module, and a display interface may be generated on the display or the display screen, and the patent theme map and the corresponding patent document class cluster are displayed on the display interface.

In some embodiments, the electronic device may include, but is not limited to, a smart phone, a tablet, a wearable device, a Personal Computer (PC), a netbook, a Personal Digital Assistant (PDA), a smart watch, an in-vehicle device, a robot, a desktop computer, and the like.

Some embodiments also provide a computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed in a computer, causes the computer to perform the patent topic map generation method described in the above method embodiments.

Some embodiments also provide a computer readable storage medium having stored thereon data of the above-described patent topic map generation method and/or system.

The computer readable storage medium may be a memory that can be used to store software programs as well as various data. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store the above-mentioned computer program, and the data storage area may store data for the patent deep semantic representation model, the word segmentation algorithm, the text generation model, and the like described above. In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.

In some embodiments, the computer readable storage medium of the present application may include one or more databases, such as a key-value database or a MySQL database; the present application does not limit the type of each database or its data storage manner. The one or more databases of some embodiments of the present application may be integrated with the electronic device, or may exist as a separate server or in cloud storage form, as determined by the system structure and application requirements of the platform to which the present application is applied.

The processor is a control center of the electronic device, and connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or models stored in the memory, and calling data stored in the memory, thereby performing overall control of the electronic device. The processor may include one or more processing units; preferably, the processor may integrate an application processor and a modem processor, wherein the application processor primarily handles operating systems, user interfaces, applications, etc., and the modem processor primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor.

In some embodiments, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the patent topic map generation method described in any of the embodiments above.

Those of skill in the art will appreciate that in one or more of the above examples, the functions described in the various embodiments disclosed herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.

In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. A patent theme map generation method based on deep semantic hierarchical clustering is characterized by comprising the following steps:

2. The patent topic map generation method based on deep semantic hierarchical clustering according to claim 1, wherein the patent deep semantic representation model is trained by the following method:

Constructing a triplet margin loss function $\mathrm{Loss} = \max\{(d(P_t, P_m) - d(P_t, P_n)) + k,\ 0\}$ as the objective function based on the citing and cited relationships between patents, wherein $P_t$ represents any patent document in the seed patent document set PS, $P_m$ represents a patent that cites or is cited by $P_t$, $P_n$ is a patent that neither cites nor is cited by $P_t$, $d(\cdot)$ is an L2-norm distance function with $d(P_t, P_m) = \lVert v_t - v_m \rVert^2$ and $d(P_t, P_n) = \lVert v_t - v_n \rVert^2$, $v_t$, $v_m$, $v_n$ are the semantic representation vectors of patents $P_t$, $P_m$, $P_n$ output by the Transformer neural network, respectively, and $k$ is a hyperparameter;

3. The patent topic map generation method based on deep semantic hierarchical clustering according to claim 1, wherein the hierarchical clustering algorithm is the HDBSCAN clustering algorithm.

4. The method for generating a patent topic map based on deep semantic hierarchical clustering according to claim 1, wherein said generating a corresponding topic description for each non-leaf node on the hierarchical clustering tree structure comprises:

5. The patent topic map generation method based on deep semantic hierarchical clustering according to claim 2, wherein the seed patent document set PS = {P_1, P_2, …, P_M} is obtained from a USPTO database; the patent citation relation network is constructed by using a parse-uspto-xml tool to extract, from each seed patent document P_x, the application number and the citing and cited relationships among the patents, taking the patents as nodes and the citing and cited relationships among the patents as edges.

6. The patent topic map generation method based on deep semantic hierarchical clustering according to claim 2, wherein the word segmentation algorithm is the WordPiece subword-level word segmentation algorithm.

7. A patent topic map generation system based on deep semantic hierarchical clustering, characterized in that the patent topic map generation system comprises:

8. The system for generating a patent topic map based on deep semantic hierarchical clustering according to claim 7, further comprising a display module for displaying the patent topic map and corresponding patent document class clusters on a display interface.

9. An electronic device, the electronic device comprising:

a processor;

a memory storing processor-executable instructions, wherein:

the processor reads the instructions from the memory to implement the steps of the method according to any of claims 1-6.

10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any of claims 1-6.
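As an illustration of the objective function recited in claim 2, the following minimal Python sketch evaluates the triplet margin loss for a single triplet of vectors. It assumes that the distance is the squared Euclidean distance between the two vectors, as the exponent in the claim suggests; the function names `distance` and `triplet_margin_loss` and the default value of `k` are hypothetical, and no training loop or Transformer encoder is shown.

```python
from typing import Sequence


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two representation vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_margin_loss(v_t: Sequence[float],
                        v_m: Sequence[float],
                        v_n: Sequence[float],
                        k: float = 1.0) -> float:
    """Loss = max(d(P_t, P_m) - d(P_t, P_n) + k, 0).

    v_t: vector of the anchor patent P_t
    v_m: vector of a patent that cites or is cited by P_t
    v_n: vector of a patent with no citation relationship to P_t
    k:   margin hyperparameter (the default here is an arbitrary choice)
    """
    return max(distance(v_t, v_m) - distance(v_t, v_n) + k, 0.0)
```

The loss is zero once the unrelated patent is farther from the anchor than the citation-related patent by at least the margin k, which is what pushes citation-linked patents closer together in the representation space.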