CN119669443A - Content retrieval method, device, computer equipment and readable storage medium - Google Patents
Content retrieval method, device, computer equipment and readable storage medium
- Publication number
- CN119669443A (CN202411782174.8A)
- Authority
- CN
- China
- Prior art keywords
- content
- vector
- search
- retrieval
- sparse
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present application relates to a content retrieval method, apparatus, computer device, computer readable storage medium and computer program product. The method comprises the steps of obtaining a content retrieval request, respectively constructing high-dimensional vector representation and sparse vector representation based on the content retrieval request, searching in a preset high-dimensional vector library based on an approximate retrieval mode according to the high-dimensional vector representation to obtain a first retrieval result, wherein the high-dimensional vector library comprises high-dimensional vector representations corresponding to candidate contents, searching in a preset sparse vector library according to the sparse vector representation to obtain a second retrieval result, wherein the sparse vector library comprises sparse vector representations corresponding to the candidate contents, and obtaining a content retrieval result corresponding to the content retrieval request based on the first retrieval result and the second retrieval result. By adopting the method, the accuracy of content retrieval can be improved.
Description
Technical Field
The present application relates to the field of computer technology, and in particular, to a content retrieval method, apparatus, computer device, computer readable storage medium, and computer program product.
Background
With the rapid development of information technology, knowledge base systems are increasingly widely used in information management and knowledge acquisition. Images and documents are the main carriers of information, and their number is growing explosively. Efficient retrieval and use of the information in such images and documents is becoming critical, whether in the scientific, educational, medical, or business fields. However, conventional content retrieval methods have low accuracy when faced with millions or even tens of millions of images and documents.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a content retrieval method, apparatus, computer device, computer-readable storage medium, and computer program product that can improve the accuracy of content retrieval.
In a first aspect, the present application provides a content retrieval method, including:
Acquiring a content retrieval request, and respectively constructing a high-dimensional vector representation and a sparse vector representation based on the content retrieval request;
Searching in a preset high-dimensional vector library based on an approximate searching mode according to the high-dimensional vector representation to obtain a first searching result;
Searching in a preset sparse vector library according to the sparse vector representation to obtain a second search result;
And obtaining a content retrieval result corresponding to the content retrieval request based on the first retrieval result and the second retrieval result.
In a second aspect, the present application also provides a content retrieval device, including:
the retrieval request acquisition module is used for acquiring a content retrieval request and respectively constructing a high-dimensional vector representation and a sparse vector representation based on the content retrieval request;
the high-dimensional vector retrieval module is used for retrieving in a preset high-dimensional vector library based on an approximate retrieval mode according to the high-dimensional vector representation to obtain a first retrieval result;
The sparse vector retrieval module is used for retrieving in a preset sparse vector library according to the sparse vector representation to obtain a second retrieval result;
And the retrieval result obtaining module is used for obtaining a content retrieval result corresponding to the content retrieval request based on the first retrieval result and the second retrieval result.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the steps of the above content retrieval method when the processor executes the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the above content retrieval method.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the above content retrieval method.
In the above content retrieval method, apparatus, computer device, computer-readable storage medium and computer program product, a high-dimensional vector representation and a sparse vector representation are respectively constructed based on the content retrieval request; a first retrieval result is obtained by searching a preset high-dimensional vector library in an approximate retrieval mode according to the high-dimensional vector representation; a second retrieval result is obtained by searching a preset sparse vector library according to the sparse vector representation; and the content retrieval result is obtained based on the first retrieval result and the second retrieval result. Because both a high-dimensional vector representation and a sparse vector representation are constructed for the content retrieval request and searched separately, multi-dimensional content retrieval over high-dimensional vectors and sparse vectors can be realized, which improves the accuracy of content retrieval.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings that are needed in the description of the embodiments of the present application or the related technologies will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other related drawings may be obtained according to these drawings without inventive effort to those of ordinary skill in the art.
FIG. 1 is a diagram of an application environment for a content retrieval method in one embodiment;
FIG. 2 is a flow diagram of a content retrieval method in one embodiment;
FIG. 3 is a flow diagram of determining content retrieval results in one embodiment;
FIG. 4 is a schematic flow chart of a Chinese character library in another embodiment;
FIG. 5 is a flow diagram of multiple document library retrieval in one embodiment;
FIG. 6 is a flow chart of a content retrieval method according to yet another embodiment;
FIG. 7 is a block diagram of a content retrieval device in one embodiment;
FIG. 8 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The content retrieval method provided by the embodiments of the present application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. A data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104 or may be located on a cloud or other network server.
A user may send a content retrieval request to the server 104 through the terminal 102 to request the server 104 for content retrieval. After obtaining the content search request sent by the terminal 102, the server 104 respectively constructs a high-dimensional vector representation and a sparse vector representation based on the content search request, searches a preset high-dimensional vector library according to the high-dimensional vector representation based on an approximate search mode to obtain a first search result, searches a preset sparse vector library according to the sparse vector representation to obtain a second search result, and the server 104 obtains the content search result based on the first search result and the second search result. The server 104 may also send the obtained content search result to the terminal 102, so as to display the corresponding content search result in the terminal 102.
The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, Internet of Things device or portable wearable device. The Internet of Things device may be a smart speaker, smart television, smart air conditioner, smart vehicle-mounted device, projection device, or the like. The portable wearable device may be a smart watch, smart bracelet, head-mounted device, or the like. The head-mounted device may be a Virtual Reality (VR) device, an Augmented Reality (AR) device, smart glasses, or the like. The server 104 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services.
In an exemplary embodiment, as shown in FIG. 2, a content retrieval method is provided. The method is described by taking its application to the server in FIG. 1 as an example, and includes the following steps 202 to 208:
Step 202, obtaining a content retrieval request, and respectively constructing a high-dimensional vector representation and a sparse vector representation based on the content retrieval request.
The content retrieval request is a message requesting content retrieval, and may be sent to the server by the terminal. The content retrieval request may carry a retrieval statement used to retrieve the desired content. For example, the content retrieval request may include a document keyword in order to retrieve documents that match the keyword. The high-dimensional vector representation is a vector representation comprising a plurality of dimension elements; each dimension element may characterize one feature, so the high-dimensional vector representation characterizes the content retrieval request from a plurality of dimensions. The sparse vector representation is a vector representation in which most elements are zero and only a few are non-zero. By storing only the non-zero elements and their positions, the sparse vector representation can greatly reduce storage space and computation time and can represent as much knowledge as possible with fewer resources, thereby improving the efficiency and accuracy of data processing.
The server may obtain a content retrieval request sent by the user through the terminal, and the server may construct a corresponding vector representation based on the content retrieval request, and may specifically include a high-dimensional vector representation and a sparse vector representation. In some embodiments, the server may vector map the content retrieval request to map the content retrieval request into a high-dimensional vector representation and a sparse vector representation, respectively.
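The two representations can be illustrated with a minimal sketch. The source does not specify how the vector mapping is performed, so the toy hashed bag-of-words encoder below is purely an assumption, and `DENSE_DIM`, `SPARSE_DIM`, `tokenize`, `dense_vector` and `sparse_vector` are hypothetical names. A practical system would more likely use a trained embedding model for the high-dimensional vector. The point shown is the storage difference: the dense vector holds every dimension, while the sparse vector keeps only the non-zero positions and their values.

```python
import math
import zlib
from collections import Counter

DENSE_DIM = 8          # hypothetical; real embeddings use hundreds of dimensions
SPARSE_DIM = 2 ** 20   # hypothetical size of the sparse index space


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer; a stand-in for a real text analyzer."""
    return text.lower().split()


def _bucket(token: str, size: int) -> int:
    """Deterministic hash bucket (the built-in hash() is salted per process)."""
    return zlib.crc32(token.encode("utf-8")) % size


def dense_vector(text: str) -> list[float]:
    """Toy dense representation: hashed token counts, L2-normalised."""
    vec = [0.0] * DENSE_DIM
    for tok in tokenize(text):
        vec[_bucket(tok, DENSE_DIM)] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def sparse_vector(text: str) -> dict[int, float]:
    """Sparse representation: only non-zero positions and their values are stored."""
    counts = Counter(_bucket(tok, SPARSE_DIM) for tok in tokenize(text))
    return {index: float(count) for index, count in counts.items()}
```

With this layout a query of three tokens occupies at most three dictionary entries in its sparse form, regardless of how large `SPARSE_DIM` is.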
Step 204, searching in a preset high-dimensional vector library based on an approximate search mode according to the high-dimensional vector representation to obtain a first search result, wherein the high-dimensional vector library comprises high-dimensional vector representations corresponding to candidate contents.
The approximate search mode is a way of finding vectors similar to a target vector in the vector space without traversing the whole data set, and may comprise at least one of a tree-based method, a hash-based method, a graph-based method and a quantization-based method. The high-dimensional vector library is a vector database constructed in advance for the candidate contents, and may comprise the high-dimensional vector representations corresponding to the candidate contents. The candidate contents are the retrievable contents, such as contents stored in a knowledge base, from which a user can retrieve the desired target content through a content retrieval request. The first search result is the result obtained by searching in the high-dimensional vector library, and may specifically include the high-dimensional vector representations of candidate contents that match the high-dimensional vector representation of the content retrieval request. The corresponding candidate contents can be determined from the retrieved high-dimensional vector representations, thereby obtaining a search result corresponding to the content retrieval request.
For example, the server may obtain a preset high-dimensional vector library in which the high-dimensional vector representations corresponding to the respective candidate contents are stored. In some embodiments, the server may map each candidate content in advance to obtain its high-dimensional vector representation, and may construct the high-dimensional vector library from these representations. In some embodiments, the server may also record a mapping relationship between each candidate content and its high-dimensional vector representation, so that the corresponding candidate content can be determined from a given high-dimensional vector representation. The server may search the high-dimensional vector library based on the high-dimensional vector representation of the content retrieval request in an approximate search mode, for example based on an approximate nearest neighbor search (ANNS) algorithm, to obtain the first search result.
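To make the idea of searching without traversing the whole data set concrete, the following is a minimal sketch of one hash-based approach, random-hyperplane locality-sensitive hashing. It is an illustrative stand-in chosen for brevity, not the algorithm the application prescribes; the class name `HyperplaneLSH` and its parameters are hypothetical.

```python
import math
import random
from collections import defaultdict


class HyperplaneLSH:
    """Toy hash-based approximate search using random hyperplanes."""

    def __init__(self, dim: int, n_planes: int = 8, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]
        self.buckets: dict[tuple[bool, ...], list[str]] = defaultdict(list)
        self.vectors: dict[str, list[float]] = {}

    def _signature(self, vec: list[float]) -> tuple[bool, ...]:
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0.0 for plane in self.planes)

    def add(self, content_id: str, vec: list[float]) -> None:
        """Store a candidate vector under the bucket given by its signature."""
        self.vectors[content_id] = vec
        self.buckets[self._signature(vec)].append(content_id)

    def search(self, query: list[float], k: int = 3) -> list[tuple[str, float]]:
        """Return up to k (content id, distance) pairs from the query's bucket."""
        # Only one bucket is scanned, so the whole library is not traversed.
        candidates = self.buckets.get(self._signature(query), [])
        scored = [(cid, math.dist(query, self.vectors[cid])) for cid in candidates]
        return sorted(scored, key=lambda pair: pair[1])[:k]


index = HyperplaneLSH(dim=3, n_planes=4, seed=1)
index.add("doc-a", [1.0, 0.0, 0.0])
index.add("doc-b", [-1.0, 0.0, 0.0])
hits = index.search([1.0, 0.0, 0.0], k=1)
```

Because only a single bucket is inspected, a true neighbour that hashes to a different bucket is missed. That is the accuracy cost that approximate methods trade for speed, and production indexes reduce it with multiple hash tables or graph traversal.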
Step 206, searching in a preset sparse vector library according to the sparse vector representation to obtain a second search result, wherein the sparse vector library comprises sparse vector representations corresponding to candidate contents.
The sparse vector library is a vector database which is constructed for candidate contents in advance, and the sparse vector library can comprise sparse vector representations corresponding to the candidate contents. The second search result is a result obtained by searching in the sparse vector library, and specifically may include sparse vector representations corresponding to candidate contents matched with the sparse vector representations of the content search request.
For example, the server may query a pre-constructed sparse vector library and search it according to the sparse vector representation to obtain the second search result. In some embodiments, the server may map each candidate content in advance to obtain its sparse vector representation, and may construct the sparse vector library from the sparse vector representations of the candidate contents. The server may match the sparse vector representation of the content retrieval request against the sparse vector representations of the candidate contents in the sparse vector library and obtain the second search result from the matching result. For example, the sparse vector representations of the candidate contents that the matching result indicates as matched may be used as the second search result.
Step 208, obtaining a content retrieval result corresponding to the content retrieval request based on the first retrieval result and the second retrieval result.
The content retrieval result is the desired target content retrieved from the candidate contents based on the content retrieval request. Optionally, the server may determine the retrieved contents based on the first search result and the second search result, and obtain the content retrieval result corresponding to the content retrieval request from those retrieved contents. In some embodiments, the server may determine the retrieved contents based on the first search result and the second search result, and merge, deduplicate and sort them to obtain the content retrieval result corresponding to the content retrieval request, where the content retrieval result may include the target content that is retrieved from the candidate contents and matches the content retrieval request.
In some embodiments, different search weights may be set for the first search result and the second search result, and the first search result and the second search result may be weighted and combined according to the weights of the first search result and the second search result, so as to obtain a content search result corresponding to the content search request. In some embodiments, the respective weights of the first search result and the second search result may be configured based on the content search request, that is, different search requirements may set different weights, so that the first search result and the second search result are combined according to the corresponding weights of the content search request, and accuracy of the content search result may be further ensured.
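The weighted combination can be sketched as follows. The source states only that weights are applied to the two results; the linear score sum used here, the assumption that both inputs are already normalised to comparable scores in [0, 1] with higher meaning better, and the names `fuse_results`, `dense_weight` and `sparse_weight` are all illustrative assumptions.

```python
def fuse_results(
    dense_hits: dict[str, float],
    sparse_hits: dict[str, float],
    dense_weight: float = 0.5,
    sparse_weight: float = 0.5,
    top_k: int = 10,
) -> list[tuple[str, float]]:
    """Merge two score maps (content id -> score), deduplicate, and sort."""
    fused: dict[str, float] = {}
    for cid, score in dense_hits.items():
        fused[cid] = fused.get(cid, 0.0) + dense_weight * score
    for cid, score in sparse_hits.items():
        # A content found by both searches appears once, with both contributions.
        fused[cid] = fused.get(cid, 0.0) + sparse_weight * score
    # Sort by descending fused score; ties are broken by content id for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))[:top_k]


first = {"a": 1.0, "b": 0.5}    # hypothetical scores from the high-dimensional search
second = {"b": 1.0, "c": 0.5}   # hypothetical scores from the sparse search
balanced = fuse_results(first, second)
dense_heavy = fuse_results(first, second, dense_weight=0.8, sparse_weight=0.2)
```

With equal weights, content `b` ranks first because both searches found it. Shifting the weight towards the high-dimensional result moves `a` to the top, which illustrates how a per-request weight configuration changes the final ordering.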
In the above content retrieval method, a high-dimensional vector representation and a sparse vector representation are respectively constructed based on the content retrieval request; a first retrieval result is obtained by searching a preset high-dimensional vector library in an approximate retrieval mode according to the high-dimensional vector representation; a second retrieval result is obtained by searching a preset sparse vector library according to the sparse vector representation; and the content retrieval result is obtained based on the first retrieval result and the second retrieval result. Because both a high-dimensional vector representation and a sparse vector representation are constructed for the content retrieval request and searched separately, multi-dimensional content retrieval over high-dimensional vectors and sparse vectors can be realized, which improves the accuracy of content retrieval.
In an exemplary embodiment, the high-dimensional vector library comprises at least two content vector libraries. Searching in the preset high-dimensional vector library based on an approximate search mode according to the high-dimensional vector representation to obtain a first search result comprises: determining the approximate search algorithm of each of the at least two content vector libraries according to its database size; searching each content vector library according to the high-dimensional vector representation based on its approximate search algorithm, to obtain the respective search results of the at least two content vector libraries; and obtaining the first search result according to the respective search results of the at least two content vector libraries.
The high-dimensional vector library comprises at least two content vector libraries, and each content vector library can have different scales, namely the content vector library comprises high-dimensional vector representations corresponding to different amounts of candidate contents. Content vector libraries of different sizes may be searched for according to different approximate search algorithms.
For example, the server may determine the approximate search algorithm for each content vector library based on that library's database size. For a small-scale content vector library, the corresponding approximate search algorithm may include a FLAT indexing algorithm, and the search may be performed in that library based on the FLAT indexing algorithm. For a large-scale content vector library, the corresponding approximate search algorithm may include at least one of the ScaNN (Scalable Nearest Neighbors) or HNSW (Hierarchical Navigable Small World) indexing algorithms, that is, the search may be performed in that library based on the ScaNN or HNSW indexing algorithm. In some embodiments, after determining the approximate search algorithm of each content vector library, the server may search each library according to the high-dimensional vector representation based on its approximate search algorithm to obtain the search result of each content vector library, and may obtain the first search result based on these search results. For example, the server may merge and deduplicate the search results of the content vector libraries to obtain the first search result.
In this embodiment, when the high-dimensional vector library includes at least two content vector libraries, the server performs the retrieval respectively through the corresponding approximate retrieval algorithm based on the respective database sizes of the content vector libraries, so that the retrieval accuracy and the processing efficiency can be effectively balanced.
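The size-based choice can be sketched as a simple dispatch function. The source names FLAT for small libraries and ScaNN or HNSW for large ones but gives no numeric boundary, so the cut-off `SMALL_LIBRARY_MAX`, the `ContentVectorLibrary` type and the example library sizes below are hypothetical.

```python
from dataclasses import dataclass

SMALL_LIBRARY_MAX = 100_000  # hypothetical cut-off; the source gives no number


@dataclass(frozen=True)
class ContentVectorLibrary:
    name: str
    size: int  # number of stored high-dimensional vector representations


def choose_index_algorithm(library: ContentVectorLibrary) -> str:
    """Pick an index type from the library size."""
    if library.size <= SMALL_LIBRARY_MAX:
        return "FLAT"  # an exhaustive scan is affordable for a small library
    return "HNSW"      # graph-based index for a large library (ScaNN is the other named option)


libraries = [
    ContentVectorLibrary("images", 20_000),
    ContentVectorLibrary("documents", 5_000_000),
]
plan = {lib.name: choose_index_algorithm(lib) for lib in libraries}
```

Keeping the decision in one function means the threshold can be tuned, or a third tier added, without touching the search code for the individual libraries.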
In an exemplary embodiment, searching each of the at least two content vector libraries according to the high-dimensional vector representation based on its approximate search algorithm, to obtain the respective search results of the at least two content vector libraries, comprises: for a target content vector library among the at least two content vector libraries, calculating, according to the approximate search algorithm corresponding to the target content vector library, first vector matching parameters between the high-dimensional vector representation and the high-dimensional vector representations included in the target content vector library; determining, from the target content vector library and according to the first vector matching parameters, a first matching vector representation that matches the high-dimensional vector representation; and obtaining the respective search results of the at least two content vector libraries according to the first matching vector representations determined from each of the at least two content vector libraries.
The target content vector library is a content vector library which is currently subjected to search processing, and the server can traverse each content vector library, namely, the server can respectively perform search by taking each content vector library as the target content vector library. The first vector matching parameter is used to characterize a degree of matching between the high-dimensional vector representation of the content retrieval request and the high-dimensional vector representations included in the target content vector library, and may specifically include, but is not limited to, various metric parameters including similarity, vector distance, and the like. The first matching vector representation is a vector representation determined from the target content vector library that matches the high-dimensional vector representation of the content retrieval request.
Optionally, the server may traverse the content vector libraries and search each in turn. For the target content vector library currently being searched, the server may match the high-dimensional vector representation of the content retrieval request against the high-dimensional vector representations included in the target content vector library according to the corresponding approximate search algorithm, so as to obtain the corresponding first vector matching parameters. The server may then determine, based on the first vector matching parameters, the vector representations in the target content vector library that match the high-dimensional vector representation, which gives the first matching vector representations. For example, the first vector matching parameter may include a vector distance, and the server may determine a vector representation in the content vector library whose vector distance is less than a preset distance threshold as a first matching vector representation that matches the high-dimensional vector representation of the content retrieval request. After traversing all content vector libraries, the server has the first matching vector representations of each content vector library and may obtain the search result of each content vector library from them. For example, the server may directly use the first matching vector representations as the search result of the corresponding content vector library. In some embodiments, the server may also filter the first matching vector representations further, for example by merging and sorting them, so as to obtain the first search result.
In this embodiment, the server calculates the first vector matching parameter to determine, from the content vector library, a first matching vector representation that matches the high-dimensional vector representation of the content search request, so that accurate matching of the content vector library can be achieved based on the first vector matching parameter, thereby improving accuracy of the search result.
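The distance-threshold selection and the traversal over several content vector libraries can be sketched as below. For brevity the sketch computes the Euclidean distance by exhaustive scan, which stands in for whichever approximate algorithm each library actually uses; the function names, the choice of Euclidean distance and the example data are assumptions.

```python
import math


def match_by_distance(
    query: list[float],
    library: dict[str, list[float]],
    max_distance: float,
) -> list[tuple[str, float]]:
    """Return (content id, distance) pairs whose distance is below the threshold."""
    matches = []
    for cid, vec in library.items():
        distance = math.dist(query, vec)  # plays the role of the first vector matching parameter
        if distance < max_distance:
            matches.append((cid, distance))
    return sorted(matches, key=lambda pair: pair[1])


def search_all(
    query: list[float],
    libraries: list[dict[str, list[float]]],
    max_distance: float,
) -> list[tuple[str, float]]:
    """Traverse every content vector library, then merge and deduplicate the matches."""
    best: dict[str, float] = {}
    for library in libraries:
        for cid, distance in match_by_distance(query, library, max_distance):
            best[cid] = min(distance, best.get(cid, math.inf))
    return sorted(best.items(), key=lambda pair: pair[1])


library_one = {"a": [0.0, 0.0], "b": [3.0, 4.0]}
library_two = {"c": [1.0, 0.0], "a": [0.0, 0.0]}
first_search_result = search_all([0.0, 0.0], [library_one, library_two], max_distance=2.0)
```

Here `b` lies at distance 5 and is excluded by the threshold, while `a`, which appears in both libraries, is reported once.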
In an exemplary embodiment, searching in a preset sparse vector library according to the sparse vector representation to obtain a second search result comprises: calculating second vector matching parameters between the sparse vector representation and each of the sparse vector representations included in the preset sparse vector library; determining, from the sparse vector library and based on the second vector matching parameters, a second matching vector representation that matches the sparse vector representation; and obtaining the second search result according to the second matching vector representation.
The second vector matching parameter is used for representing the matching degree between the sparse vector representation of the content retrieval request and the sparse vector representation included in the sparse vector library, and specifically may include, but is not limited to, various metric parameters including similarity, vector distance and the like. The second matching vector representation is a vector representation determined from the sparse vector library that matches the sparse vector representation of the content retrieval request.
For example, the server may match the sparse vector representation of the content retrieval request against each sparse vector representation included in the sparse vector library by calculating a second vector matching parameter between them, where the second vector matching parameter may include at least one of a vector similarity and a vector distance. The server may determine, based on the second vector matching parameters, the vector representations in the sparse vector library that match the sparse vector representation of the content retrieval request, which gives the second matching vector representations. For example, the second vector matching parameter may include a vector similarity, and the server may determine a sparse vector representation in the sparse vector library whose vector similarity is greater than a similarity threshold as a second matching vector representation that matches the sparse vector representation of the content retrieval request. The server may obtain the second search result based on the determined second matching vector representations, for example by directly using them as the second search result. In some embodiments, the server may also filter the second matching vector representations further, for example by merging and sorting them, to obtain the second search result.
In this embodiment, the server calculates the second vector matching parameter to determine, from the sparse vector library, a second matching vector representation that matches the sparse vector representation of the content retrieval request, so that accurate matching of the sparse vector library can be achieved based on the second vector matching parameter, thereby improving accuracy of the retrieval result.
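A minimal sketch of sparse matching by similarity follows. It assumes cosine similarity as the second vector matching parameter and assumes that a higher similarity means a closer match, so candidates at or above the threshold are kept. Sparse vectors are stored as `{index: value}` dictionaries, and the function names and example data are hypothetical.

```python
import math


def sparse_cosine(a: dict[int, float], b: dict[int, float]) -> float:
    """Cosine similarity of two sparse vectors stored as {index: value}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector; only shared indices contribute
    dot = sum(value * b.get(index, 0.0) for index, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_sparse(
    query: dict[int, float],
    library: dict[str, dict[int, float]],
    min_similarity: float,
) -> list[tuple[str, float]]:
    """Keep candidates whose similarity reaches the threshold, best first."""
    scored = ((cid, sparse_cosine(query, vec)) for cid, vec in library.items())
    kept = [(cid, sim) for cid, sim in scored if sim >= min_similarity]
    return sorted(kept, key=lambda pair: -pair[1])


sparse_library = {"x": {1: 1.0, 5: 1.0}, "y": {2: 3.0}, "z": {1: 2.0}}
second_search_result = match_sparse({1: 1.0, 5: 1.0}, sparse_library, min_similarity=0.5)
```

Candidate `y` shares no non-zero index with the query, so its similarity is zero and it is dropped without any of its values being multiplied.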
In an exemplary embodiment, as shown in FIG. 3, determining the content retrieval result, that is, obtaining the content retrieval result corresponding to the content retrieval request based on the first search result and the second search result, includes steps 302 to 310:
Step 302, obtaining first search content based on a first search result and a first vector mapping relation, wherein the first vector mapping relation is a mapping relation between candidate content included in a high-dimensional vector library and corresponding high-dimensional vector representation.
The high-dimensional vector library comprises high-dimensional vector representations corresponding to candidate contents, and a first vector mapping relation can be established between the candidate contents and the corresponding high-dimensional vector representations, so that the corresponding candidate contents can be determined through the first vector mapping relation and the high-dimensional vector representations. The first search content is a corresponding search-derived candidate content determined based on the high-dimensional vector representation in the first search result.
Optionally, the server may obtain a first vector mapping relationship between the candidate content included in the high-dimensional vector library and the corresponding high-dimensional vector representations, which may be generated when the high-dimensional vector library is constructed. The server may determine the corresponding first search content based on the first search result and the first vector mapping relationship, e.g., the server may perform a candidate content query based on the high-dimensional vector representation in the first search result and the first vector mapping relationship, thereby determining the corresponding first search content.
Step 304, obtaining second search content based on a second search result and a second vector mapping relationship, wherein the second vector mapping relationship is a mapping relationship between candidate content included in the sparse vector library and corresponding sparse vector representation.
The sparse vector library comprises sparse vector representations corresponding to candidate contents, and a second vector mapping relation can be established between the candidate contents and the corresponding sparse vector representations, so that the corresponding candidate contents can be determined through the second vector mapping relation and the sparse vector representations. The second search content is a candidate content which is obtained by corresponding search and is determined based on sparse vector representation in the second search result.
Optionally, the server may obtain a second vector mapping relationship between the candidate content included in the sparse vector library and the corresponding sparse vector representations, which may be generated when the sparse vector library is constructed. The server may determine the corresponding second search content based on the second search result and the second vector mapping relationship, e.g., the server may perform a candidate content query based on the sparse vector representation in the second search result and the second vector mapping relationship, thereby determining the corresponding second search content.
Step 306, screening based on the first search content and the second search content to obtain preliminary screening content.
For example, the server may perform filtering based on the first search content and the second search content, such as the server performing filtering processing such as combining, deduplication, etc. on the first search content and the second search content, thereby obtaining the preliminary filtering content.
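Steps 302 to 306 can be sketched, by way of example only, as two lookups followed by a merge. In the sketch below, each vector mapping relationship is assumed to be a plain dictionary from a vector ID to a content ID; the function names and the ID format are hypothetical and are not taken from the application.

```python
from typing import Dict, Iterable, List


def lookup_content(vector_ids: Iterable[str], mapping: Dict[str, str]) -> List[str]:
    """Resolve retrieved vector IDs to candidate content IDs via a mapping relation."""
    return [mapping[v] for v in vector_ids if v in mapping]


def merge_dedup(first: Iterable[str], second: Iterable[str]) -> List[str]:
    """Combine both retrieved-content lists, keeping the first occurrence of each item."""
    return list(dict.fromkeys([*first, *second]))
```

Keeping the first occurrence preserves the order in which content was retrieved, which is one simple choice among several possible screening strategies.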
Step 308, determining a correlation parameter between the preliminary screening content and the content retrieval request, and determining the target content from the preliminary screening content based on the correlation parameter.
The correlation parameter is used for representing the degree of correlation between the preliminary screening content and the content retrieval request, so that whether the preliminary screening content obtained by retrieval meets the retrieval requirement of the content retrieval request can be reflected based on the correlation parameter. The target content is content obtained by further screening the preliminary screening content based on the correlation parameter.
Optionally, there may be at least one piece of preliminary screening content, and the server may traverse the pieces of preliminary screening content and match each of them with the content retrieval request to determine the correlation parameter between that piece of preliminary screening content and the content retrieval request. For example, the server may construct a correlation analysis model in advance, and may input the preliminary screening content and the content retrieval request into the correlation analysis model for correlation analysis, so that the correlation analysis model outputs the corresponding correlation parameter. The server may further screen the pieces of preliminary screening content based on their respective correlation parameters, thereby determining the target content. For example, a correlation threshold may be preset, and the server may determine the preliminary screening content whose correlation parameter is greater than the correlation threshold as the retrieved target content.
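The threshold-based screening of step 308 can be sketched as follows. The correlation analysis model is represented by an arbitrary scoring callable; `token_overlap` is only a toy stand-in (Jaccard overlap of tokens) used to make the sketch runnable, and is not the model described in the application. All names are hypothetical.

```python
from typing import Callable, List, Tuple


def select_targets(request: str,
                   candidates: List[str],
                   relevance: Callable[[str, str], float],
                   threshold: float) -> List[Tuple[str, float]]:
    """Score every candidate against the request and keep those above the threshold."""
    scored = [(c, relevance(request, c)) for c in candidates]
    kept = [(c, s) for c, s in scored if s > threshold]
    return sorted(kept, key=lambda p: p[1], reverse=True)


def token_overlap(request: str, content: str) -> float:
    """Toy stand-in for a correlation analysis model: Jaccard overlap of tokens."""
    a, b = set(request.lower().split()), set(content.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0
```

In practice the scoring callable would wrap a trained model; the selection logic around it stays the same.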
Step 310, performing description conversion on the target content through the large language model to obtain a content retrieval result corresponding to the content retrieval request.
The large language model (Large Language Models, LLM) is an important technology in the field of natural language processing (Natural Language Processing, NLP), is based on deep learning technology, particularly a Transformer architecture, realizes understanding and generation of natural language through training of large-scale text data, and is widely applied to various tasks such as text generation, machine translation, emotion analysis, question-answering systems and the like. The description conversion can be performed on the target content through the large language model, for example, the target content can be converted into the description in the natural language form, so that the content retrieval result corresponding to the content retrieval request can be obtained.
The server may invoke the large language model to perform description conversion on the target content. For example, the server may input the target content into the large language model, which performs description conversion on the target content according to a required format and outputs the content retrieval result corresponding to the content retrieval request.
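One possible way to organize the description conversion of step 310 is sketched below. The large language model is represented by a generic callable that maps a prompt string to a text string; no particular model or vendor interface is implied, and the prompt wording and function names are assumptions made for illustration.

```python
from typing import Callable, List


def build_prompt(request: str, targets: List[str]) -> str:
    """Assemble a prompt that presents the target content as numbered passages."""
    context = "\n".join(f"[{i}] {t}" for i, t in enumerate(targets, start=1))
    return (
        "Answer the question using only the numbered passages.\n"
        f"Passages:\n{context}\n"
        f"Question: {request}\n"
        "Answer:"
    )


def describe(request: str, targets: List[str], llm: Callable[[str], str]) -> str:
    """Pass the assembled prompt to a caller-supplied language model callable."""
    return llm(build_prompt(request, targets))
```

Passing the model in as a callable keeps the sketch independent of any specific model interface.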
In this embodiment, the server obtains the first search content and the second search content according to respective mapping relationships based on the first search result and the second search result, filters based on the first search content and the second search content, determines the target content according to the relevance parameter, and performs description conversion on the target content through the large language model to obtain the content search result corresponding to the content search request, thereby optimizing the content search result and improving the accuracy of the content search result.
In an exemplary embodiment, the high-dimensional vector representation corresponding to the candidate content included in the high-dimensional vector library is obtained by mapping the candidate content in a high-dimensional vector mapping manner, and the sparse vector representation corresponding to the candidate content included in the sparse vector library is obtained by mapping the candidate content in a sparse vector mapping manner. Constructing the high-dimensional vector representation and the sparse vector representation respectively based on the content retrieval request includes mapping the content retrieval request in the high-dimensional vector mapping manner to obtain the high-dimensional vector representation corresponding to the content retrieval request, and mapping the content retrieval request in the sparse vector mapping manner to obtain the sparse vector representation corresponding to the content retrieval request.
The high-dimensional vector mapping mode is a mapping mode for mapping a target object into a corresponding high-dimensional vector representation, and can be implemented through a mapping algorithm, for example, an embedding mapping algorithm. The sparse vector mapping mode is a mapping mode for mapping a target object into a corresponding sparse vector representation, and can likewise be implemented through a mapping algorithm, for example, an embedding mapping algorithm.
For example, for a high-dimensional vector representation corresponding to candidate contents in the high-dimensional vector library, the server may map the candidate contents in advance according to a high-dimensional vector mapping manner for each candidate content, and for a sparse vector representation corresponding to candidate contents in the sparse vector library, the server may map the candidate contents in advance according to a sparse vector mapping manner for each candidate content. For the content retrieval request, the server can map according to a high-dimensional vector mapping mode and a sparse vector mapping mode, so that high-dimensional vector representation and sparse vector representation corresponding to the content retrieval request are obtained.
In this embodiment, the server maps the candidate content and the content retrieval request respectively by using a high-dimensional vector mapping mode and a sparse vector mapping mode, so that vector representations can be constructed based on the same mapping mode to perform matching retrieval, and the accuracy of content retrieval can be ensured.
The application also provides an application scene, which applies the content retrieval method. Specifically, the application of the content retrieval method in the application scene is as follows:
The application scene relates to the technical field of information retrieval and natural language processing, and in particular to a knowledge base retrieval enhancement method in a retrieval-augmented generation system for large models. The method combines information retrieval with document library design to achieve efficient management and retrieval of massive documents, and supports processing tens of millions of documents while returning efficient, accurate, and highly relevant content.
With the rapid development of big data and artificial intelligence technology, knowledge base systems are increasingly used in information management and knowledge acquisition. Conventional knowledge base retrieval systems have efficiency and accuracy problems when dealing with large-scale documents. Particularly when facing millions or even tens of millions of documents, retrieval systems often run into performance bottlenecks, resulting in long response times and inaccurate results. Furthermore, while deep-learning-based natural language generation models, such as GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers), demonstrate powerful capabilities in generating natural language text, the content generated by these models tends to lack relevance and accuracy without contextual support.
To make up for the shortcomings of conventional retrieval systems and generation models, the content retrieval method provided by the application supports knowledge base retrieval enhancement for tens of millions of documents. Knowledge base retrieval enhancement improves retrieval efficiency, accuracy, and relevance by improving and optimizing the information retrieval algorithms and strategies in the knowledge base system, so that the most relevant information can be rapidly located and extracted from massive documents. By incorporating efficient information retrieval, the content retrieval method provided by the application can quickly locate related information in a large-scale knowledge base and generate high-quality natural language answers based on that information. The method not only improves the retrieval efficiency and generation quality of the system, but also expands the application range of the knowledge base system, meeting users' requirements for efficient, accurate, and intelligent information services.
The content retrieval method provided by the application supports knowledge base retrieval enhancement for tens of millions of documents, addresses the performance bottlenecks and inaccurate retrieval results that arise in the prior art when processing large-scale documents, and improves the relevance and accuracy of natural language generation. With this method, a user can obtain a more efficient, accurate, and intelligent knowledge retrieval service when working with tens of millions of documents, which significantly improves the user experience and the application effect. The information retrieved from the knowledge base serves as an external information source that helps the LLM generate more accurate, context-aware results while reducing hallucinations.
In the content retrieval method provided by the application, as shown in fig. 4, in the document warehousing process, a user can upload a document, and the uploaded document can be stored in object storage. The server can receive and parse the document uploaded by the user from the object storage, supporting multiple file formats (such as PDF, Word, and TXT), and can perform optical character recognition (OCR) on scanned documents to convert image content into retrievable text, finally returning the parsed text in JSON (JavaScript Object Notation) form. After the data tables of the database and the vector library are created, the text content is segmented, cutting each long document into a plurality of text fragments. The text fragments are then embedded with a pre-trained Embedding model to implement vectorization, converting each text fragment into a high-dimensional vector representation or a sparse vector representation. Further, the text vector data are inserted in batches into a vector table in the vector library, and the text data are inserted into a database table: the segmented data, the original information of the document, and the corresponding vector IDs are stored in the database, establishing the correspondence between documents and vectors. This achieves efficient management of the data and completes the warehousing of the document.
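The segmentation and bookkeeping part of the warehousing flow can be illustrated with the minimal sketch below. It assumes fixed-size character windows with an optional overlap and an in-memory dictionary standing in for the database table; the embedding step is omitted, and the `doc_id#n` vector ID format and all function names are hypothetical.

```python
from typing import Dict, List


def split_text(text: str, size: int, overlap: int = 0) -> List[str]:
    """Cut text into fixed-size character windows with optional overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    stop = max(len(text) - overlap, 1)  # avoid a trailing window that adds nothing new
    return [text[i:i + size] for i in range(0, stop, step)]


def ingest(doc_id: str, text: str, size: int, table: Dict[str, dict]) -> List[str]:
    """Store each fragment with its source document under a vector ID; return the IDs."""
    ids = []
    for n, fragment in enumerate(split_text(text, size)):
        vector_id = f"{doc_id}#{n}"  # hypothetical ID scheme linking vector and document
        table[vector_id] = {"doc_id": doc_id, "text": fragment}
        ids.append(vector_id)
    return ids
```

The stored vector IDs are what later allows a vector search hit to be resolved back to the original document fragment.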
Further, for ANNS (approximate nearest neighbor search), the vector table may employ different indexing strategies to store data of different magnitudes. An ANNS algorithm quickly finds the points closest to a query point in a large-scale dataset. Because an exact nearest neighbor search in high-dimensional data is very time-consuming and computationally complex, approximate nearest neighbor search significantly increases search speed by sacrificing some accuracy. As shown in fig. 5, the document library may be classified into an own document library and a client self-built document library. The client self-built document library is further divided into a hundred-million-level document library and a ten-thousand-level document library. For the hundred-million-level client self-built document library, considering that a user will not store a large amount of data at one time, two actual vector libraries are adopted, namely a hundred-million-level vector library and a ten-thousand-level vector library. The ten-thousand-level vector library receives incrementally uploaded documents, and when its size reaches 10,000, the incremental documents are moved into the hundred-million-level vector library.
In the document retrieval flow, after a user inputs a query, the question is vectorized using the Embedding model to obtain a high-dimensional vector representation and a sparse vector representation of the question. A sparse vector is a vector in which most elements are zero. Sparse vectors are very common in computer science, machine learning, and data processing, because they can efficiently represent and process large-scale sparse data. An ANNS search is performed for the high-dimensional vector, and a BM25 (an information retrieval algorithm) or sparse matrix search is performed for the sparse vector; the original document information is then fetched from the database according to the retrieved IDs. In the vector search process, multi-way recall is performed on the information retrieved by the various strategies, that is, the semantic representation provided by vector search is combined with the exact query results of sparse vectors for full-text search, thereby realizing hybrid search. The search results are preliminarily screened and recalled to ensure high relevance. The retrieval results are then re-ranked by a ReRanker (a re-ranking algorithm), improving the accuracy of the results and the user experience. Finally, the re-ranked results undergo final de-duplication and sorting, and the retrieved highly relevant document fragments can be sent to the LLM to complete the natural language answer of the large model.
In particular, as shown in FIG. 5, vector retrieval uses an approximate nearest neighbor search (ANNS) algorithm. Compared with exact retrieval, which is typically very time-consuming, the key idea of ANNS is to no longer insist on returning the most accurate result, but to search only among the neighbors of the target. The ANNS algorithm quickly finds the approximate nearest neighbor vectors of the query point by constructing an index structure. For the similarity between vectors, the inner product (IP) is selected to calculate the distance between two vectors, which can be expressed as follows:
$$d(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{n} a_i b_i$$
where $d(\mathbf{a}, \mathbf{b})$ is the distance between the two vectors $\mathbf{a} = (a_1, \ldots, a_n)$ and $\mathbf{b} = (b_1, \ldots, b_n)$, and $n$ is the dimension of the vectors.
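For illustration only, the inner product measure can be paired with an exhaustive linear scan, which is the behaviour attributed to a FLAT index in this description. The sketch below is a plain Python stand-in and does not use any particular vector database; the function names and data layout are assumptions.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products of two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def flat_search(query: Sequence[float],
                vectors: Dict[str, Sequence[float]],
                k: int) -> List[Tuple[str, float]]:
    """Exhaustively score every stored vector and return the k best, highest first."""
    scored = ((vid, inner_product(query, vec)) for vid, vec in vectors.items())
    return heapq.nlargest(k, scored, key=lambda p: p[1])
```

Because every stored vector is scored, no true neighbor can be missed; the cost grows linearly with the size of the library, which is why such a scan suits only small libraries.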
For the ANNS index, a FLAT index and a SCANN or HNSW index are selected. The FLAT index linearly scans the data points to find the nearest neighbors of the query point, which guarantees a recall rate of 100%. The SCANN index divides a large-scale dataset into a plurality of easily processed subsets through a hierarchical index structure and an efficient search strategy; it greatly improves search efficiency with little loss of precision, maintains a good recall rate, and is suitable for large-scale data scenarios. The HNSW index, through its hierarchical structure and small-world network characteristics, provides higher accuracy while remaining efficient, and is also suitable for large-scale data scenarios. Thus, vector libraries can be divided into the following categories:
The own document library supports the storage and retrieval of hundred-million-level data, is maintained by an administrator, and provides users with a large volume of prefabricated document data. By adopting a SCANN or HNSW index, it supports efficient and accurate similar-vector retrieval;
The client self-built document library is divided into a hundred-million-level client self-built document library and a ten-thousand-level client self-built document library;
For the hundred-million-level client self-built document library, considering that users will not store a large amount of data at one time, two actual vector libraries are used: a hundred-million-level vector library and a ten-thousand-level vector library. The ten-thousand-level vector library adopts a FLAT index to guarantee a 100% recall rate and stores the files most recently uploaded by the user, so that the latest file content can always be retrieved accurately and accurate content is provided to clients. Its size is designed to be 10,000 vectors to keep retrieval efficient; when it reaches 10,000 entries, the existing data is moved into the hundred-million-level vector library, completing the incremental update. Note that the ten-thousand-level vector library and the hundred-million-level vector library are searched simultaneously, and the results are then combined, de-duplicated, and sorted;
For clients whose data volume is small, a ten-thousand-level client self-built document library is provided. It adopts a FLAT index, which is convenient for client uploading and searching and provides a 100% recall rate.
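The incremental update between the small and large vector libraries can be sketched as follows. This is a simplified, in-memory illustration under stated assumptions: the class name `TieredLibrary` is hypothetical, the capacity defaults to 10,000 as described above, and the large library is scanned exhaustively here only to keep the sketch self-contained, whereas the description uses a SCANN or HNSW index for it.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def inner_product(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


class TieredLibrary:
    """Small library for recent uploads plus a large library for migrated data."""

    def __init__(self, capacity: int = 10_000) -> None:
        self.capacity = capacity
        self.small: Dict[str, Vector] = {}  # recent uploads, scanned exhaustively
        self.large: Dict[str, Vector] = {}  # stand-in for the ANN-indexed library

    def add(self, vector_id: str, vector: Vector) -> None:
        """Insert into the small library; migrate everything once it is full."""
        self.small[vector_id] = vector
        if len(self.small) >= self.capacity:
            self.large.update(self.small)
            self.small.clear()

    def search(self, query: Vector, k: int) -> List[Tuple[str, float]]:
        """Search both libraries, merge by ID (de-duplicating), and sort by score."""
        scores: Dict[str, float] = {}
        for tier in (self.large, self.small):
            for vid, vec in tier.items():
                scores[vid] = inner_product(query, vec)
        ranked = sorted(scores.items(), key=lambda p: p[1], reverse=True)
        return ranked[:k]
```

Searching both libraries on every query is what keeps freshly uploaded documents visible before they are migrated.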
Vector ID information is obtained by searching the vector library, and the original document content corresponding to the retrieved vectors is then returned from the database.
Further, the content retrieval method provided by the application supports multi-way recall combining ANNS with BM25/sparse matrix search. The recalled results are merged and de-duplicated, coarse ranking is performed to obtain TopN (first N) candidate results, and ReRanker re-ranking is then performed to obtain TopK (first K, K < N) final results, which improves the recall rate of the retrieval results while ensuring accuracy. As shown in fig. 6, when the user submits a question, the question is embedded, and approximate search and sparse search are respectively performed based on the database and the vector library to obtain respective database query results, yielding the multi-way recall results. The server may then coarse-rank and re-rank (ReRanker) the multi-way recall results to obtain the final content retrieval result.
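The recall, coarse ranking, and re-ranking stages can be sketched as two small functions. The sketch assumes that the scores of the recall routes have already been normalized to a comparable scale, and it keeps the best score per document when merging; both are simplifying assumptions. The re-ranking model is represented by a caller-supplied scoring callable, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Hits = List[Tuple[str, float]]  # (document ID, recall score)


def coarse_rank(routes: List[Hits], n: int) -> List[str]:
    """Merge recall routes, keep each document's best score, return the TopN IDs."""
    best: Dict[str, float] = {}
    for hits in routes:
        for doc_id, score in hits:
            best[doc_id] = max(score, best.get(doc_id, float("-inf")))
    return sorted(best, key=lambda d: best[d], reverse=True)[:n]


def rerank(query: str,
           doc_ids: List[str],
           scorer: Callable[[str, str], float],
           k: int) -> List[str]:
    """Re-score the TopN candidates with a finer model and return the TopK IDs."""
    return sorted(doc_ids, key=lambda d: scorer(query, d), reverse=True)[:k]
```

A typical call first runs `coarse_rank` over the ANNS and sparse routes with a generous N, then passes the result to `rerank` with a smaller K.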
In particular, ANNS is mainly used for approximate nearest neighbor search over high-dimensional vectors, and can quickly find the vectors closest to the query vector in a large vector space. Assuming a query vector $q$ and a document vector set $D$, the ANNS score is calculated as:
$$S_{\mathrm{ANNS}}(q, d_i) = \frac{q \cdot d_i}{\lVert q \rVert \, \lVert d_i \rVert}$$
where $S_{\mathrm{ANNS}}(q, d_i)$ is the ANNS score between the vector $q$ and the vector $d_i$ in the document vector set $D$, and $\lVert \cdot \rVert$ is the L2 norm. The vectors used for ANNS are high-dimensional dense vectors, which can capture deep, non-linear feature relationships.
Sparse vectors use vector embedding to represent words or phrases, where most elements are zero and each non-zero element indicates the presence of a particular word. Sparse vectors are superior to dense vectors in knowledge search, keyword perception, and interpretability. The score for sparse vectors can be calculated as follows:
$$S_{\mathrm{sparse}}(q, d_i) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, d_i)\,(k_1 + 1)}{f(t, d_i) + k_1 \left(1 - b + b \cdot \frac{|d_i|}{\mathrm{avgdl}}\right)}$$
where $S_{\mathrm{sparse}}(q, d_i)$ is the sparse vector score between the vector $q$ and the vector $d_i$ in the document vector set $D$, $f(t, d_i)$ is the number of occurrences of the term $t$ in the document vector $d_i$, $|d_i|$ is the length of the document vector $d_i$, $\mathrm{avgdl}$ is the average length of the document collection, $k_1$ and $b$ are adjustment parameters, and $\mathrm{IDF}(t)$ is the calculated inverse document frequency (IDF) of the term $t$.
Sparse vector search generally uses sparse representation methods such as term frequency-inverse document frequency (TF-IDF) to measure the relevance of documents by calculating the degree of matching between query terms and document terms.
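As a worked illustration of term-based scoring of the BM25 kind mentioned in this description, the sketch below scores tokenized documents against a tokenized query. The specific inverse document frequency formula, `log(1 + (N - n + 0.5) / (n + 0.5))`, and the default values `k1 = 1.5` and `b = 0.75` are common choices assumed here for illustration; they are not specified in the application.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str],
                docs: List[List[str]],
                k1: float = 1.5,
                b: float = 0.75) -> List[float]:
    """Return one BM25-style score per document for the given query terms."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents that share no term with the query receive a score of zero, which reflects the keyword-exact character of sparse retrieval.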
By combining ANNS with BM25/sparse matrix search, the advantages of both can be fully utilized, achieving more efficient and accurate retrieval. The candidates obtained through the multiple recall paths are combined and de-duplicated, and the combined candidate documents are sorted according to predefined weights.
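Sorting combined candidates according to predefined weights can be done in several ways; the sketch below shows one simple possibility, a weighted sum of per-route scores. It assumes that each route's scores are already on a comparable scale and that a document missing from a route contributes zero for that route. The route names and the function name are hypothetical.

```python
from typing import Dict, List, Tuple


def weighted_fusion(routes: Dict[str, Dict[str, float]],
                    weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Combine per-route scores into one weighted score per document, best first."""
    fused: Dict[str, float] = {}
    for route, hits in routes.items():
        w = weights.get(route, 0.0)  # routes without a weight are ignored
        for doc_id, score in hits.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + w * score
    return sorted(fused.items(), key=lambda p: p[1], reverse=True)
```

Because scores are accumulated per document ID, a document returned by both routes appears only once in the output.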
The ReRanker performs fine ranking on the candidate results obtained in the preliminary retrieval stage to improve the quality of the final retrieval result. It can use more complex features and models to capture finer correlations between questions and documents. Therefore, after coarse ranking is completed and the TopN candidate results are obtained, the TopN results and the question (the query statement) are sent to the ReRanker deep learning model, which captures the complex relationship between the question and the candidate results and calculates relevance scores to obtain the TopK (first K) results. Finally, the TopK results undergo final de-duplication and sorting as post-processing, and the retrieval results are input into the LLM to obtain the final answer to the question.
The content retrieval method provided by this embodiment supports knowledge base retrieval enhancement for tens of millions of documents, and achieves efficient management and retrieval of massive documents by combining information retrieval with document library design. The design of different document libraries and multiple index strategies supports efficient management of tens of millions of documents. In addition, in the multi-way recall and screening of the method, both dense vectors and sparse vectors are retrieved, and coarse ranking and re-ranking are performed, which provides a high recall rate for the retrieval results. The method supports hundred-million-level document management, achieving fast warehousing and efficient retrieval of massive documents through efficient indexing and retrieval techniques. Different index combinations are adopted according to the scale of the document library and the application scenario, ensuring both retrieval efficiency and a high recall rate. Furthermore, multiple search algorithms together with coarse ranking and re-ranking ensure high relevance and accuracy of the retrieval results.
It should be understood that, although the steps in the flowcharts related to the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the steps are not strictly limited in their order of execution and may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a content retrieval device for realizing the content retrieval method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in one or more embodiments of the content retrieval device provided below may refer to the limitation of the content retrieval method described above, and will not be repeated here.
In one exemplary embodiment, as shown in FIG. 7, a content retrieval device 700 is provided, comprising a retrieval request acquisition module 702, a high-dimensional vector retrieval module 704, a sparse vector retrieval module 706, and a retrieval result acquisition module 708, wherein:
A search request acquisition module 702, configured to acquire a content search request, and construct a high-dimensional vector representation and a sparse vector representation based on the content search request, respectively;
the high-dimensional vector retrieval module 704 is configured to retrieve in a preset high-dimensional vector library based on an approximate retrieval manner according to the high-dimensional vector representation, so as to obtain a first retrieval result;
The sparse vector retrieval module 706 is configured to retrieve in a preset sparse vector library according to a sparse vector representation to obtain a second retrieval result;
the search result obtaining module 708 is configured to obtain a content search result corresponding to the content search request based on the first search result and the second search result.
In an exemplary embodiment, the high-dimensional vector library includes at least two content vector libraries, the high-dimensional vector retrieval module 704 is further configured to determine respective approximate retrieval algorithms of the at least two content vector libraries according to respective database sizes of the at least two content vector libraries, perform retrieval according to the high-dimensional vector representation based on the respective approximate retrieval algorithms of the at least two content vector libraries to obtain respective retrieval results of the at least two content vector libraries, and obtain a first retrieval result according to the respective retrieval results of the at least two content vector libraries.
In an exemplary embodiment, the high-dimensional vector search module 704 is further configured to, for a target content vector library of the at least two content vector libraries, calculate first vector matching parameters between the high-dimensional vector representation and a high-dimensional vector representation included in the target content vector library according to an approximate search algorithm corresponding to the target content vector library, determine a first matching vector representation matching the high-dimensional vector representation from the target content vector library according to the first vector matching parameters, and obtain respective search results of the at least two content vector libraries according to the first matching vector representations determined from the at least two content vector libraries, respectively.
In an exemplary embodiment, the sparse vector retrieval module 706 is further configured to calculate second vector matching parameters between the sparse vector representation and sparse vector representations included in the preset sparse vector library, determine a second matching vector representation matching the sparse vector representation from the sparse vector library based on the second vector matching parameters, and obtain a second retrieval result according to the second matching vector representation.
In an exemplary embodiment, the search result obtaining module 708 is further configured to obtain a first search content based on the first search result and a first vector mapping relationship, where the first vector mapping relationship is a mapping relationship between candidate content included in the high-dimensional vector library and a corresponding high-dimensional vector representation, obtain a second search content based on the second search result and the second vector mapping relationship, where the second vector mapping relationship is a mapping relationship between candidate content included in the sparse vector library and a corresponding sparse vector representation, perform screening based on the first search content and the second search content to obtain a preliminary screening content, determine a correlation parameter between the preliminary screening content and a content search request, determine a target content from the preliminary screening content based on the correlation parameter, and perform description conversion on the target content through a large language model to obtain a content search result corresponding to the content search request.
In an exemplary embodiment, the high-dimensional vector representation corresponding to the candidate content included in the high-dimensional vector library is obtained by mapping the candidate content in a high-dimensional vector mapping manner, the sparse vector representation corresponding to the candidate content included in the sparse vector library is obtained by mapping the candidate content in a sparse vector mapping manner, the search request obtaining module 702 is further configured to map the content search request in a high-dimensional vector mapping manner to obtain the high-dimensional vector representation corresponding to the content search request, and map the content search request in a sparse vector mapping manner to obtain the sparse vector representation corresponding to the content search request.
The modules in the above content retrieval device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.
In one exemplary embodiment, a computer device is provided, which may be a server, and whose internal structure may be as shown in FIG. 8. The computer device includes a processor, a memory, an input/output (I/O) interface, and a communication interface. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing data involved in the content retrieval method. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a content retrieval method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 8 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, may combine some of the components, or may have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) involved in the present application are all information and data authorized by the user or fully authorized by all parties, and the collection, use, and processing of the related data must comply with the relevant regulations.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer-readable storage medium; when executed, the computer program may carry out the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile memory and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database. The processor referred to in the embodiments provided herein may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or an artificial intelligence (AI) processor.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of the present application.
The foregoing examples illustrate only a few embodiments of the application. They are described specifically and in detail, but should not be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of the application shall be determined by the appended claims.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411782174.8A CN119669443A (en) | 2024-12-05 | 2024-12-05 | Content retrieval method, device, computer equipment and readable storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN119669443A | 2025-03-21 |
Family
ID=94992178
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411782174.8A (Pending), CN119669443A (en) | 2024-12-05 | 2024-12-05 | Content retrieval method, device, computer equipment and readable storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119669443A (en) |
- 2024-12-05: CN application CN202411782174.8A, published as CN119669443A; status: active, pending
Similar Documents
| Publication | Title |
|---|---|
| Liu et al. | Robust and scalable graph-based semisupervised learning |
| KR20200078576A (en) | Methods for calculating relevance, devices for calculating relevance, data query devices, and non-transitory computer-readable storage media |
| CN100530192C (en) | Text searching method and device |
| KR101467707B1 (en) | Method for instance-matching in knowledge base and device therefor |
| CN119066172A (en) | Question and answer processing method, device, computer equipment, readable storage medium and program product |
| CN120929585A (en) | Retrieval method, device, computer equipment and medium based on static word embedding |
| CN117009599A (en) | Data retrieval methods, devices, processors and electronic equipment |
| CN120162400A (en) | A large-scale model-based sewage treatment question-and-answer method, equipment, medium and product |
| CN120353939A (en) | Engineering knowledge hybrid retrieval method, device, equipment, medium and product |
| CN115238117A (en) | Image retrieval method based on attention fusion local super feature and global feature |
| CN109472282B (en) | Depth image hashing method based on few training samples |
| CN117992575A (en) | Text matching method, device, computer equipment, storage medium, program product |
| CN117743549A (en) | Information query method, device and computer equipment |
| CN117435685A (en) | Document retrieval methods, devices, computer equipment, storage media and products |
| CN115129871B (en) | Text category determining method, apparatus, computer device and storage medium |
| Hou et al. | Remote sensing image retrieval with deep features encoding of Inception V4 and largevis dimensionality reduction |
| CN115238034A (en) | Data search method, apparatus, computer equipment and storage medium |
| CN120123491A (en) | A semantic retrieval method, device, equipment and storage medium based on reordering |
| CN117648495B (en) | Data pushing method and system based on cloud primary vector data |
| US12229179B1 (en) | Mitigating bias in multimodal models via query transformation |
| CN119669443A (en) | Content retrieval method, device, computer equipment and readable storage medium |
| CN119848250A (en) | Text classification, clustering and retrieval method, device and computer equipment |
| CN117390013A (en) | Data storage methods, retrieval methods, systems, equipment and storage media |
| CN107944045A (en) | Image search method and system based on t distribution Hash |
| JP2011159100A (en) | Successive similar document retrieval apparatus, successive similar document retrieval method and program |
Legal Events
| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |