CN120123491A

CN120123491A - A semantic retrieval method, device, equipment and storage medium based on reordering

Info

Publication number: CN120123491A
Application number: CN202510194607.6A
Authority: CN
Inventors: 蒋涉权; 张志彪
Original assignee: Bigo Technology Pte Ltd
Current assignee: Bigo Technology Pte Ltd
Priority date: 2025-02-21
Filing date: 2025-02-21
Publication date: 2025-06-10

Abstract

Embodiments of the application provide a semantic retrieval method, device, equipment and storage medium based on reordering. In the technical solution provided by the embodiments, first matching scores between the query text and the dense vectors of a plurality of structured data in a vector library, and second matching scores between the query text and the sparse vectors of the plurality of structured data, are determined; the first matching scores and the second matching scores are fused to obtain a fusion score corresponding to each structured data; a first number of candidate data are screened from the structured data according to the fusion scores; semantic processing is performed on the candidate data to obtain semantic data corresponding to each candidate data; the first number of candidate data are reordered according to the query text and the semantic data; and a second number of retrieval results are screened from the first number of candidate data according to the reordering result. This balances retrieval efficiency against accuracy and improves the semantic retrieval effect.

Description

Semantic retrieval method, device, equipment and storage medium based on reordering

Technical Field

The embodiment of the application relates to the technical field of computers, in particular to a semantic retrieval method, device and equipment based on reordering and a storage medium.

Background

With the development of artificial intelligence and big data technology, semantic retrieval has become an important direction in the field of information retrieval; examples include semantic retrieval schemes based on sparse vectors and schemes based on dense vectors. A sparse-vector scheme achieves good retrieval results on short text or explicit keyword queries. A dense-vector scheme maps text to a high-dimensional vector space, which better captures the semantic information of the text and gives better retrieval results on long text queries.

However, the sparse-vector scheme has poor recall quality when it must capture semantic information or handle long text or ambiguity, and the dense-vector scheme has a poor ability to understand fine-grained semantics, so the semantic retrieval effect is poor.

Disclosure of Invention

The embodiment of the application provides a semantic retrieval method, a semantic retrieval device, semantic retrieval equipment and a storage medium based on reordering, which are used for solving the technical problems of poor recall quality and semantic fine granularity understanding capability of semantic retrieval and poor semantic retrieval effect in the related technology, and can effectively improve the recall quality and semantic fine granularity understanding capability of semantic retrieval and improve the semantic retrieval effect.

In a first aspect, an embodiment of the present application provides a semantic retrieval method based on reordering, including:

acquiring a query text, and determining first matching scores of dense vectors of a plurality of structured data in a vector library and the query text, and second matching scores of sparse vectors of the plurality of structured data and the query text;

determining fusion scores corresponding to the plurality of structured data according to the first matching scores and the second matching scores, and screening a first number of candidate data from the plurality of structured data according to the fusion scores;

carrying out semantic processing on the candidate data to obtain semantic data corresponding to each candidate data;

and reordering the first number of candidate data according to the query text and the semantic data, and screening a second number of retrieval results from the first number of candidate data according to the reordering result.

In a second aspect, an embodiment of the present application provides a semantic retrieval apparatus based on reordering, including a vector matching module, a candidate matching module, a semantic processing module, and a reordering module, wherein:

The vector matching module is configured to acquire a query text, and determine first matching scores of dense vectors of a plurality of structured data in a vector library and the query text, and second matching scores of sparse vectors of the plurality of structured data and the query text;

the candidate matching module is configured to determine fusion scores corresponding to the plurality of structured data according to the first matching score and the second matching score, and screen a first number of candidate data from the plurality of structured data according to the fusion scores;

the semantic processing module is configured to perform semantic processing on the candidate data to obtain semantic data corresponding to each candidate data;

The reordering module is configured to reorder the first number of candidate data according to the query text and the semantic data, and screen a second number of retrieval results from the first number of candidate data according to the reordering result.

In a third aspect, embodiments of the present application provide reordering-based semantic retrieval equipment comprising a memory and one or more processors;

The memory is used for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the reordering-based semantic retrieval method as described in the first aspect.

In a fourth aspect, embodiments of the present application provide a non-volatile storage medium storing computer-executable instructions which, when executed by a computer processor, are used to perform the reordering-based semantic retrieval method according to the first aspect.

In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program stored in a computer readable storage medium, from which at least one processor of a device reads and executes the computer program, causing the device to perform the reordering-based semantic retrieval method according to the first aspect.

In the embodiments of the application, first matching scores between the query text and the dense vectors of a plurality of structured data in the vector library, and second matching scores between the query text and the sparse vectors of the plurality of structured data, are determined; the first matching scores and the second matching scores are fused to obtain the fusion scores corresponding to the plurality of structured data; a first number of candidate data are screened from the plurality of structured data according to the fusion scores; the candidate data are semantically processed to obtain the semantic data corresponding to each candidate data; the first number of candidate data are reordered according to the query text and the semantic data; and a second number of retrieval results are screened from the first number of candidate data according to the reordering result. By combining hybrid search with reordering and introducing a fusion scoring mechanism for the sparse and dense vectors in the hybrid search, the recall quality of semantic retrieval and the ability to understand fine-grained semantics are effectively improved, retrieval efficiency is balanced against semantic retrieval accuracy, and the semantic retrieval effect is improved.

Drawings

FIG. 1 is a flow chart of a semantic retrieval method based on reordering provided by an embodiment of the present application;

FIG. 2 is a flow chart of another reorder-based semantic retrieval method provided by an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a semantic retrieval apparatus based on reordering according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of reordering-based semantic retrieval equipment according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the following detailed description of specific embodiments of the present application is given with reference to the accompanying drawings. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the matters related to the present application are shown in the accompanying drawings. Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently, or at the same time. Furthermore, the order of the operations may be rearranged. The above-described process may be terminated when its operations are completed, but may have additional steps not included in the drawings. The processes described above may correspond to methods, functions, procedures, subroutines, and the like.

The reordering-based semantic retrieval method provided by the application aims to combine hybrid search with reordering and to introduce a fusion scoring mechanism for the sparse and dense vectors in the hybrid search, thereby improving the recall quality of semantic retrieval and the ability to understand fine-grained semantics, balancing retrieval efficiency against precision, and improving the semantic retrieval effect.

Conventional retrieval methods generally rely on keyword matching, for example search algorithms based on sparse vectors. These perform well on short text or explicit keyword queries but have significant drawbacks in capturing semantic information and in handling long text or ambiguity. With the rapid development of large models, dense vector retrieval based on deep neural networks (such as BERT and RoBERTa) and similarity calculation has emerged; by mapping text to a high-dimensional vector space, it better captures the semantic information of the text. However, dense vector retrieval is inefficient on large-scale indexes, and dense-vector retrieval schemes are limited in their ability to understand fine-grained semantics. Hybrid search combines the advantages of sparse and dense vectors, considering keyword matching while also capturing the semantic information of the text; in practical applications, however, the recall quality of hybrid search is poor when fine-grained requirements apply. In addition, retrieval schemes based on a ranking model, such as a large pre-trained language model built on the Transformer architecture, and in particular a model using an interactive similarity scoring structure (Cross-Encoder), improve ranking precision and take fine-grained semantics into account, but their computational cost is very high. Recalling accurate data in large-data-volume scenarios is too slow, so such schemes are difficult to apply to efficient retrieval over massive data and can only be used for small-scale data retrieval. On this basis, the reordering-based semantic retrieval method is provided to solve the technical problems of poor recall quality, poor ability to understand fine-grained semantics, and poor semantic retrieval effect in existing semantic retrieval.

FIG. 1 shows a flowchart of a semantic retrieval method based on reordering provided by an embodiment of the present application. The method may be performed by a reordering-based semantic retrieval apparatus, which may be implemented in hardware and/or software and integrated in reordering-based semantic retrieval equipment.

The following description will be made taking, as an example, a semantic retrieval method based on reordering performed by a semantic retrieval apparatus based on reordering. Referring to fig. 1, the semantic retrieval method based on reordering includes:

S110, acquiring a query text, and determining first matching scores of dense vectors of a plurality of structured data in a vector library and the query text, and second matching scores of sparse vectors of the plurality of structured data and the query text.

The vector library provided by the application stores the dense vectors and sparse vectors corresponding to a plurality of structured data (such as tables). The dense vectors and sparse vectors may be obtained by performing embedding processing on the structured data, and each structured data corresponds to its own dense vector and sparse vector.

Illustratively, a query text for querying data is obtained. The query text may be entered by a user and may be written in natural language or in a standard language used to manage and operate relational databases (e.g., SQL); an example of natural language text is "wish to obtain details or statistics of the XX Table".

In one embodiment, after the query text is obtained, a first matching score for dense vectors of the plurality of structured data in the vector library corresponding to the query text and a second matching score for sparse vectors of the plurality of structured data corresponding to the query text may be determined. Wherein the first matching score may be used to reflect the degree of matching of the dense vector to the query text and the second matching score may be used to reflect the degree of matching of the sparse vector to the query text.

S120, determining fusion scores corresponding to the plurality of structured data according to the first matching scores and the second matching scores, and screening a first number of candidate data from the plurality of structured data according to the fusion scores.

For example, for the first matching score and the second matching score corresponding to each structured data, fusion processing may be performed on the first matching score and the second matching score to obtain fusion scores corresponding to each structured data, where the fusion scores may be used to reflect the matching degree of the structured data and the query text.

Optionally, the fusion processing may multiply, add, or take a weighted sum of the first matching score and the second matching score; or it may first amplify and/or reduce the first matching score and the second matching score and then multiply, add, or take a weighted sum of the processed scores; or it may fuse the first matching score and the second matching score through a preset fusion function (for example, a nonlinear function).

In one embodiment, after the fusion score corresponding to each structured data is determined, a first number of candidate data may be screened from the plurality of structured data according to the fusion scores; for example, the structured data are sorted by fusion score from large to small and the top first number of structured data are used as the candidate data.
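The screening in S120 can be sketched as follows. This is a minimal illustration, assuming the simplest of the fusion options listed above (a weighted sum); the function name, the weights and the sample scores are hypothetical and do not come from the application.

```python
import heapq

def select_candidates(dense_scores, sparse_scores, first_number,
                      w_dense=0.5, w_sparse=0.5):
    """Fuse per-item scores by weighted sum and keep the best `first_number` items."""
    fused = {
        item_id: w_dense * dense_scores[item_id]
        + w_sparse * sparse_scores.get(item_id, 0.0)
        for item_id in dense_scores
    }
    # nlargest returns (item_id, score) pairs sorted from large to small score.
    return heapq.nlargest(first_number, fused.items(), key=lambda kv: kv[1])

# Hypothetical first (dense) and second (sparse) matching scores for three tables.
dense = {"table_a": 0.9, "table_b": 0.4, "table_c": 0.7}
sparse = {"table_a": 0.2, "table_b": 0.8, "table_c": 0.6}
candidates = select_candidates(dense, sparse, first_number=2)
```

Using a heap-based top-k avoids sorting the whole library when the first number is much smaller than the number of structured data.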

S130, carrying out semantic processing on the candidate data to obtain semantic data corresponding to each candidate data.

The semantical processing is performed on the determined first number of candidate data to obtain semantical data corresponding to each candidate data. For example, according to the characteristics of the candidate data, descriptive text corresponding to the candidate data can be generated, and context semantics can be supplemented in the descriptive text to obtain semantic data, so that the integrity and fluency of the structured data expression can be effectively enhanced.

It should be explained that structured data lacks sufficiently explicit semantic relationships and has insufficient semantic relevance, whereas query text in natural language is typically unstructured. The application therefore converts the structured data into descriptive text, so that the model can understand the semantic relationship between the structured data and natural-language text and bridge the semantic gap between them. Optionally, the conversion may be performed by a large language model (LLM), or by preconfiguring descriptions of individual fields and of the relationships between fields according to empirical rules. For example, the field "UID" in the structured data may be converted to "records the unique UID information that identifies the user, stored as a long integer type"; the field "chinese name" may be converted to "records the Chinese description information of the table, strongly correlated with the table data content"; and a dependency relationship between field A and field B may be converted to "field A depends on field B".
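The rule-based variant of this conversion can be sketched as follows. It is only an illustration under stated assumptions: the rule table `FIELD_DESCRIPTIONS`, the function `describe_table` and the table name `user_profile` are hypothetical, and the LLM-based variant is not shown.

```python
# Hypothetical rule table: field name -> preconfigured natural-language description.
FIELD_DESCRIPTIONS = {
    "UID": "records the unique UID information that identifies the user, "
           "stored as a long integer type",
    "chinese name": "records the Chinese description information of the table, "
                    "strongly correlated with the table data content",
}

def describe_table(table_name, fields, dependencies=()):
    """Turn a table schema into descriptive text (the semantic data)."""
    sentences = [f"Table {table_name} has {len(fields)} fields."]
    for field in fields:
        description = FIELD_DESCRIPTIONS.get(field, "has no configured description")
        sentences.append(f"Field {field} {description}.")
    # Relationship rules: each pair (a, b) means field a depends on field b.
    for field_a, field_b in dependencies:
        sentences.append(f"Field {field_a} depends on field {field_b}.")
    return " ".join(sentences)

text = describe_table(
    "user_profile",
    ["UID", "chinese name"],
    dependencies=[("chinese name", "UID")],
)
```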

S140, reordering the first number of candidate data according to the query text and the semantic data, and screening a second number of retrieval results from the first number of candidate data according to the reordering result.

Illustratively, the first number of candidate data are reordered according to the query text and the semantic data, and a second number of retrieval results are screened from the first number of candidate data according to the reordering result. The higher the matching degree between the semantic data and the query text, the higher the corresponding candidate data ranks in the reordering result, and the second number of top-ranked structured data may be used as the retrieval results.

Optionally, the query text and the semantic data may be input into a trained ranking model, which analyzes and processes them to obtain the reordering result of the first number of candidate data. For example, a pre-trained ranking model based on a Cross-Encoder structure may be used to capture the semantic relationships between the query text and the descriptive text (i.e., the semantic data) and to generate a precise reordering result on that basis.
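The reordering step can be sketched with a pluggable scoring function. In the sketch below the scorer is a simple token-overlap placeholder, not a Cross-Encoder; in practice `score_fn` would wrap the trained ranking model. All names and sample texts are hypothetical.

```python
def token_overlap_score(query, text):
    """Placeholder scorer (NOT a Cross-Encoder): share of query tokens found in text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & text_tokens) / len(query_tokens)

def rerank(query, semantic_data, second_number, score_fn=token_overlap_score):
    """Score each (query, descriptive text) pair and keep the top `second_number`."""
    scored = [(cid, score_fn(query, text)) for cid, text in semantic_data.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:second_number]

# Hypothetical semantic data for three candidate tables.
semantic_data = {
    "orders": "table orders stores order amount and order status for each user",
    "users": "table users stores user profile details such as name and age",
    "logs": "table logs stores raw service access records",
}
results = rerank("user order status", semantic_data, second_number=2)
```

Because only the first number of candidates reach this step, the expensive pairwise scoring runs over a small set rather than over the whole vector library.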

In the above method, first matching scores between the query text and the dense vectors of the plurality of structured data in the vector library, and second matching scores between the query text and the sparse vectors of the plurality of structured data, are determined; the first matching scores and the second matching scores are fused to obtain the fusion scores corresponding to the plurality of structured data; a first number of candidate data are screened from the plurality of structured data according to the fusion scores; semantic processing is performed on the candidate data to obtain the semantic data corresponding to each candidate data; the first number of candidate data are reordered according to the query text and the semantic data; and a second number of retrieval results are screened from the first number of candidate data according to the reordering result. By combining hybrid search with reordering and introducing a fusion scoring mechanism for the sparse and dense vectors in the hybrid search, the recall quality of semantic retrieval and the ability to understand fine-grained semantics are effectively improved, retrieval efficiency is balanced against semantic retrieval accuracy, and the semantic retrieval effect is improved.

On the basis of the above embodiment, fig. 2 shows a flowchart of another semantic retrieval method based on reordering, which is a specific implementation of the semantic retrieval method based on reordering. Referring to fig. 2, the semantic retrieval method based on reordering includes:

S210, converting the structured data into dense vectors through a trained first vector extraction model.

S220, performing word segmentation processing on the structured data to obtain a plurality of data words, determining importance scores of the data words, performing word segmentation screening processing on the structured data according to the importance scores, and converting the structured data subjected to the word segmentation screening processing into sparse vectors through a trained second vector extraction model.

S230, storing the dense vector and the sparse vector into a vector library.

Illustratively, structured data is acquired and input into a trained first vector extraction model (e.g., an embedding model), which maps the structured data to a high-dimensional vector space representation and thereby converts the structured data into dense vectors. The first vector extraction model may be a vector extraction model built on a convolutional neural network, a recurrent neural network, a neural network based on a self-attention mechanism, or the like.

In one embodiment, word segmentation is performed on each structured data to obtain a plurality of data words corresponding to each structured data (for example, the structured data is divided into a plurality of field words), and importance scores of each data word are determined, where the importance scores of the data words can be determined according to word frequencies (TF) and/or Inverse Document Frequencies (IDFs) corresponding to the data words, and the importance scores can be used to reflect importance degrees of the data words in the structured data.

In one possible embodiment, determining the importance score of each data word comprises: determining the word frequency and the inverse document frequency of the data word, and multiplying the word frequency by the inverse document frequency to obtain the importance score of the data word.

The word frequency represents how often the data word occurs in the structured data and reflects the relative importance of the data word in the document. Optionally, the word frequency of the t-th data word may be determined by the following formula:

The inverse document frequency represents how prevalent the data word is across the whole data set (the structured data set). Optionally, the inverse document frequency of the t-th data word may be determined by the following formula:

Wherein N is the total number of data words of the structured data set, N_t is the number of structured data containing the t-th data word, and 1 is added to avoid 0 as the denominator in the formula.
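Under the conventional TF-IDF definitions and the symbol definitions above, the two formulas can be written as follows. This is a hedged reconstruction rather than the application's exact expressions: the occurrence count c_t (the number of times the t-th data word occurs in the structured data) is hypothetical notation, the base of the logarithm is not specified in the text, and N is read here as the number of structured data in the set, as in the conventional definition.

```latex
\mathrm{TF}(t) = \frac{c_t}{\sum_{k} c_k},
\qquad
\mathrm{IDF}(t) = \log \frac{N}{N_t + 1}
```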

In one embodiment, the word frequency and the inverse document frequency of the data word may be multiplied to obtain the importance score of the data word, i.e., the importance score TF-IDF(t) = TF(t) × IDF(t). Because the importance score is obtained by multiplying the word frequency of the data word by its inverse document frequency, it correctly reflects the importance of the data word within the whole structured data, which effectively improves the accuracy of the word segmentation screening processing and the quality of the sparse vectors.

After the importance scores of the data words are determined, word segmentation screening processing may be performed on the structured data according to the importance scores; for example, data words whose importance scores are lower than a preset score threshold are filtered out of the structured data. Optionally, a preset domain dictionary (such as a stop word list) may be used to filter out irrelevant field words among the data words.
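The scoring and screening described above can be sketched as follows. The sketch assumes the conventional TF-IDF definitions with 1 added to the denominator of the inverse document frequency, a natural logarithm, and already-segmented input; the function name, threshold and sample data are hypothetical.

```python
import math
from collections import Counter

def tfidf_filter(tokenized_items, threshold, stop_words=frozenset()):
    """Drop stop words and data words whose TF-IDF score is below `threshold`."""
    total_items = len(tokenized_items)
    # Number of structured-data items that contain each data word.
    doc_freq = Counter(tok for tokens in tokenized_items for tok in set(tokens))
    filtered = []
    for tokens in tokenized_items:
        counts = Counter(tokens)
        kept = []
        for tok in tokens:
            tf = counts[tok] / len(tokens)
            # 1 is added so the denominator can never be 0.
            idf = math.log(total_items / (doc_freq[tok] + 1))
            if tok not in stop_words and tf * idf >= threshold:
                kept.append(tok)
        filtered.append(kept)
    return filtered

# Hypothetical field words of four tables after word segmentation.
items = [
    ["user", "id", "name", "table"],
    ["order", "id", "amount", "table"],
    ["log", "id", "time", "table"],
    ["user", "order", "status", "table"],
]
filtered = tfidf_filter(items, threshold=0.05, stop_words={"time"})
```

Words that occur in nearly every item, such as "id" and "table" here, receive a score at or below zero and are removed, which is the redundancy reduction the screening step is meant to provide.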

After the word segmentation screening processing of each structured data is completed, the screened structured data may be input into a trained second vector extraction model (e.g., an embedding model), which maps it to a high-dimensional vector space representation and thereby converts the screened structured data into sparse vectors. In one embodiment, after the dense vector and the sparse vector corresponding to the structured data are obtained, they may be saved to the vector library. The second vector extraction model may be a vector extraction model built on a convolutional neural network, a recurrent neural network, a neural network based on a self-attention mechanism, or the like. The first vector extraction model and the second vector extraction model may be the same neural network model or different neural network models.

In this scheme, structured data are converted into dense vectors; word segmentation is performed on the structured data to obtain a plurality of data words; importance scores of the data words are determined; word segmentation screening processing is performed on the structured data according to the importance scores; and the screened structured data are converted into sparse vectors. The dense vectors and sparse vectors of the structured data are thus extracted accurately, and for the sparse vectors the influence of redundant field words on recall precision and speed is reduced on the basis of statistical information. This achieves an optimal balance between recall precision and speed for structured data and adapts effectively to complex scenarios such as long-text relevance assessment and ambiguity handling.

S240, acquiring the query text, and determining first matching scores of dense vectors of the plurality of structured data in the vector library and the query text, and second matching scores of sparse vectors of the plurality of structured data and the query text.

In one possible embodiment, determining the first matching scores between the query text and the dense vectors of the plurality of structured data in the vector library, and the second matching scores between the query text and the sparse vectors of the plurality of structured data, comprises: calculating similarity information between the query text and the dense vectors of the plurality of structured data in the vector library, and determining the first matching scores according to the similarity information; and calculating, by means of an inverted index, the second matching scores between the query text and the sparse vectors of the plurality of structured data in the vector library.

For example, after the query text is obtained, similarity information between the query text and the dense vector of each structured data in the vector library may be calculated (e.g., the similarity is determined according to cosine distance, Euclidean distance, Hamming distance, or the like), and the first matching score may be determined according to the similarity information, wherein the greater the degree of similarity reflected by the similarity information, the greater the first matching score. Optionally, the similarity information between the dense vectors and the query text may be calculated based on the HNSW (Hierarchical Navigable Small World) algorithm, and the first matching score may be determined from the similarity information.
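A minimal sketch of the first matching score, assuming cosine similarity and assuming the query text has already been embedded into `query_vector` by the same model as the library vectors. It scans every vector exhaustively; an HNSW index would approximate the same ranking without the full scan. All names and vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def first_matching_scores(query_vector, dense_vectors):
    """Brute-force scoring of every dense vector in the library against the query."""
    return {
        item_id: cosine_similarity(query_vector, vector)
        for item_id, vector in dense_vectors.items()
    }

# Hypothetical two-dimensional dense vectors for three tables.
dense_vectors = {
    "table_a": [1.0, 0.0],
    "table_b": [0.0, 1.0],
    "table_c": [1.0, 1.0],
}
scores = first_matching_scores([1.0, 0.0], dense_vectors)
```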

For the sparse vector of each structured data in the vector library, the second matching score between the query text and the sparse vectors of the plurality of structured data may be calculated by means of an inverted index based on the BM25 algorithm (a text retrieval algorithm based on a probabilistic model). Because the first matching score is determined from the similarity information between the dense vector and the query text, and the second matching score is calculated by means of an inverted index, the matching degree of the dense and sparse vectors with the query text is accurately reflected, and the accuracy of semantic retrieval is improved.
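The inverted-index scoring can be sketched as follows. This is an illustration under stated assumptions: it works directly on segmented data words rather than on model-produced sparse vectors, uses the common parameters k1 = 1.5 and b = 0.75, and uses a smoothed inverse document frequency that stays non-negative. None of these choices is specified by the application, and all names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(tokenized_items):
    """Map each data word to {item_id: number of occurrences in that item}."""
    index = defaultdict(dict)
    for item_id, tokens in tokenized_items.items():
        for token, count in Counter(tokens).items():
            index[token][item_id] = count
    return index

def bm25_scores(query_tokens, tokenized_items, index, k1=1.5, b=0.75):
    """BM25 score per item; only items sharing a word with the query are visited."""
    if not tokenized_items:
        return {}
    n_items = len(tokenized_items)
    avg_len = sum(len(t) for t in tokenized_items.values()) / n_items
    scores = defaultdict(float)
    for token in query_tokens:
        postings = index.get(token, {})
        # Smoothed IDF variant (an assumption) that cannot become negative.
        idf = math.log(1 + (n_items - len(postings) + 0.5) / (len(postings) + 0.5))
        for item_id, tf in postings.items():
            length_norm = 1 - b + b * len(tokenized_items[item_id]) / avg_len
            scores[item_id] += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return dict(scores)

# Hypothetical field words of three tables.
items = {
    "orders": ["order", "id", "amount"],
    "users": ["user", "id", "name"],
    "logs": ["log", "time", "id"],
}
index = build_inverted_index(items)
scores = bm25_scores(["order", "amount"], items, index)
id_scores = bm25_scores(["id"], items, index)
```

The inverted index is what keeps this step cheap: items that share no word with the query are never touched.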

S250, determining fusion scores corresponding to the plurality of structured data according to the first matching scores and the second matching scores, and screening a first number of candidate data from the plurality of structured data according to the fusion scores.

In a possible embodiment, determining the fusion scores corresponding to the plurality of structured data according to the first matching score and the second matching score comprises: performing logarithmic scaling on the first matching score to obtain a scaling result; performing exponential amplification on the second matching score to obtain an amplification result; and performing weighted summation on the scaling result and the amplification result to obtain the fusion score corresponding to the structured data.

Illustratively, after the first matching score and the second matching score corresponding to the structured data are obtained, the first matching score may be subjected to logarithmic scaling to obtain a scaling result, the second matching score may be subjected to exponential amplification to obtain an amplification result, and the scaling result and the amplification result may be subjected to weighted summation to obtain a fusion score corresponding to the structured data.

It should be noted that, in the related art, sparse and dense vectors are usually fused either by simple linear weighting or by sorting each result independently and then merging. Simple linear weighting cannot make full use of the nonlinear characteristics of the sparse and dense vectors, so the precision of the fused result is limited. Independent sorting followed by merging lacks deep synergy between the sparse and dense vectors, so the fusion effect is poor and optimal recall quality cannot be achieved. The distribution characteristics of the two kinds of vectors also differ: sparse vectors are highly sparse, with the score values at some positions far higher than those at other positions, whereas dense vectors are generally evenly distributed with small differences. Moreover, the contributions of the different vector spaces are nonlinear. For example, maxima in the sparse vector may have a disproportionate influence, so their value range may be compressed by logarithmic scaling, while dense vectors may be exponentially amplified to improve discrimination. The logarithmic scaling of the first matching score and the exponential amplification of the second matching score may be implemented by nonlinear functions so as to dynamically adjust the influence of the sparse and dense vectors, for example by strengthening the influence of the sparse vector on short queries (the BM25 algorithm is typically more sensitive to short queries) and amplifying the effect of the dense vector on long queries (vector models are better at capturing semantic similarity for long queries).
According to the application, the fusion score is obtained by performing weighted summation on the scaling result of the first matching score and the amplification result of the second matching score, so that the influence of the sparse vector and the dense vector can be adjusted dynamically, the different advantages of the dense vector and the sparse vector are fully exploited, and the ability of the search results to match both semantic relevance and textual match is improved.
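The scaling, amplification and weighted summation described above can be sketched as follows. This is a minimal illustration rather than the application's own formula: the function name `fuse_scores`, the use of `log1p` and `expm1` as the concrete nonlinear functions, the default weights, and the assumption that both scores are non-negative and roughly normalized are all choices made for the example.

```python
import math


def fuse_scores(first_score: float, second_score: float,
                w_scaled: float = 0.5, w_amplified: float = 0.5) -> float:
    """Weighted sum of a log-scaled first score and an exp-amplified second score.

    Assumption: both scores are non-negative (and ideally normalized to a
    small range such as [0, 1]) so that expm1 does not blow up.
    """
    if first_score < 0 or second_score < 0:
        raise ValueError("scores are assumed to be non-negative")
    scaling_result = math.log1p(first_score)          # compresses large values
    amplification_result = math.expm1(second_score)   # widens small differences
    return w_scaled * scaling_result + w_amplified * amplification_result


if __name__ == "__main__":
    print(fuse_scores(0.8, 0.6))
```

Because both transforms are monotonically increasing and map 0 to 0, a higher input score never lowers the fusion score, and the weights alone decide how much each side contributes.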

In a possible embodiment, when performing weighted summation on the scaling result and the amplification result to obtain the fusion score corresponding to the structured data, the reordering-based semantic retrieval method provided by the application may use preset weighting coefficients to perform weighted summation on the scaling result, the amplification result and preset service feature values. The preset service feature values may be configured for a plurality of service types. Optionally, the fusion score provided by the application may be determined based on the following formula:

Wherein ω₁, ω₂, …, ω_n are the preset weighting coefficients corresponding to the sparse vector, the dense vector, the preset service feature values and so on, s_sparse is the second matching score, s_dense is the first matching score, and x_i is the i-th preset service feature value. By fusing the sparse vector, the dense vector and the preset service feature values, the application incorporates service characteristics while fully exploiting the different advantages of the dense and sparse vectors, strengthens the role of important service information in retrieval recall, judges the importance of each piece of structured data in the service more accurately, and improves the ability to match semantic relevance and textual match at the same time.
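As an illustration of how service feature values might enter the weighted sum and how the first number of candidates could then be screened out, consider the sketch below. It takes the scaling result and the amplification result as already computed inputs. The names `fusion_score` and `select_candidates`, the ordering of the weight list, and the sample feature values are hypothetical and not taken from the application.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def fusion_score(scaling_result: float, amplification_result: float,
                 features: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the two transformed scores and the service features.

    Assumed layout: weights[0] and weights[1] apply to the two transformed
    scores, and weights[2:] apply to the service features in order.
    """
    if len(weights) != 2 + len(features):
        raise ValueError("expected one weight per score and per feature")
    total = weights[0] * scaling_result + weights[1] * amplification_result
    total += sum(w * x for w, x in zip(weights[2:], features))
    return total


def select_candidates(scores: Dict[str, float],
                      first_number: int) -> List[Tuple[str, float]]:
    """Return the first_number entries with the highest fusion scores."""
    return heapq.nlargest(first_number, scores.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    scores = {
        "table_a": fusion_score(0.4, 0.9, [1.0], [0.5, 0.4, 0.1]),
        "table_b": fusion_score(0.6, 0.2, [0.0], [0.5, 0.4, 0.1]),
    }
    print(select_candidates(scores, 1))
```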

S260, performing semantic processing on the candidate data to obtain semantic data corresponding to each candidate data.

S270, reordering the first number of candidate data according to the query text and the semantic data, and screening out a second number of search results from the first number of candidate data according to the reordering result.
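Steps S260 and S270 can be pictured with the following sketch, which renders each candidate into a text description and then reorders the candidates by a relevance function. Everything concrete here is an assumption made for illustration: the candidate fields (`table`, `description`, `columns`), the helper names, and in particular `token_overlap`, which is only a simple stand-in for whatever trained reordering model an implementation would use.

```python
from typing import Callable, Dict, List, Tuple


def to_semantic_text(candidate: Dict) -> str:
    """Render one structured candidate as a natural-language description."""
    columns = ", ".join(candidate["columns"])
    return f"Table {candidate['table']}: {candidate['description']}. Columns: {columns}."


def token_overlap(query: str, text: str) -> float:
    """Stand-in relevance score: fraction of query tokens found in the text."""
    q_tokens = set(query.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens) / len(q_tokens) if q_tokens else 0.0


def rerank(query: str, semantic_data: Dict[str, str],
           relevance: Callable[[str, str], float],
           second_number: int) -> List[Tuple[str, float]]:
    """Score every candidate against the query and keep the best second_number."""
    scored = [(cid, relevance(query, text)) for cid, text in semantic_data.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:second_number]


if __name__ == "__main__":
    candidates = {
        "t_orders": {"table": "orders", "description": "daily order amount per user",
                     "columns": ["user_id", "amount"]},
        "t_users": {"table": "users", "description": "user profile and country",
                    "columns": ["user_id", "country"]},
    }
    semantic = {cid: to_semantic_text(c) for cid, c in candidates.items()}
    print(rerank("daily order amount", semantic, token_overlap, 1))
```

Passing the relevance function as a parameter keeps the ranking logic independent of the model that produces the scores.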

In one possible embodiment, after the second number of search results are screened out from the first number of candidate data according to the reordering result, the reordering-based semantic retrieval method provided by the application further comprises: screening out a third number of target search results from the second number of search results, generating a structured query text according to the query text and the target search results, and performing a data query according to the structured query text to obtain a query result. The first number, the second number and the third number decrease in that order.

Illustratively, a third number of target search results are screened out from the second number of search results; for example, the top third number of search results in the reordering result of the second number of search results are used as the target search results. The target search results and the query text are input into a trained large language model, which analyzes them and outputs a structured query text (for example, a structured query text expressed in SQL), and a data query is performed with the structured query text to obtain the query result. Optionally, the large language model may further analyze the query result according to the query text to obtain a data analysis result and return the data analysis result to the user. By generating the structured query text from the query text and the target search results and performing the data query with it, the scheme returns an accurate query result to the user and improves the user experience.
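The flow from target search results to a query result can be sketched as below. This is only an illustration under stated assumptions: `generate_sql` is a hypothetical callable standing in for the trained large language model, the prompt layout in `build_prompt` is invented for the example, and an in-memory SQLite database stands in for the real data query system.

```python
import sqlite3
from typing import Callable, List, Sequence


def build_prompt(query_text: str, target_results: Sequence[str]) -> str:
    """Combine the query text and the target search results into one prompt."""
    tables = "\n".join(f"- {result}" for result in target_results)
    return f"Tables:\n{tables}\nQuestion: {query_text}\nSQL:"


def answer(query_text: str, reordered_results: Sequence[str], third_number: int,
           generate_sql: Callable[[str], str],
           conn: sqlite3.Connection) -> List[tuple]:
    """Keep the top third_number results, generate SQL, and run the query."""
    targets = list(reordered_results[:third_number])
    sql = generate_sql(build_prompt(query_text, targets))
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?)", [(10.0,), (20.0,)])

    # Stub in place of a real model call; a real system would send the prompt to the model.
    def fake_model(prompt: str) -> str:
        return "SELECT SUM(amount) FROM orders"

    print(answer("total order amount", ["orders(amount)", "users(id)"], 1, fake_model, conn))
```

In practice, generated SQL would typically be validated or run with restricted permissions before execution.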

For example, in a natural-language-text-to-structured-query-text (Text2SQL) scenario, the reordering-based semantic retrieval method provided by the application can help a user generate SQL from natural language and obtain the desired statistics or detail data through a data query system (such as an OLAP system, a technology for quickly querying and analyzing multidimensional data that is commonly used in data warehouses to help users perform complex data analysis from different angles). This reduces the cost for the user of learning SQL and lowers the required familiarity with the business information of the stored data. For example, an operations user who is not familiar with writing SQL and does not know which table (such as a Hive table) holds the desired data can enter the query text as a natural language description. The reordering-based semantic retrieval method provided by the scheme can then accurately locate the table in which the data is stored, generate and execute the SQL, return the data to the user, and further provide analysis such as data reports.

According to the method, the first matching scores between the dense vectors of the plurality of structured data in the vector library and the query text, and the second matching scores between the sparse vectors of the plurality of structured data and the query text, are determined; the first matching scores and the second matching scores are fused to obtain the fusion scores corresponding to the plurality of structured data; a first number of candidate data are screened out from the plurality of structured data according to the fusion scores; semantic processing is performed on the candidate data to obtain the semantic data corresponding to each candidate data; the first number of candidate data are reordered according to the query text and the semantic data; and a second number of search results are screened out from the first number of candidate data according to the reordering result. By combining hybrid search with reordering and introducing a fusion scoring mechanism for the sparse and dense vectors in the hybrid search, the recall quality of semantic retrieval and the fine-grained semantic understanding capability can be effectively improved, a balance between retrieval efficiency and retrieval accuracy is achieved, and the semantic retrieval effect is improved.
In addition, the structured data is converted into dense vectors; word segmentation is performed on the structured data to obtain a plurality of data word segments, the importance scores of the data word segments are determined, word segmentation screening is performed on the structured data according to the importance scores, and the screened structured data is converted into sparse vectors. In this way, the dense and sparse vectors of the structured data are extracted accurately, the influence of redundant field words on recall precision and speed is reduced for the sparse vectors on the basis of statistical information, an optimal balance between recall precision and speed is achieved for the structured data, and complex scenarios such as long-text relevance assessment and ambiguity handling are handled effectively.

Fig. 3 is a schematic structural diagram of a semantic retrieval apparatus based on reordering according to an embodiment of the present application. Referring to fig. 3, the reordering-based semantic retrieval apparatus includes a vector matching module 31, a candidate matching module 32, a semantic processing module 33, and a reordering module 34.

The vector matching module 31 is configured to acquire a query text and determine first matching scores between dense vectors of a plurality of structured data in a vector library and the query text, and second matching scores between sparse vectors of the plurality of structured data and the query text. The candidate matching module 32 is configured to determine fusion scores corresponding to the plurality of structured data according to the first matching scores and the second matching scores, and screen out a first number of candidate data from the plurality of structured data according to the fusion scores. The semantic processing module 33 is configured to perform semantic processing on the candidate data to obtain semantic data corresponding to each candidate data. The reordering module 34 is configured to reorder the first number of candidate data according to the query text and the semantic data, and screen out a second number of search results from the first number of candidate data according to the reordering result.

In one possible embodiment, the reordering-based semantic retrieval apparatus further comprises a vector generation module configured to:

convert the structured data into dense vectors through a trained first vector extraction model;

perform word segmentation on the structured data to obtain a plurality of data word segments, determine the importance score of each data word segment, perform word segmentation screening on the structured data according to the importance scores, and convert the screened structured data into sparse vectors through a trained second vector extraction model;

save the dense vectors and the sparse vectors to a vector library.

In one possible embodiment, when determining the importance score of each data word segment, the vector generation module is configured to:

determine the word frequency and the inverse document frequency of the data word segment, and multiply the word frequency by the inverse document frequency to obtain the importance score of the data word segment.
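The importance score and the subsequent screening can be illustrated with a small TF-IDF sketch. The function names, the unsmoothed `log(N / df)` form of the inverse document frequency, and the screening threshold are assumptions chosen for the example; the application only specifies that word frequency is multiplied by inverse document frequency.

```python
import math
from collections import Counter
from typing import Dict, List


def importance_scores(doc_tokens: List[str],
                      corpus: List[List[str]]) -> Dict[str, float]:
    """TF-IDF per token: word frequency multiplied by inverse document frequency.

    Assumption: doc_tokens is itself one of the documents in corpus, so every
    token has a document frequency of at least 1.
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    scores: Dict[str, float] = {}
    for token, count in counts.items():
        tf = count / len(doc_tokens)
        df = max(1, sum(1 for doc in corpus if token in doc))
        idf = math.log(n_docs / df)
        scores[token] = tf * idf
    return scores


def screen_tokens(doc_tokens: List[str], scores: Dict[str, float],
                  threshold: float) -> List[str]:
    """Drop tokens whose importance score falls below the threshold."""
    return [t for t in doc_tokens if scores.get(t, 0.0) >= threshold]


if __name__ == "__main__":
    corpus = [["id", "user", "name"], ["id", "order", "amount"]]
    scores = importance_scores(corpus[0], corpus)
    print(screen_tokens(corpus[0], scores, 0.1))
```

In this example the field word `id` appears in every document, so its inverse document frequency is zero and it is screened out, which mirrors the goal of removing redundant field words before building the sparse vector.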

In one possible embodiment, when determining the first matching scores between the dense vectors of the plurality of structured data in the vector library and the query text, and the second matching scores between the sparse vectors of the plurality of structured data and the query text, the vector matching module 31 is configured to:

calculate similarity information between the dense vectors of the plurality of structured data in the vector library and the query text, and determine the first matching scores according to the similarity information;

calculate, by means of an inverted index, the second matching scores between the sparse vectors of the plurality of structured data in the vector library and the query text.
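The two matching steps can be sketched as follows. The sketch assumes that the query text has already been encoded into a dense vector and a sparse term-weight vector, uses cosine similarity as the similarity information, and scores the sparse side with a dot product accumulated over an inverted index. These choices, and all function names, are illustrative assumptions rather than details stated in the application.

```python
import math
from collections import defaultdict
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_inverted_index(
        sparse_vectors: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Map each term to the documents containing it and their term weights."""
    index: Dict[str, Dict[str, float]] = defaultdict(dict)
    for doc_id, vector in sparse_vectors.items():
        for term, weight in vector.items():
            index[term][doc_id] = weight
    return index


def sparse_scores(query_vector: Dict[str, float],
                  index: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Accumulate query-weight times document-weight over the posting lists."""
    scores: Dict[str, float] = defaultdict(float)
    for term, query_weight in query_vector.items():
        for doc_id, doc_weight in index.get(term, {}).items():
            scores[doc_id] += query_weight * doc_weight
    return dict(scores)


if __name__ == "__main__":
    index = build_inverted_index({"t1": {"order": 2.0, "amount": 1.0},
                                  "t2": {"user": 1.5}})
    print(sparse_scores({"order": 1.0, "user": 2.0}, index))
    print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

Only the posting lists of terms that occur in the query are visited, which is what makes the inverted index efficient for highly sparse vectors.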

In one possible embodiment, when determining the fusion scores corresponding to the plurality of structured data according to the first matching scores and the second matching scores, the candidate matching module 32 is configured to:

perform logarithmic scaling on the first matching score to obtain a scaling result, and perform exponential amplification on the second matching score to obtain an amplification result;

perform weighted summation on the scaling result and the amplification result to obtain the fusion score corresponding to the structured data.

In one possible embodiment, when performing weighted summation on the scaling result and the amplification result to obtain the fusion score corresponding to the structured data, the candidate matching module 32 is configured to:

perform weighted summation on the scaling result, the amplification result and the preset service feature values to obtain the fusion score corresponding to the structured data.

In one possible embodiment, the fusion score is determined based on the following formula:

Wherein ω₁, ω₂, …, ω_n are preset weighting coefficients, s_sparse is the second matching score, s_dense is the first matching score, and x_i is a preset service feature value.

In one possible embodiment, the reordering-based semantic retrieval apparatus further comprises a query processing module configured to:

screen out a third number of target search results from the second number of search results, and generate a structured query text according to the query text and the target search results;

perform a data query according to the structured query text to obtain a query result.

It should be noted that, in the embodiment of the semantic retrieval apparatus based on reordering, each unit and module included are only divided according to the functional logic, but not limited to the above division, as long as the corresponding functions can be implemented, and in addition, specific names of each functional unit are only for facilitating mutual distinction, and are not used for limiting the protection scope of the embodiment of the present application.

The embodiment of the application also provides reordering-based semantic retrieval equipment, which can integrate the reordering-based semantic retrieval apparatus provided by the embodiment of the application. Fig. 4 is a schematic structural diagram of the reordering-based semantic retrieval equipment according to an embodiment of the present application. Referring to Fig. 4, the reordering-based semantic retrieval equipment includes an input device 43, an output device 44, a memory 42, and one or more processors 41. The memory 42 stores one or more programs which, when executed by the one or more processors 41, cause the one or more processors 41 to implement the reordering-based semantic retrieval method provided in the above embodiments. The reordering-based semantic retrieval apparatus, equipment and computer provided above can be used to execute the reordering-based semantic retrieval method provided by any of the embodiments, and have the corresponding functions and beneficial effects.

Embodiments of the present application also provide a non-volatile storage medium storing computer-executable instructions which, when executed by a computer processor, are used to perform the reordering-based semantic retrieval method provided by the above embodiments. Of course, the computer-executable instructions stored in the non-volatile storage medium provided in the embodiments of the present application are not limited to the reordering-based semantic retrieval method described above, and may also perform related operations in the reordering-based semantic retrieval method provided in any embodiment of the present application. The reordering-based semantic retrieval apparatus, equipment and storage medium provided in the above embodiments can execute the reordering-based semantic retrieval method provided in any embodiment of the present application; for technical details not described in the above embodiments, reference may be made to the reordering-based semantic retrieval method provided in any embodiment of the present application.

On the basis of the above embodiments, the present embodiment further provides a computer program product. The technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer program product is stored in a storage medium and includes several instructions that cause a computer device, a mobile terminal or a processor therein to execute all or part of the steps of the reordering-based semantic retrieval method provided in the embodiments of the present application.

Claims

1. A semantic retrieval method based on reordering, characterized by comprising:

Acquiring a query text, and determining first matching scores between dense vectors of a plurality of structured data in a vector library and the query text, and second matching scores between sparse vectors of the plurality of structured data and the query text;

Determining fusion scores corresponding to the plurality of structured data according to the first matching scores and the second matching scores, and screening out a first number of candidate data from the plurality of structured data according to the fusion scores;

Performing semantic processing on the candidate data to obtain semantic data corresponding to each of the candidate data;

Reordering the first number of candidate data according to the query text and the semantic data, and screening out a second number of search results from the first number of candidate data according to the reordering result.

2. The semantic retrieval method based on reordering according to claim 1, characterized in that before determining the first matching scores between the dense vectors of the plurality of structured data in the vector library and the query text, and the second matching scores between the sparse vectors of the plurality of structured data and the query text, the method further comprises:

Converting the structured data into dense vectors through a trained first vector extraction model;

Performing word segmentation on the structured data to obtain a plurality of data word segments, determining the importance score of each of the data word segments, performing word segmentation screening on the structured data according to the importance scores, and converting the screened structured data into sparse vectors through a trained second vector extraction model;

Saving the dense vectors and the sparse vectors to a vector library.

3. The semantic retrieval method based on reordering according to claim 2, characterized in that determining the importance score of each of the data word segments comprises:

Determining the word frequency and the inverse document frequency of the data word segment, and multiplying the word frequency by the inverse document frequency to obtain the importance score of the data word segment.

4. The semantic retrieval method based on reordering according to claim 1, characterized in that determining the first matching scores between the dense vectors of the plurality of structured data in the vector library and the query text, and the second matching scores between the sparse vectors of the plurality of structured data and the query text comprises:

Calculating similarity information between the dense vectors of the plurality of structured data in the vector library and the query text, and determining the first matching scores according to the similarity information;

Calculating, by means of an inverted index, the second matching scores between the sparse vectors of the plurality of structured data in the vector library and the query text.

5. The semantic retrieval method based on reordering according to claim 1, characterized in that determining the fusion scores corresponding to the plurality of structured data according to the first matching scores and the second matching scores comprises:

Performing logarithmic scaling on the first matching score to obtain a scaling result, and performing exponential amplification on the second matching score to obtain an amplification result;

Performing weighted summation on the scaling result and the amplification result to obtain the fusion score corresponding to the structured data.

6. The semantic retrieval method based on reordering according to claim 5, characterized in that performing weighted summation on the scaling result and the amplification result to obtain the fusion score corresponding to the structured data comprises:

Performing weighted summation on the scaling result, the amplification result and the preset service feature values to obtain the fusion score corresponding to the structured data.

7. The semantic retrieval method based on reordering according to claim 1, characterized in that the fusion score is determined based on the following formula:

Wherein ω₁, ω₂, …, ω_n are preset weighting coefficients, s_sparse is the second matching score, s_dense is the first matching score, and x_i is a preset service feature value.

8. The semantic retrieval method based on reordering according to claim 1, characterized in that after screening out the second number of search results from the first number of candidate data according to the reordering result, the method further comprises:

Screening out a third number of target search results from the second number of search results, and generating a structured query text according to the query text and the target search results;

Performing a data query according to the structured query text to obtain a query result.

9. A semantic retrieval device based on reordering, characterized by comprising a vector matching module, a candidate matching module, a semantic processing module and a reordering module, wherein:

The vector matching module is configured to acquire a query text, and determine first matching scores between dense vectors of a plurality of structured data in a vector library and the query text, and second matching scores between sparse vectors of the plurality of structured data and the query text;

The candidate matching module is configured to determine fusion scores corresponding to the plurality of structured data according to the first matching scores and the second matching scores, and screen out a first number of candidate data from the plurality of structured data according to the fusion scores;

The semantic processing module is configured to perform semantic processing on the candidate data to obtain semantic data corresponding to each of the candidate data;

The reordering module is configured to reorder the first number of candidate data according to the query text and the semantic data, and screen out a second number of search results from the first number of candidate data according to the reordering result.

10. Semantic retrieval equipment based on reordering, characterized by comprising: a memory and one or more processors;

The memory is used to store one or more programs;

When the one or more programs are executed by the one or more processors, the one or more processors implement the reordering-based semantic retrieval method according to any one of claims 1 to 8.

11. A non-volatile storage medium storing computer executable instructions, wherein the computer executable instructions are used to execute the reordering-based semantic retrieval method according to any one of claims 1 to 8 when executed by a computer processor.

12. A computer program product, comprising a computer program, characterized in that when the computer program is executed by a processor, the reordering-based semantic retrieval method according to any one of claims 1 to 8 is implemented.