CN117951251A

CN117951251A - An efficient code search method based on product quantization

Info

Publication number: CN117951251A
Application number: CN202410052881.5A
Authority: CN
Inventors: 印莹; 宫玉琪; 王驰; 徐久强; 赵宇海
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2024-01-15
Filing date: 2024-01-15
Publication date: 2024-04-30

Abstract

The present invention provides an efficient code search method based on product quantization, and relates to the field of code search technology. The method first obtains a code snippet and a natural language description corresponding to the code snippet; and pre-processes the obtained code snippet and the corresponding natural language description to extract multiple feature information of the code snippet; then constructs a code snippet feature vector; and quantizes the code snippet feature vector into a low-dimensional vector; and then pre-processes the natural language query input by the user to obtain a natural language sequence; and then constructs a natural language query vector; and dimensionalizes the natural language query vector; and then constructs a query table for each natural language query subvector; finally, finds the code snippet most similar to the natural language query vector in the code database; and uses rearrangement to perform secondary screening on the query results. The method can greatly improve the code search speed without sacrificing too much accuracy of the baseline code search model.

Description

Efficient code searching method based on product quantization

Technical Field

The invention relates to the technical field of code searching, in particular to a product quantization-based efficient code searching method.

Background

Code searching is a core task in the field of software engineering, and is defined as a natural language sentence of a given description program language, and a code segment with the highest similarity with natural language description can be found in a massive code segment database to return. With the continuous development of internet technology, software has been applied to aspects of human life, and the requirements of users on functions of the software are also increasing. In the process of software development, a program developer usually searches code segments with similar functions on the internet by using natural language sentences describing the code functions so as to reuse the code segments and improve the efficiency of software development. Along with the increase of the database scale of the open source codes on the network by exponential times, the process of software development can be greatly accelerated by improving the efficiency of code retrieval.

The traditional code searching technology regards source codes as plain texts, and returns code searching results by matching text similarity between codes and natural language queries one by one in a code dataset, and the code searching efficiency of the method is low because the code fragments contain text information, rich semantic grammar information and structured programming ideas. In order to fully understand the information in the source code, most research is currently done to embed the code fragments and natural language descriptions into the vector space using deep learning methods, and to return the results of the code search by comparing the cosine similarity of the code vector to the natural language description vector. Deep learning based code search models typically divide code search into two phases, a code segment and natural language query embedding into vector phase, and a high-dimensional vector retrieval phase, respectively.

Aiming at the code and natural language description embedding stage, since the code fragments contain rich code semantic information and structural feature information, the current research is to consider the characteristic information specific to various codes, such as code fragment method names, token mark sequences, code diagram structural information and the like, in the code feature mapping stage.

The Deep Code Search of paper proposes a Code embedding network CoNN for embedding source codes as vectors and a description embedding network DeNN for embedding natural language descriptions as vectors, in the Code embedding network CoNN, three aspects of feature information of the codes, function names, API call sequences and Token sequences are considered, the source codes of each method level are extracted, the Code feature information of the three aspects is extracted, corresponding feature vectors are generated, the three feature vectors are fused into feature vectors of Code fragments and mapped into a vector space, cosine similarity comparison is carried out between the feature vectors and the description vectors generated by DeNN, and the higher the similarity degree is, the more similar the semantics of the feature vectors and the description vectors are. According to the method, only text semantic features of codes are considered when the code segments are embedded into the code feature vectors, and abundant structural features of the codes are ignored, so that the code feature vectors do not contain comprehensive code feature information, and the accuracy of the code searching method is reduced.

In order to consider richer code structure information, the Chinese patent CN 115268869A uses a graph sequence converter G2SC to learn rich data dependency information and control dependency information in codes, obtains a program dependency graph sequence of the codes, and uses an attention mechanism to fuse the program dependency graph sequence characteristics with semantic characteristic information of other code fragments to obtain code characteristic vectors containing semantic characteristic information and structural characteristic information of rich source codes. The method can improve the code searching accuracy, but the code searching result is obtained by calculating the similarity degree between the natural language query vector and each code feature vector in the code library one by one when the code searching is executed, and the process has higher calculation cost and higher calculation complexity, so that the code searching time is longer.

For the high-dimensional vector retrieval stage, the nearest neighbor retrieval method of the high-dimensional vector is used when the similarity between the code vector and the natural language description vector is calculated at present, because the number of code fragments in a code database grows in an exponential level, the vector data set after being converted into the vector has huge scale, when accurate nearest neighbor retrieval is used, if the code fragment vector and the natural language query vector are both 512-dimensional vectors, 512 multiplications and 512 additions are needed when the similarity between the two vectors is calculated, the calculation complexity is too high, and the corresponding retrieval time is too long. Therefore, approximate nearest neighbor search is generally used to replace accurate nearest neighbor search, but in the process of high-dimensional vector search, the search accuracy of the approximate nearest neighbor search is lower, and current research can improve the speed of code search by using a method for reducing some search accuracy.

In the chinese patent CN 116701698a, a clustering-based vector retrieval technique is proposed, in which, the method maps the original data into feature vectors by acquiring the feature combination of the data, and performs clustering processing on the feature vectors to determine the category attribution of each feature vector, and compares the feature vector of the query with the center point of each category during the query, thereby determining the category attribution of the next query, and continuing the query in the category, and returning the query result. The method can reduce the search range, thereby reducing the number of times of vector calculation and reducing the calculation cost. When the method is used for classifying the high-dimensional vector, although the calculation complexity of the subsequent vector can be reduced, the calculation cost is reduced, and the code searching speed is improved, the problem of search result errors caused by classification errors is easily caused because the high-dimensional vector is directly classified due to more noise information and information irrelevant to classification characteristics in the high-dimensional vector, and the code searching precision is reduced.

The paper EFFICIENT K-nearest neighbor search based on clustering AND ADAPTIVE K values proposes a pre-calculated KD tree structure caKD + integrating a clustering algorithm and feature learning to solve the problem of low efficiency in nearest neighbor searching, wherein the clustering algorithm is used to reduce the range of query during searching and improve the code searching efficiency, and the method further reduces the possibility of false generation of search result due to classification of query results at cluster boundaries by enhancing a clustering method. Although the method reduces the searching range and considers the situation that the searching result is positioned at the cluster boundary point and possibly leads to searching cluster errors, the method uses a KD tree structure to occupy a large amount of index space, consumes a large amount of calculation cost when adding the cluster boundary point for each cluster, has better performance when processing low-dimensional vector searching, and reduces the searching performance when processing code searching high-dimensional vector searching problems.

A model for improving the code search speed by using a clustering algorithm and a hash technique is proposed in paper ACCELERATING CODE SEARCH WITH DEEP HASHING AND code classification, the model performs clustering processing on the high-dimensional vectors of the codes, maps the high-dimensional vectors in each class to a low-dimensional hash code representation by using the hash technique, maps the natural language query into the hash representation by using the hash technique when generating a natural language query, determines the possible classes of the search result by comparing with the class center point, and finds the code fragments similar to the query hash semantics in the corresponding classes, and returns the code fragments as the code search result. The model can promote the code searching speed, shorten the code searching time and reduce the code searching range. In the model, since the code segments have complex structural information and semantic information, various code features are considered when the code segments are converted into high-dimensional vectors so that the vector representation can contain rich semantic information and structural information of codes, but when the hash technology is used for mapping high-dimensional continuous vectors into low-dimensional hash codes, partial information loss can be caused, and for the code segments containing complex control flow information, data flow information and the like, the complex structural information is difficult to capture by deep hash, so that the code searching speed is greatly improved when the code searching model performance is evaluated, and the code searching accuracy is greatly reduced.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a high-efficiency code searching method based on product quantization, which improves the code searching speed, shortens the code searching time and realizes the high-efficiency code searching.

In order to solve the technical problems, the invention adopts the following technical scheme: an efficient code search method based on product quantization comprises the following steps:

step 1: acquiring a code segment and natural language description corresponding to the code segment; constructing a code segment set codebase and a natural language description set descbase by collecting an original code segment data set and natural language descriptions corresponding to the code segments;

Step 2: preprocessing the code segments and the corresponding natural language descriptions obtained in the step 1, and extracting various characteristic information of the code segments;

preprocessing the code segments and the natural language description, removing repeated code segments in the data set, and deleting the method body with annotation content less than the set word number; removing special characters and redundant fields in the natural language description of each code segment; extracting function method name information, API information, token mark information and various characteristic information of a program dependency graph of the code segment;

Dividing function method names in the code fragments into a plurality of word combinations by adopting a hump naming rule; analyzing the source code file into an abstract syntax tree, and extracting API information in the code fragments by using an abstract syntax tree method of a traversing method body; in order to extract token mark information, a keyword list is established, and a StanfordCore NLP method is used for word segmentation of the code fragments in combination with related constraint rules, so that code token mark information is obtained; extracting node information and data flow information in the codes, converting the codes into graph structure representation, and generating a program dependency graph sequence of the codes by using a depth-first traversal mode so as to obtain program dependency graph characteristic information of the codes;

Step 3: constructing a code segment feature vector; acquiring multiple items of code feature information and natural language description information generated in the step 2, converting the multiple items of code feature information into corresponding code feature sequences, learning semantic feature information and structural feature information in different feature sequences by using a deep learning model, mapping the code feature sequences into feature vectors, and fusing the multiple feature vectors to obtain feature vectors V _C of code segments to obtain a code segment feature vector set V _C;

Step 3-1: converting function method name information extracted from the code segment into a method name sequence MSeq, converting API information extracted from the code segment into an API sequence APISeq, converting Token mark information extracted from the code segment into a Token mark sequence TSeq, converting a program dependency graph of the code segment into a corresponding graph sequence GPDGSeq, and performing word segmentation on the preprocessed natural language description obtained from the step 2 by using a word segmentation tool to obtain a natural language description sequence DSeq corresponding to each code segment;

Step 3-2: respectively learning various semantic features and structural features in the code segments by using the current 4 code search models, fusing feature vectors of different features, and converting the code segments into corresponding high-dimensional feature vectors v _C;

training a code search model by using the code segment feature vectors and the corresponding natural language description vectors, calculating a loss function, and optimizing model parameters to obtain a final code segment feature vector set;

The 4 code search models used include a code search algorithm DeepCS model and a GSMM model based on multi-feature fusion, and a code search algorithm CodeBert model and a GraphCodeBert model based on a pre-training model;

Step 4: quantizing the high-dimensional feature vector V _C of each code segment in the code segment feature vector set V _C obtained in step 3 into a low-dimensional vector q _C representation; the process needs to go through two stages, namely a code segment feature vector data training stage and a code segment feature vector data quantization stage;

Step 4-1: in a code segment feature vector data training stage, dividing each code segment feature vector v _C obtained in the step 3 into M low-dimensional sub-vector segments, and respectively storing the M low-dimensional sub-vector segments in different subspaces; i.e. for one code segment feature vector v _C∈R^MD, it is segmented into M sub-vectors of the same length D, i.e. v _C＝[v_C1,v_C2,...,v_CM ], wherein the i-th sub-vector v _Ci of v _C is stored in the i-th subspace, i=1, 2, M; k-means clustering is carried out on the sub-vectors in each subspace during training, and the classification parameters are optimized through continuous training of the data set, so that an optimal group of clustering center points in each subspace is obtained; the class center points of the clusters are represented by an ID number, the center points are also called code points, and the cluster center sets in each subspace are code point sets; the process of constructing the code point set is as follows;

(1) Initializing a clustering center: clustering is carried out in each subspace, and K clustering centers are initialized;

(2) Assigning sub-vector data: for a certain subspace, assigning all the subvectors in the subspace to the nearest cluster center according to the cluster center determined in the step (1);

(3) Training to update data centers: calculating a loss function, and updating a clustering center by using an average value of all sub-vectors distributed to the certain clustering center;

(4) Checking convergence: after updating the clustering center, if the reduction amplitude of the loss function is lower than a preset threshold value, or the change of the clustering center is smaller than the preset threshold value, or the updating iteration number reaches one of three conditions of a preset maximum iteration number, stopping updating, wherein the obtained clustering center is an optimal group of clustering centers in each subspace;

(5) Constructing a code point set: the cluster center set in each subspace is a code point set corresponding to the code fragment data set;

Step 4-2: in the code segment feature vector data quantization stage, each code segment feature vector v _C is quantized into a short vector q _c with M dimensions by using the code point set obtained in the step 4-1;

Obtaining M segments of low-dimensional subvectors v _Ci generated in the step 4-1 after feature vector segmentation, wherein the quantization process is to replace the subvector v _Ci stored in the ith subspace with the center code point number of the class closest to the corresponding partition, and the low-dimensional subvector v _Ci is represented by a symbol q _ci after quantization, v _Ci∈v_C,q_ci∈q_c;

All the high-dimensional vectors of the codes need to be quantized into low-dimensional vector representation at the stage, and the quantized code vector dataset is represented as Q _C,q_c∈Q_C;

step 5: preprocessing natural language query input by a user to obtain a natural language sequence Q;

Performing word segmentation on natural language query input by a user by using a word segmentation tool, and removing special characters and redundant fields contained in the query to obtain a natural language sequence Q;

Step 6: constructing a natural language query vector; converting the natural language sequence Q obtained in the step 5 into a natural language query vector D _Q by using a vector embedding mode by using a natural language feature extraction method;

Step 7: carrying out dimension segmentation on the natural language query vector; performing dimension segmentation on the natural language query vector obtained in the step 6 by using a method of performing dimension segmentation on the code segment high-dimension feature vector, namely performing vector dimension segmentation on the natural language query vector D _Q, and dividing the natural language query vector D _Q into M segments of sub-vectors, wherein the M segments of sub-vectors are represented as [ D _Q1,D_Q2,...,D_QM ];

Step 8: finding the code segment most similar to the natural language query vector obtained in the step 7 in the code database, and returning the result; the process is executed in two stages, namely a query table constructing stage and a natural language query stage;

Step 8-1: executing a query table construction stage, namely constructing a query table for the query vectors, wherein the query table stores the distances between each query sub-vector and all clustering center points in the corresponding subspace; acquiring M segments of low-dimensional subvectors [ D _Q1,D_Q2,...,D_QM ] of the natural language query vector generated in the step 7 and a code point set generated in the step 4; creating a lookup table for each query sub-vector D _Qi, and aiming at the lookup table of the ith sub-vector of the natural language query vector, wherein the distance from each code point in the ith sub-space to the query sub-vector D _Qi is stored, each query sub-vector needs to construct the lookup table, and all the lookup tables are spliced to obtain a lookup table for storing vector distances;

Step 8-2: executing a natural language query stage, obtaining a query table constructed in the step 8-1, and performing natural language query; when inquiring, comparing the natural language inquiry vector with all code segment characteristic vectors in the code database to obtain code segment search results, wherein the specific method comprises the following steps:

1) Obtaining a certain code segment feature vector v _X in a code database, and extracting quantized representations [ q _X1ID,q_X2ID,...,q_XM ID ] corresponding to M sub-vectors v _X1,v_X2,...,v_XM of the vector, namely cluster center point codes nearest to the M sub-vectors;

2) Calculating the distance between the natural language query vector D _Q and the code segment feature vector v _X, namely calculating the sum of the distances between each query sub-vector D _Qi and the code segment quantized representation q _Xi ID, and when the distance between the ith query sub-vector and the ith code segment sub-vector quantized representation is calculated, acquiring the corresponding distance in a query table, and recording as dis _i; calculating the distance between each section of natural language query subvector and each section of code subvector to obtain a vector distance list dis= { dis ₁,dis₂,...,dis_M };

3) Accumulating the distances in the vector distance list dis to obtain a distance dis _X-Q between the natural language query vector D _Q and the code segment feature vector v _X;

Using the steps 1) -3) to calculate the distance between the natural language query vector D _Q and the vectors corresponding to all codes in the code database, and storing the feature vectors V _C-TopK of the K code fragments with the smallest distance as search results;

Step 9: executing a search result rearrangement stage, and carrying out secondary screening on search results returned by the inquiry stage; and 8, obtaining feature vectors V _C-TopK of the top K code segments returned in the natural language query stage in the step, calculating cosine similarity between the natural language query vector and all vectors in V _C-TopK, sequencing the feature vectors of the code segments from high to low according to cosine similarity results, and returning the top n code segments as code search results.

The beneficial effects of adopting above-mentioned technical scheme to produce lie in: the invention provides a product quantization-based efficient code searching method, which comprises the following steps of (1) carrying out dimension segmentation and clustering on high-dimensional vectors so as to quantize each high-dimensional vector into a low-dimensional vector. In the method, a high-dimensional vector is divided into a plurality of low-dimensional sub-vectors in a quantization stage, the plurality of low-dimensional sub-vectors are mapped into a plurality of subspaces, clustering training is carried out on the sub-vectors in each subspace, so that a group of optimal cluster center point code point representations are obtained in each subspace, the cluster center code points closest to the ion vector in each subspace are used for quantizing the segment of sub-vectors, and the plurality of sub-vectors are spliced to obtain the low-dimensional quantized representation of the high-dimensional vector. In quantizing the high-dimensional vector representation, there is no loss of information contained in the excessive vectors. When the vector is segmented, rich semantic and structural information in the source code vector is reserved to a great extent, and the condition that the code searching accuracy is reduced due to information loss is reduced;

(2) In the query stage, the distances between each query sub-vector and all clustering center code points in the corresponding subspace are stored in a manner of calculating a query table, and when the similarity between vectors is calculated, namely the sum of vector distances after the quantization of each natural language query sub-vector and each code segment sub-vector is calculated, the distances can be directly obtained in the query table, so that the calculation cost is reduced, the code searching speed is improved, and the code searching time is greatly shortened;

(3) In order to further improve the accuracy of code search results, a rearrangement mechanism is introduced to carry out secondary screening on the code search results returned by the query stage, namely, the cosine similarity degree between the natural language query vector and a plurality of code segment feature vectors returned by the query stage is recalculated, and the code segments corresponding to the code segment feature vectors of the n top ranking are used as code search final results according to the sequence from high to low of the results.

Compared with the existing code searching method, the method can greatly improve the code searching speed under the condition of not sacrificing excessive accuracy of the baseline code searching model. In the method, the high-dimensional vector segmentation and clustering method is used, the problems of information loss, high calculation cost and the like possibly generated when the high-dimensional vectors are clustered directly are solved, and when the distances between the natural language query vector and all code segment vectors in a code library are calculated, the distances between the quantized query vector and the quantized code segment feature vectors are calculated, so that the problem of high calculation cost when the distances between the high-dimensional vectors are calculated is solved.

Drawings

FIG. 1 is a flow chart of a method for efficient code search based on product quantization according to an embodiment of the present invention;

FIG. 2 is a diagram of a set of code segments and information examples preprocessed by natural language description according to an embodiment of the present invention;

FIG. 3 is a flowchart of a training phase of feature vector data of a code segment according to an embodiment of the present invention;

FIG. 4 is a flow chart of a natural language query phase provided by an embodiment of the present invention;

FIG. 5 is a flowchart of a rearrangement phase according to an embodiment of the present invention;

FIG. 6 is a graph of comparison results of search results of different code search techniques when different code search models are used as baseline models, according to an embodiment of the present invention, (a) is a change in accuracy of search results of applying different code search techniques to a baseline model of DeepCS, (b) is a change in accuracy of search results of applying different code search techniques to a baseline model of CodeBert, (c) is a change in accuracy of search results of applying different code search techniques to a baseline model of GraphCodeBert, and (d) is a change in accuracy of search results of applying different code search techniques to a baseline model of GSMM;

fig. 7 is a diagram illustrating code searching examples of applying a product quantization-based efficient code searching method to a baseline code searching model according to an embodiment of the present invention.

Detailed Description

The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.

In this embodiment, a product quantization-based efficient code search method, as shown in fig. 1, includes the following steps:

Preprocessing the code segments and the natural language description, removing repeated code segments in the data set, and deleting the method body with annotation content less than the set word number; removing special characters and redundant fields in the natural language description of each code segment; extracting function method name information, API (application programming interface) information, token mark information and various characteristic information of a program dependency graph of the code segment;

The codes in the data set of code search are function levels, the method names of the functions can roughly describe the functions of one function, and are important characteristics in understanding code semantic information; for extracting the code API information, using DeepAPI extraction method, using Eclipse JDT compiler to analyze the source code file into abstract syntax tree, and using abstract syntax tree method of traversing method to extract the API information in the code segment; the token mark vocabulary of the code consists of keywords, operators and identifiers, but new identifier names are often customized due to programming habits of different programmers, so that in order to better extract token mark information, a keyword table is established, and a StanfordCore NLP method is used for word segmentation of Java code fragments by combining with related constraint rules such as grammar structure rules, type constraint rules and the like, so that code token mark information is obtained; the code program dependency graph comprises rich data dependency information and control dependency information, the tool TinyPDG is used for extracting node information and data flow information in the code aiming at the extraction of the program dependency graph, the code is converted into graph structure representation, and a depth-first traversal mode is used for generating a code program dependency graph sequence, so that code program dependency graph characteristic information is obtained;

In fig. 2, characteristic information and natural language description information of a set of code fragments and natural language description data thereof in a data set are shown, wherein in fig. 2 (a), a set of code fragments and natural language description thereof in the data set is shown, the natural language description is "Copies bytes from a large (over 2 GB) InputStream to an outputstream.", meaning that bytes are copied from a large input stream exceeding 2GB into an output stream, the function described by the code fragments is "copy data from one input stream into one output stream", and in fig. 2 (b), the preprocessed natural language description is shown, and method name information, API information, token mark information and program dependency graph information of the code fragments are shown;

Step 3-1: converting function method name information extracted from the code segment into a method name sequence MSeq, converting API (application programming interface) information extracted from the code segment into an API sequence APISeq, converting Token mark information extracted from the code segment into a Token mark sequence TSeq, converting a program dependency graph of the code segment into a corresponding graph sequence GPDGSeq, and performing word segmentation on natural language descriptions corresponding to the code segments obtained from the step 2 by using a word segmentation tool to obtain a natural language description sequence DSeq corresponding to each code segment;

Step 3-2: respectively learning various semantic features and structural features in the code segments by using 4 current representative code search models, fusing feature vectors of different features, and converting the code segments into corresponding high-dimensional feature vectors v _C;

The 4 code search models used in the present embodiment include a code search algorithm DeepCS model and a GSMM model based on multi-feature fusion, and a code search algorithm CodeBert model and a GraphCodeBert model based on a pre-training model;

Step 4-1: a schematic flow chart of the code segment feature vector data training phase is shown in fig. 3. In a code segment feature vector data training stage, embedding the code segment feature vector obtained in the step 3 into Cartesian products of M disjoint partitions, and quantizing each partition into K clusters, namely dividing each high-dimensional code segment feature vector v _C into M low-dimensional sub-vector segments, wherein the sub-vector segments are respectively stored in different subspaces; for one code segment feature vector v _C∈R^MD, it is segmented into M equal length D sub-vectors, i.e., v _C＝[v_c1,v_C2,...,v_CM ], where the i-th sub-vector v _Ci of v _c is stored in the i-th subspace; outputting the segmented code segment feature vector to the step 4-2;

K-means clustering is needed to be carried out on the sub-vectors in each subspace during training, a data set is continuously trained, so that classification parameters are optimized, and an optimal group of clustering center points in each subspace are obtained, and the optimal clustering center points in each subspace enable the square sum of the distances from each vector in the space to the nearest clustering center to the vector to be minimum; each class center point of the cluster is represented by an ID number, the center point is also called a code point, and the cluster center in each subspace is stored to form a code point set corresponding to the code fragment data set; in this embodiment, the number K of clusters is 256, and the corresponding ID number indicates a range of 0 to 255;

Outputting the obtained clustering center in each subspace to a code segment feature vector data quantization stage in the step 4-2 and a query stage in the step 8; the implementation details of constructing the code point set in this embodiment are as follows:

(1) Initializing a clustering center: k classes of clustering are carried out in each subspace, K clustering centers are initialized, and because the high-dimensional vector is divided into M segments of low-dimensional vectors when the high-dimensional vector is segmented, the class number K of the clustering is set to 256 corresponding to the M subspaces, and 256 classes of clustering are carried out in each subspace, 256 code points are arranged in each subspace, namely 256 ⁸ possible product quantization low-dimensional vector codes are formed in total;

(3) Training to update data centers: after sub-vector data are distributed to the clustering centers in each subspace, a trained loss function is calculated, the loss function represents the square sum of distances from all sub-vectors to the nearest clustering center, and after the loss function is calculated, the average value of all sub-vectors distributed to a certain clustering center is used for updating the clustering center;

the loss function formula is shown below;

Wherein C _j represents the j-th cluster, x is a sub-vector data point belonging to the cluster, C _j is the cluster center code point, and iix-C _j‖² is the square of euclidean distance, and since the sub-vectors are clustered into 256 types in each subspace, the value of j is from 0 to 255;

(4) Checking convergence: after each updating of the cluster center, if the reduction amplitude of the calculation loss function is lower than a preset threshold value, or the change between the current cluster center and the cluster center before updating is smaller than the preset threshold value, or the updating iteration number reaches one of three conditions of the preset maximum iteration number, stopping updating, wherein the obtained cluster center is the optimal cluster center group in each subspace;

(5) Constructing a code point set: storing the cluster centers in each subspace to form a code point set corresponding to the code fragment data set;

step 4-2: in the code segment feature vector data quantization stage, quantizing the code segment feature vector using the code point set obtained in the code segment feature vector data training stage in step 4-1;

In this process, each sample v _C in the code segment feature vector obtained in step 3 is quantized into one short vector q _c of M dimensions; obtaining M-segment low-dimensional subvectors v _Ci, i=1, 2 after segmentation of the code segment feature vector generated in step 4-1, M, replacing the subvector v _Ci stored in the i-th subspace with the nearest class center code point number in the corresponding partition, namely for the i-th partition (M partitions are all provided, and i epsilon {1,2, M }) there is one cluster center c _ik∈R^D for each partition (K epsilon {1,2,., K }) and if the i-th subvector v _Ci of the vector v _C is c _ik from the nearest cluster center, the quantized value of v _Ci is c _ik, denoted by symbol q _ci, wherein q _ci∈q_c; the formula for finding the nearest class center code point of the clustering sub-vector is as follows:

the formula shows that the value of q _ci is the value of c when the objective function v _Ci-c||² is minimum, wherein centre _i represents the set of all class center points in the ith subspace partition, v _Ci is the ith subvector of the code fragment feature vector, and c is a certain class center point in the ith subspace partition; each element in the short vector of M dimension is the serial number of the nearest cluster center point of different sub-vectors in the corresponding subspace;

In the stage, all the high-dimensional characteristic vectors of the code segments are required to be quantized into low-dimensional vector representation, and the quantized code vector dataset vector is represented as Q _C,q_c∈Q_C; outputting the quantized code vector data set to the query stage in the step 8;

Performing word segmentation on the natural language query input by the user by using a word segmentation tool, removing special characters and redundant fields contained in the query to obtain a natural language sequence Q, and outputting the natural language sequence Q to a natural language query vector construction stage in the step 6;

Step 6: constructing a natural language query vector; converting the natural language sequence Q obtained in the step 5 into a natural language query vector D _Q by using a vector embedding mode by using a natural language feature extraction method, and transmitting the natural language query vector D _Q to a natural language query vector dimension segmentation stage in the step 7;

Step 7: carrying out dimension segmentation on the natural language query vector; performing dimension segmentation on the natural language query vector obtained in the step 6 by using a method of performing dimension segmentation on the code segment high-dimension feature vector, namely performing vector dimension segmentation on the natural language query vector D _Q, dividing the natural language query vector D _Q into M segments of sub-vectors, representing the M segments of sub-vectors as [ D _Q1,D_Q2,...,D_QM ], and outputting a natural language query sub-vector [ D _Q1,D_Q2,...,D_QM ] to a query stage of the step 8;

step 8: finding the code segment most similar to the natural language query vector obtained in the step 7 in the code database, and returning the result; the query phase is executed in two phases, namely a query table construction phase and a natural language query phase, and the flow diagram of the query phase is shown in fig. 4, and specific implementation details are as follows.

Step 8-1: in the stage of constructing a lookup table, a lookup table is required to be constructed for the query vector, and the distance between each query sub-vector and all clustering center points in the corresponding subspace is stored in the lookup table; acquiring M segments of low-dimensional subvectors [ D _Q1,D_Q2,...,D_QM ] of the natural language query vector generated in the natural language query vector dimension segmentation stage in the step 7 and a code point set generated in the code segment feature vector data training stage in the step 4; for each query sub-vector D _Qi, i=1, 2,., M creates a lookup table for the i-th sub-vector of the natural language query vector, wherein the distance from the 256 code points in the i-th sub-space to the query sub-vector D _Qi is stored, each query sub-vector needs to construct a lookup table, and all the lookup tables are spliced to obtain a lookup table storing M-th 256 vector distances; outputting the lookup table to the natural language query stage of the step 8-2;

Step 8-2: in the natural language query stage, acquiring a constructed query table from the step 8-1, and performing natural language query; when inquiring, comparing the natural language inquiry vector with all code segment characteristic vectors in a code database to obtain code segment search results, wherein the specific method comprises the following steps:

2) The distance between the natural language query vector D _Q and the code segment feature vector v _X is calculated, i.e., the respective query sub-vectors D _Qi, i=1, 2,.. the sum of the distances between M, when calculating the distances of the i-th query sub-vector and the quantized representation of the i-th code segment sub-vector, the corresponding distances are obtained in a lookup table, noted as dis _i; calculating the distance between each section of natural language query subvector and each section of code subvector to obtain a vector distance list dis= { dis ₁,dis₂,...,dis_M };

Using the steps 1) -3) to calculate the distance between the natural language query vector D _Q and the vectors corresponding to all codes in the code database, and storing the feature vectors V _C-TopK of the K code fragments with the smallest distance as search results; outputting the current search result V _C-TopK to a search result rearrangement stage in the step 9;

Step 9: and (3) carrying out secondary screening on the search results returned by the query stage in the search result rearrangement stage, obtaining feature vectors V _C-TopK of K code fragments before ranking returned by the natural language query stage in the step (8), calculating cosine similarity between the natural language query vector and all vectors in V _C-TopK, sequencing the feature vectors of the code fragments from high to low according to the cosine similarity results, and returning n code fragments before ranking as the code search results.

The flow diagram of the rearrangement phase is shown in fig. 5. The formula for calculating cosine similarity between the natural language query vector and the code segment feature vector is as follows:

Wherein code represents a high-dimensional vector representation corresponding to the code segment, desc represents a high-dimensional vector representation corresponding to the natural language query, and cos (code) represents cosine similarity between the code segment feature vector and the natural language query vector.

In order to demonstrate the superiority of the code search method provided by the invention in improving the code search speed, the embodiment designs a plurality of comparison experiments to evaluate the performance of executing code search on the same data set by using different code search algorithms. The better the code searching performance, the less the code searching time, and the higher the code searching result accuracy.

The product quantization-based code search (PQ-CS) method of the present invention is experimentally applied to several of the most advanced and representative code search methods at present to show the superiority of PQ-CS in improving code search speed. In designing the comparison experiment, in addition to comparing the code search initial performance of the above 4 code search algorithms, a locality sensitive hashing LSH and a clustering classification algorithm K-means are applied to the above 4 baseline code search models, and the code search performance is analyzed to compare with the code search performance applied to the baseline code search model using PQ-CS. The local sensitive hash LSH contrast experiment is to map the high-dimensional vector representation of the code segment to the low-dimensional hash representation by using a hash function, map the natural language query vector to a hash code by using the hash function in the search stage, and return the result of the top 10 which is found in the hash representation of the code segment and has the most similar meaning with the natural language query hash representation as a code search result by calculating one by one. The clustering classification algorithm K-means comparison experiment is to classify the high-dimensional vectors of the code segments in advance, so that each vector is distributed into the class containing the cluster center point closest to the vector, record all the cluster center point vectors, compare the query vector with all the cluster center point vectors when code searching is executed, thereby finding the closest cluster center, further calculate the cosine distance between the vector corresponding to each code segment in the class and the query vector, and return the result of the ranking of top 10 as the code searching result.

In this embodiment, on CodeSearchNet datasets, the code search effectiveness of the baseline algorithm and the baseline algorithm applying the LSH, K-means and PQ-CS technologies are respectively evaluated, and indexes of the accuracy of the search results of the 4 evaluation code search models used in the experiment are sr@1 (success rate@1), sr@5 (success rate@5), sr@10 (success rate@10) and MRR (Mean Reciprocal Rank), and different technologies are applied to the index pairs in the 4 code search baseline model experiments as shown in table 1.

From the experimental results in table 1, it is shown that on the CodeSearchNet dataset, for DeepCS, codeBert, graphCodeBert, GSMM four baseline algorithms, the code search model using the suboptimal LSH method was respectively reduced by 36.9%,37.5%,36.2%,40.2% on the sr@1 evaluation index, respectively reduced by 27.9%,26.2%,23.9%,28.9%, on the sr@5 evaluation index, respectively reduced by 23.6%,21.8%,19.7%,24.4% on the sr@10 evaluation index, respectively reduced by 31.6%,31.7%,30.0%,33.9% on the MRR evaluation index, and the accuracy drop was excessive. Compared with the original baseline code search model, the code search model using the PQ-CS method is reduced by 5.10%,6.86%,5.68%,4.26% on the SR@1 evaluation index, 3.90%,3.55%,3.21%,2.98% on the SR@5 evaluation index, 2.99%,2.41%,2.36%,2.49% on the SR@10 evaluation index, 3.95%,5.12%,4.65%,3.70% on the MRR evaluation index, and the accuracy of the code search model is similar to that of the baseline code search model, namely, excessive search accuracy of the baseline code search model is not lost. Compared with a search model using an LSH method, the code search model using the PQ-CS provided by the invention can be respectively increased by 50.29%,48.90%,47.78% on an SR@1 evaluation index, can be respectively increased by 33.43%,30.74%,27.29%,36.45% on an SR@5 evaluation index, can be respectively increased by 26.89%,24.75%,21.51%,29.01% on an SR@10 evaluation index, and can be respectively increased by 40.50%,38.98%,36.24% and 45.87% on an MRR evaluation index, compared with the code search model using the PQ-CS method, the code search model using the PQ-CS method has higher search accuracy. According to index changes of various evaluation code search accuracy rates, it can be seen that the application of the PQ-CS technology to 4 kinds of baseline code search models can enable the models to maintain the accuracy rate of the baseline code search models to be 96.05% or more on various indexes, wherein the average accuracy rate of the baseline code search models can be 96.59% on the SR@5 index.

Table 1 code search result accuracy vs. results for different techniques applied to 4-term baseline model

The impact of different code search techniques on search result accuracy when using different code search models as baseline models is illustrated in fig. 6. Fig. 6 (a), 6 (b), 6 (c) and 6 (d) show the change of search result accuracy by applying different code search techniques when DeepCS, codeBert, graphCodeBert, GSMM is taken as a baseline model, wherein the change of search accuracy by using the PQ-CS technique on the baseline model, the LSH technique on the baseline model and the K-means technique on the baseline model is respectively from left to right, and different evaluation indexes in 4 are used to show the change of search accuracy of the different code search techniques, the circle represents the evaluation index sr@1, the cross represents the evaluation index sr@5, the triangle represents the evaluation index sr@10, and the square represents the evaluation index MRR. It can be seen from 4 graphs that, on each evaluation index, the accuracy of the search result of the code search model applying the PQ-CS technology is similar to that of the original code search model, the connecting line of the corresponding marks in the graphs is similar to the horizontal line, and the accuracy value corresponding to the marks in the graphs of the comparative experiments applying the other two technologies is obviously reduced compared with that of the baseline model. Through the experimental results, compared with a code search model using an LSH technology and a K-means technology, the code search model using the PQ-CS technology can be seen to have greatly improved search result accuracy and similar search accuracy to a baseline model.

In order to demonstrate the superiority of the PQ-CS in improving the code search speed, the code search speed is compared with the code search speed of the baseline model by the code search model applying the PQ-CS, and the index of the estimated model search speed is the code search time. Wherein the code search time is defined as the sum of the similarity calculation time of the code segment vector and the natural language query vector and the time of ranking the search results. The experimental results are shown in Table 2, which demonstrates the comparison of search times for the baseline model and the code search model using PQ-CS.

TABLE 2 comparison of search time for baseline model and code search model applying PQ-CS

As can be seen from the experimental results of the table, the application of the PQ-CS technology to DeepCS, codeBert, graphCodeBert, GSMM baseline code search models greatly reduces the code search time, reduces the code search time by about 80.0%, and greatly improves the code search speed.

The efficient code search model can greatly improve the code search speed and shorten the search time when the accuracy of the search result is similar to that of the original code search model. As can be seen from the experimental results, the application of the PQ-CS technology in the baseline code search model can greatly shorten the code search time, reduce the code search time by about 80.0% and improve the code search speed when the model keeps the accuracy of at least 96.05% of the original code search model.

An example of code search in which PQ-CS technology is applied to a GSMM code search baseline model is shown in fig. 7. The user query sentence "removes whitespace from strings" in the example is derived from a classical query sentence commonly used for Code Search in the Deep Code Search, the query sentence meaning is "delete space characters in character strings", code Search is performed, and a first-ranking Search result is obtained, the returned Code segment is shown in the figure, the function of the Code is to delete space characters in character strings, the semantic similarity with the query sentence "removes whitespace from strings" is achieved, and the effectiveness of applying the PQ-CS technology to the Search result in the Code Search model is proved.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions, which are defined by the scope of the appended claims.

Claims

1. An efficient code search method based on product quantization, characterized in that it comprises the following steps:

Step 1: Obtain code snippets and natural language descriptions corresponding to the code snippets; by collecting the original code snippet dataset and the natural language descriptions corresponding to the code snippets, a code snippet set and a natural language description set are constructed;

Step 2: Preprocess the acquired code snippets and corresponding natural language descriptions, and extract various feature information of the code snippets;

Step 3: Construct a code snippet feature vector; obtain multiple pieces of code feature information and natural language description information generated in step 2, and convert the multiple pieces of code feature information into corresponding code feature sequences, use a deep learning model to learn the semantic feature information and structural feature information in different feature sequences, map the code feature sequences into feature vectors, fuse multiple feature vectors to obtain a feature vector v _C of the code snippet, and then obtain a code snippet feature vector set V _C ;

Step 4: quantize the high-dimensional feature vector v _{C of each code snippet in the code snippet feature vector set V C} _obtained in step 3 into a low-dimensional vector q _C ;

Step 5: Preprocess the natural language query input by the user to obtain a natural language sequence Q;

Step 6: Natural language query vector construction: Using the natural language feature extraction method, the natural language sequence Q obtained in step 5 is converted into a natural language query vector D _Q using vector embedding;

Step 7: Dimension segmentation of the natural language query vector: Dimension segmentation of the natural language query vector obtained in step 6 is performed in the same way as dimension segmentation of the high-dimensional feature vector of the code snippet, that is, the natural language query vector D _Q is dimensionally segmented into M sub-vectors, represented as [D _Q1 ,D _Q2 ,…,D _QM ];

Step 8: Find the code snippet that is most similar to the natural language query vector obtained in step 7 in the code database and return the result;

Step 9: Execute the search result re-ranking phase, and perform secondary screening on the search results returned in the query phase; obtain the feature vectors of the top K code snippets returned in the natural language query phase in step 8, and calculate the cosine similarity between the natural language query vector and the K code snippet feature vectors, sort the code snippet feature vectors from high to low according to the cosine similarity results, and return the top n code snippets as the code search results.

2. According to claim 1, an efficient code search method based on product quantization is characterized in that: step 2 preprocesses the code snippets and natural language descriptions, removes repeated code snippets in the data set, and deletes the method bodies whose comments are less than the set number of words; extracts the function method name information, API information, Token token tag information and program dependency graph of the code snippet.

3. According to claim 2, an efficient code search method based on product quantization is characterized in that: the step 2 uses the camel case naming rule to split the function method name in the code snippet into a combination of multiple words; the source code file is parsed into an abstract syntax tree, and the API information in the code snippet is extracted using the abstract syntax tree method of traversing the method body; in order to extract token marking information, a keyword table is established, and the Stanford Core NLP method is combined with relevant constraint rules to segment the code snippet, thereby obtaining code token marking information; node information and data flow information in the code are extracted, the code is converted into a graph structure representation, and a program dependency graph sequence of the code is generated using a depth-first traversal method, thereby obtaining the program dependency graph feature information of the code.

4. The efficient code search method based on product quantization according to claim 3 is characterized in that: the specific method of step 3 is:

Step 3-1: Convert the function method name information extracted from the code snippet into the method name sequence MSeq, convert the API information extracted from the code snippet into the API sequence APISeq, convert the Token token information extracted from the code snippet into the Token token sequence TSeq, convert the program dependency graph of the code snippet into the corresponding graph sequence GPDGSeq, use the word segmentation tool to perform word segmentation on the natural language description corresponding to the code snippet obtained in step 2, and remove the special characters and redundant fields contained in the query to obtain the natural language description sequence DSeq corresponding to each code snippet;

Step 3-2: Use the current four code search models to learn various semantic features and structural features in the code snippet, fuse the feature vectors of different features, and transform the code snippet into the corresponding high-dimensional feature vector v _C ;

In this step, the code snippet feature vector and its corresponding natural language description vector need to be used to train the code search model, calculate the loss function, and optimize the model parameters to obtain the final code snippet feature vector set.

5. The efficient code search method based on product quantization according to claim 4 is characterized in that: the step 4 includes two stages, namely, a code fragment feature vector data training stage and a code fragment feature vector data quantization stage;

In the code snippet feature vector data training phase, each code snippet feature vector v _C obtained in step 3 is evenly divided into M low-dimensional sub-vector segments, which are stored in different subspaces; that is, for a code snippet feature vector v _C ∈ R ^MD , it is segmented into M sub-vectors of the same length D, that is, v _C = [v _C1 ,v _C2 ,…,v _CM ], where the i-th sub-vector v _Ci of v _C is stored in the i-th subspace, i = 1, 2,…, M; during training, K-means clustering is performed on the sub-vectors in each subspace, and the classification parameters are optimized by continuously training the data set to obtain the optimal set of cluster center points in each subspace; the cluster category center point is represented by an ID number, and the center point is also called a code point, and the set of cluster centers in each subspace is the code point set;

In the code snippet feature vector data quantization stage, each code snippet feature vector v _C is quantized into an M-dimensional short vector q _c using the code point set obtained in the code snippet feature vector data training stage;

Obtain the M low-dimensional subvectors v _Ci of the code snippet feature vector segmented in the code snippet feature vector data training phase. The quantization process is to replace the subvector v _Ci stored in the i-th subspace with the nearest class center code point number in its corresponding partition. The low-dimensional subvector v _Ci is quantized and represented by the symbol q _ci , v _Ci ∈v _C , q _ci ∈q _c ;

In the code snippet feature vector data quantization stage, all high-dimensional code vectors need to be quantized into low-dimensional vector representations. The quantized code vector data set is represented as Q _C , q _c ∈Q _C .

6. The efficient code search method based on product quantization according to claim 5, characterized in that: the process of constructing the code point set in the code fragment feature vector data training phase is as follows:

(1) Initialize cluster centers: Perform clustering in each subspace and initialize K cluster centers;

(2) Assigning subvector data: For a certain subspace, according to the cluster center determined in (1), all subvectors in the subspace are assigned to the nearest cluster center;

(3) Training to update the data center: Calculate the loss function and update the cluster center using the average value of all subvectors assigned to a cluster center;

(4) Checking convergence: After updating the cluster centers, if the decrease in the loss function is lower than the preset threshold, or the change in the cluster centers is less than the preset threshold, or the number of update iterations reaches the preset maximum number of iterations, then stop updating. The obtained cluster centers are the optimal set of cluster centers in each subspace.

(5) Constructing code point sets: The set of cluster centers in each subspace is the code point set corresponding to the code snippet dataset.

7. According to claim 6, an efficient code search method based on product quantization is characterized in that: step 5 uses a word segmentation tool to segment the natural language query input by the user, and removes special characters and redundant fields contained in the query to obtain a natural language sequence Q.

8. The method of efficient code search based on product quantization according to claim 7 is characterized in that: step 8 is performed in two stages, namely, a query table construction stage and a natural language query stage, specifically:

Step 8-1: Execute the query table construction phase to construct a query table for the query vector. The query table stores the distances between each query subvector and all cluster center points in its corresponding subspace.

Step 8-2: Execute the natural language query phase, obtain the query table constructed in step 8-1, and perform a natural language query; during the query, compare the natural language query vector with all the code snippet feature vectors in the code database to obtain the code snippet search results.

9. The efficient code search method based on product quantization according to claim 8, characterized in that: the specific method of step 8-1 is:

Obtain M low-dimensional subvectors [D _Q1 ,D _Q2 ,…,D _QM ] of the natural language query vector generated in step 7, and the code point set generated in step 4; create a query table for each query subvector D _Qi , and a query table for the i-th subvector of the natural language query vector, which stores the distance from each code point in the i-th subspace to the query subvector D _Qi . A query table needs to be constructed for each query subvector, and all query tables are concatenated to obtain a query table that stores vector distances.

10. The efficient code search method based on product quantization according to claim 9, characterized in that: the specific method of step 8-2 is:

1) Obtain a code snippet feature vector v _X in the code database, and extract the quantized representations [q _X1 ID,q _X2 ID,…,q _XM ID] corresponding to the M sub-vectors v _X1 ,v _X2 ,…,v _XM of the vector, that is, the cluster center code closest to the M sub-vectors;

2) Calculate the distance between the natural language query vector D _Q and the code snippet feature vector v _X , that is, calculate the sum of the distances between each query subvector D _Qi and the code snippet quantized representation q _Xi ID. When calculating the distance between the i-th query subvector and the i-th code snippet subvector quantized representation, the corresponding distance is obtained in the query table and recorded as dis _i ; calculate the distance between each natural language query subvector and each code subvector to obtain a vector distance list dis = {dis ₁ , dis ₂ , …, dis _M };

3) Accumulate the distances in the vector distance list dis to obtain the distance dis _XQ between the natural language query vector D _Q and the code snippet feature vector v _X ;

Using the above steps 1)-3), the distance between the natural language query vector D _Q and the vectors corresponding to all codes in the code database is calculated, and the feature vectors V _C-TopK of the top K code snippets with the smallest distance are stored as search results.