CN117951251A - An efficient code search method based on product quantization - Google Patents

An efficient code search method based on product quantization Download PDF

Info

Publication number
CN117951251A
CN117951251A CN202410052881.5A CN202410052881A CN117951251A CN 117951251 A CN117951251 A CN 117951251A CN 202410052881 A CN202410052881 A CN 202410052881A CN 117951251 A CN117951251 A CN 117951251A
Authority
CN
China
Prior art keywords
code
vector
natural language
query
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410052881.5A
Other languages
Chinese (zh)
Inventor
印莹
宫玉琪
王驰
徐久强
赵宇海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN202410052881.5A priority Critical patent/CN117951251A/en
Publication of CN117951251A publication Critical patent/CN117951251A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3347Query execution using vector based model
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明提供一种基于乘积量化的高效代码搜索方法,涉及代码搜索技术领域。该方法首先获取代码片段和代码片段对应的自然语言描述;并对获取的代码片段和对应的自然语言描述,进行预处理,提取代码片段的多种特征信息;然后构建代码片段特征向量;并将代码片段特征向量量化成低维向量;再针对用户输入的自然语言查询进行预处理,得到自然语言序列;进而构建自然语言查询向量;并对自然语言查询向量进行维度分段;进而为每个自然语言查询子向量构建查询表;最后在代码数据库中查找到与自然语言查询向量最为相似的代码片段;并使用重排对查询结果进行二次筛选。该方法可以在不牺牲基线代码搜索模型过多准确率的情况下,大幅度提升代码搜索速度。

The present invention provides an efficient code search method based on product quantization, and relates to the field of code search technology. The method first obtains a code snippet and a natural language description corresponding to the code snippet; and pre-processes the obtained code snippet and the corresponding natural language description to extract multiple feature information of the code snippet; then constructs a code snippet feature vector; and quantizes the code snippet feature vector into a low-dimensional vector; and then pre-processes the natural language query input by the user to obtain a natural language sequence; and then constructs a natural language query vector; and dimensionalizes the natural language query vector; and then constructs a query table for each natural language query subvector; finally, finds the code snippet most similar to the natural language query vector in the code database; and uses rearrangement to perform secondary screening on the query results. The method can greatly improve the code search speed without sacrificing too much accuracy of the baseline code search model.

Description

Efficient code searching method based on product quantization
Technical Field
The invention relates to the technical field of code searching, in particular to a product quantization-based efficient code searching method.
Background
Code searching is a core task in the field of software engineering, and is defined as a natural language sentence of a given description program language, and a code segment with the highest similarity with natural language description can be found in a massive code segment database to return. With the continuous development of internet technology, software has been applied to aspects of human life, and the requirements of users on functions of the software are also increasing. In the process of software development, a program developer usually searches code segments with similar functions on the internet by using natural language sentences describing the code functions so as to reuse the code segments and improve the efficiency of software development. Along with the increase of the database scale of the open source codes on the network by exponential times, the process of software development can be greatly accelerated by improving the efficiency of code retrieval.
The traditional code searching technology regards source codes as plain texts, and returns code searching results by matching text similarity between codes and natural language queries one by one in a code dataset, and the code searching efficiency of the method is low because the code fragments contain text information, rich semantic grammar information and structured programming ideas. In order to fully understand the information in the source code, most research is currently done to embed the code fragments and natural language descriptions into the vector space using deep learning methods, and to return the results of the code search by comparing the cosine similarity of the code vector to the natural language description vector. Deep learning based code search models typically divide code search into two phases, a code segment and natural language query embedding into vector phase, and a high-dimensional vector retrieval phase, respectively.
Aiming at the code and natural language description embedding stage, since the code fragments contain rich code semantic information and structural feature information, the current research is to consider the characteristic information specific to various codes, such as code fragment method names, token mark sequences, code diagram structural information and the like, in the code feature mapping stage.
The Deep Code Search of paper proposes a Code embedding network CoNN for embedding source codes as vectors and a description embedding network DeNN for embedding natural language descriptions as vectors, in the Code embedding network CoNN, three aspects of feature information of the codes, function names, API call sequences and Token sequences are considered, the source codes of each method level are extracted, the Code feature information of the three aspects is extracted, corresponding feature vectors are generated, the three feature vectors are fused into feature vectors of Code fragments and mapped into a vector space, cosine similarity comparison is carried out between the feature vectors and the description vectors generated by DeNN, and the higher the similarity degree is, the more similar the semantics of the feature vectors and the description vectors are. According to the method, only text semantic features of codes are considered when the code segments are embedded into the code feature vectors, and abundant structural features of the codes are ignored, so that the code feature vectors do not contain comprehensive code feature information, and the accuracy of the code searching method is reduced.
In order to consider richer code structure information, the Chinese patent CN 115268869A uses a graph sequence converter G2SC to learn rich data dependency information and control dependency information in codes, obtains a program dependency graph sequence of the codes, and uses an attention mechanism to fuse the program dependency graph sequence characteristics with semantic characteristic information of other code fragments to obtain code characteristic vectors containing semantic characteristic information and structural characteristic information of rich source codes. The method can improve the code searching accuracy, but the code searching result is obtained by calculating the similarity degree between the natural language query vector and each code feature vector in the code library one by one when the code searching is executed, and the process has higher calculation cost and higher calculation complexity, so that the code searching time is longer.
For the high-dimensional vector retrieval stage, the nearest neighbor retrieval method of the high-dimensional vector is used when the similarity between the code vector and the natural language description vector is calculated at present, because the number of code fragments in a code database grows in an exponential level, the vector data set after being converted into the vector has huge scale, when accurate nearest neighbor retrieval is used, if the code fragment vector and the natural language query vector are both 512-dimensional vectors, 512 multiplications and 512 additions are needed when the similarity between the two vectors is calculated, the calculation complexity is too high, and the corresponding retrieval time is too long. Therefore, approximate nearest neighbor search is generally used to replace accurate nearest neighbor search, but in the process of high-dimensional vector search, the search accuracy of the approximate nearest neighbor search is lower, and current research can improve the speed of code search by using a method for reducing some search accuracy.
In the chinese patent CN 116701698a, a clustering-based vector retrieval technique is proposed, in which, the method maps the original data into feature vectors by acquiring the feature combination of the data, and performs clustering processing on the feature vectors to determine the category attribution of each feature vector, and compares the feature vector of the query with the center point of each category during the query, thereby determining the category attribution of the next query, and continuing the query in the category, and returning the query result. The method can reduce the search range, thereby reducing the number of times of vector calculation and reducing the calculation cost. When the method is used for classifying the high-dimensional vector, although the calculation complexity of the subsequent vector can be reduced, the calculation cost is reduced, and the code searching speed is improved, the problem of search result errors caused by classification errors is easily caused because the high-dimensional vector is directly classified due to more noise information and information irrelevant to classification characteristics in the high-dimensional vector, and the code searching precision is reduced.
The paper EFFICIENT K-nearest neighbor search based on clustering AND ADAPTIVE K values proposes a pre-calculated KD tree structure caKD + integrating a clustering algorithm and feature learning to solve the problem of low efficiency in nearest neighbor searching, wherein the clustering algorithm is used to reduce the range of query during searching and improve the code searching efficiency, and the method further reduces the possibility of false generation of search result due to classification of query results at cluster boundaries by enhancing a clustering method. Although the method reduces the searching range and considers the situation that the searching result is positioned at the cluster boundary point and possibly leads to searching cluster errors, the method uses a KD tree structure to occupy a large amount of index space, consumes a large amount of calculation cost when adding the cluster boundary point for each cluster, has better performance when processing low-dimensional vector searching, and reduces the searching performance when processing code searching high-dimensional vector searching problems.
A model for improving the code search speed by using a clustering algorithm and a hash technique is proposed in paper ACCELERATING CODE SEARCH WITH DEEP HASHING AND code classification, the model performs clustering processing on the high-dimensional vectors of the codes, maps the high-dimensional vectors in each class to a low-dimensional hash code representation by using the hash technique, maps the natural language query into the hash representation by using the hash technique when generating a natural language query, determines the possible classes of the search result by comparing with the class center point, and finds the code fragments similar to the query hash semantics in the corresponding classes, and returns the code fragments as the code search result. The model can promote the code searching speed, shorten the code searching time and reduce the code searching range. In the model, since the code segments have complex structural information and semantic information, various code features are considered when the code segments are converted into high-dimensional vectors so that the vector representation can contain rich semantic information and structural information of codes, but when the hash technology is used for mapping high-dimensional continuous vectors into low-dimensional hash codes, partial information loss can be caused, and for the code segments containing complex control flow information, data flow information and the like, the complex structural information is difficult to capture by deep hash, so that the code searching speed is greatly improved when the code searching model performance is evaluated, and the code searching accuracy is greatly reduced.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a high-efficiency code searching method based on product quantization, which improves the code searching speed, shortens the code searching time and realizes the high-efficiency code searching.
In order to solve the technical problems, the invention adopts the following technical scheme: an efficient code search method based on product quantization comprises the following steps:
step 1: acquiring a code segment and natural language description corresponding to the code segment; constructing a code segment set codebase and a natural language description set descbase by collecting an original code segment data set and natural language descriptions corresponding to the code segments;
Step 2: preprocessing the code segments and the corresponding natural language descriptions obtained in the step 1, and extracting various characteristic information of the code segments;
preprocessing the code segments and the natural language description, removing repeated code segments in the data set, and deleting the method body with annotation content less than the set word number; removing special characters and redundant fields in the natural language description of each code segment; extracting function method name information, API information, token mark information and various characteristic information of a program dependency graph of the code segment;
Dividing function method names in the code fragments into a plurality of word combinations by adopting a hump naming rule; analyzing the source code file into an abstract syntax tree, and extracting API information in the code fragments by using an abstract syntax tree method of a traversing method body; in order to extract token mark information, a keyword list is established, and a StanfordCore NLP method is used for word segmentation of the code fragments in combination with related constraint rules, so that code token mark information is obtained; extracting node information and data flow information in the codes, converting the codes into graph structure representation, and generating a program dependency graph sequence of the codes by using a depth-first traversal mode so as to obtain program dependency graph characteristic information of the codes;
Step 3: constructing a code segment feature vector; acquiring multiple items of code feature information and natural language description information generated in the step 2, converting the multiple items of code feature information into corresponding code feature sequences, learning semantic feature information and structural feature information in different feature sequences by using a deep learning model, mapping the code feature sequences into feature vectors, and fusing the multiple feature vectors to obtain feature vectors V C of code segments to obtain a code segment feature vector set V C;
Step 3-1: converting function method name information extracted from the code segment into a method name sequence MSeq, converting API information extracted from the code segment into an API sequence APISeq, converting Token mark information extracted from the code segment into a Token mark sequence TSeq, converting a program dependency graph of the code segment into a corresponding graph sequence GPDGSeq, and performing word segmentation on the preprocessed natural language description obtained from the step 2 by using a word segmentation tool to obtain a natural language description sequence DSeq corresponding to each code segment;
Step 3-2: respectively learning various semantic features and structural features in the code segments by using the current 4 code search models, fusing feature vectors of different features, and converting the code segments into corresponding high-dimensional feature vectors v C;
training a code search model by using the code segment feature vectors and the corresponding natural language description vectors, calculating a loss function, and optimizing model parameters to obtain a final code segment feature vector set;
The 4 code search models used include a code search algorithm DeepCS model and a GSMM model based on multi-feature fusion, and a code search algorithm CodeBert model and a GraphCodeBert model based on a pre-training model;
Step 4: quantizing the high-dimensional feature vector V C of each code segment in the code segment feature vector set V C obtained in step 3 into a low-dimensional vector q C representation; the process needs to go through two stages, namely a code segment feature vector data training stage and a code segment feature vector data quantization stage;
Step 4-1: in a code segment feature vector data training stage, dividing each code segment feature vector v C obtained in the step 3 into M low-dimensional sub-vector segments, and respectively storing the M low-dimensional sub-vector segments in different subspaces; i.e. for one code segment feature vector v C∈RMD, it is segmented into M sub-vectors of the same length D, i.e. v C=[vC1,vC2,...,vCM ], wherein the i-th sub-vector v Ci of v C is stored in the i-th subspace, i=1, 2, M; k-means clustering is carried out on the sub-vectors in each subspace during training, and the classification parameters are optimized through continuous training of the data set, so that an optimal group of clustering center points in each subspace is obtained; the class center points of the clusters are represented by an ID number, the center points are also called code points, and the cluster center sets in each subspace are code point sets; the process of constructing the code point set is as follows;
(1) Initializing a clustering center: clustering is carried out in each subspace, and K clustering centers are initialized;
(2) Assigning sub-vector data: for a certain subspace, assigning all the subvectors in the subspace to the nearest cluster center according to the cluster center determined in the step (1);
(3) Training to update data centers: calculating a loss function, and updating a clustering center by using an average value of all sub-vectors distributed to the certain clustering center;
(4) Checking convergence: after updating the clustering center, if the reduction amplitude of the loss function is lower than a preset threshold value, or the change of the clustering center is smaller than the preset threshold value, or the updating iteration number reaches one of three conditions of a preset maximum iteration number, stopping updating, wherein the obtained clustering center is an optimal group of clustering centers in each subspace;
(5) Constructing a code point set: the cluster center set in each subspace is a code point set corresponding to the code fragment data set;
Step 4-2: in the code segment feature vector data quantization stage, each code segment feature vector v C is quantized into a short vector q c with M dimensions by using the code point set obtained in the step 4-1;
Obtaining M segments of low-dimensional subvectors v Ci generated in the step 4-1 after feature vector segmentation, wherein the quantization process is to replace the subvector v Ci stored in the ith subspace with the center code point number of the class closest to the corresponding partition, and the low-dimensional subvector v Ci is represented by a symbol q ci after quantization, v Ci∈vC,qci∈qc;
All the high-dimensional vectors of the codes need to be quantized into low-dimensional vector representation at the stage, and the quantized code vector dataset is represented as Q C,qc∈QC;
step 5: preprocessing natural language query input by a user to obtain a natural language sequence Q;
Performing word segmentation on natural language query input by a user by using a word segmentation tool, and removing special characters and redundant fields contained in the query to obtain a natural language sequence Q;
Step 6: constructing a natural language query vector; converting the natural language sequence Q obtained in the step 5 into a natural language query vector D Q by using a vector embedding mode by using a natural language feature extraction method;
Step 7: carrying out dimension segmentation on the natural language query vector; performing dimension segmentation on the natural language query vector obtained in the step 6 by using a method of performing dimension segmentation on the code segment high-dimension feature vector, namely performing vector dimension segmentation on the natural language query vector D Q, and dividing the natural language query vector D Q into M segments of sub-vectors, wherein the M segments of sub-vectors are represented as [ D Q1,DQ2,...,DQM ];
Step 8: finding the code segment most similar to the natural language query vector obtained in the step 7 in the code database, and returning the result; the process is executed in two stages, namely a query table constructing stage and a natural language query stage;
Step 8-1: executing a query table construction stage, namely constructing a query table for the query vectors, wherein the query table stores the distances between each query sub-vector and all clustering center points in the corresponding subspace; acquiring M segments of low-dimensional subvectors [ D Q1,DQ2,...,DQM ] of the natural language query vector generated in the step 7 and a code point set generated in the step 4; creating a lookup table for each query sub-vector D Qi, and aiming at the lookup table of the ith sub-vector of the natural language query vector, wherein the distance from each code point in the ith sub-space to the query sub-vector D Qi is stored, each query sub-vector needs to construct the lookup table, and all the lookup tables are spliced to obtain a lookup table for storing vector distances;
Step 8-2: executing a natural language query stage, obtaining a query table constructed in the step 8-1, and performing natural language query; when inquiring, comparing the natural language inquiry vector with all code segment characteristic vectors in the code database to obtain code segment search results, wherein the specific method comprises the following steps:
1) Obtaining a certain code segment feature vector v X in a code database, and extracting quantized representations [ q X1ID,qX2ID,...,qXM ID ] corresponding to M sub-vectors v X1,vX2,...,vXM of the vector, namely cluster center point codes nearest to the M sub-vectors;
2) Calculating the distance between the natural language query vector D Q and the code segment feature vector v X, namely calculating the sum of the distances between each query sub-vector D Qi and the code segment quantized representation q Xi ID, and when the distance between the ith query sub-vector and the ith code segment sub-vector quantized representation is calculated, acquiring the corresponding distance in a query table, and recording as dis i; calculating the distance between each section of natural language query subvector and each section of code subvector to obtain a vector distance list dis= { dis 1,dis2,...,disM };
3) Accumulating the distances in the vector distance list dis to obtain a distance dis X-Q between the natural language query vector D Q and the code segment feature vector v X;
Using the steps 1) -3) to calculate the distance between the natural language query vector D Q and the vectors corresponding to all codes in the code database, and storing the feature vectors V C-TopK of the K code fragments with the smallest distance as search results;
Step 9: executing a search result rearrangement stage, and carrying out secondary screening on search results returned by the inquiry stage; and 8, obtaining feature vectors V C-TopK of the top K code segments returned in the natural language query stage in the step, calculating cosine similarity between the natural language query vector and all vectors in V C-TopK, sequencing the feature vectors of the code segments from high to low according to cosine similarity results, and returning the top n code segments as code search results.
The beneficial effects of adopting above-mentioned technical scheme to produce lie in: the invention provides a product quantization-based efficient code searching method, which comprises the following steps of (1) carrying out dimension segmentation and clustering on high-dimensional vectors so as to quantize each high-dimensional vector into a low-dimensional vector. In the method, a high-dimensional vector is divided into a plurality of low-dimensional sub-vectors in a quantization stage, the plurality of low-dimensional sub-vectors are mapped into a plurality of subspaces, clustering training is carried out on the sub-vectors in each subspace, so that a group of optimal cluster center point code point representations are obtained in each subspace, the cluster center code points closest to the ion vector in each subspace are used for quantizing the segment of sub-vectors, and the plurality of sub-vectors are spliced to obtain the low-dimensional quantized representation of the high-dimensional vector. In quantizing the high-dimensional vector representation, there is no loss of information contained in the excessive vectors. When the vector is segmented, rich semantic and structural information in the source code vector is reserved to a great extent, and the condition that the code searching accuracy is reduced due to information loss is reduced;
(2) In the query stage, the distances between each query sub-vector and all clustering center code points in the corresponding subspace are stored in a manner of calculating a query table, and when the similarity between vectors is calculated, namely the sum of vector distances after the quantization of each natural language query sub-vector and each code segment sub-vector is calculated, the distances can be directly obtained in the query table, so that the calculation cost is reduced, the code searching speed is improved, and the code searching time is greatly shortened;
(3) In order to further improve the accuracy of code search results, a rearrangement mechanism is introduced to carry out secondary screening on the code search results returned by the query stage, namely, the cosine similarity degree between the natural language query vector and a plurality of code segment feature vectors returned by the query stage is recalculated, and the code segments corresponding to the code segment feature vectors of the n top ranking are used as code search final results according to the sequence from high to low of the results.
Compared with the existing code searching method, the method can greatly improve the code searching speed under the condition of not sacrificing excessive accuracy of the baseline code searching model. In the method, the high-dimensional vector segmentation and clustering method is used, the problems of information loss, high calculation cost and the like possibly generated when the high-dimensional vectors are clustered directly are solved, and when the distances between the natural language query vector and all code segment vectors in a code library are calculated, the distances between the quantized query vector and the quantized code segment feature vectors are calculated, so that the problem of high calculation cost when the distances between the high-dimensional vectors are calculated is solved.
Drawings
FIG. 1 is a flow chart of a method for efficient code search based on product quantization according to an embodiment of the present invention;
FIG. 2 is a diagram of a set of code segments and information examples preprocessed by natural language description according to an embodiment of the present invention;
FIG. 3 is a flowchart of a training phase of feature vector data of a code segment according to an embodiment of the present invention;
FIG. 4 is a flow chart of a natural language query phase provided by an embodiment of the present invention;
FIG. 5 is a flowchart of a rearrangement phase according to an embodiment of the present invention;
FIG. 6 is a graph of comparison results of search results of different code search techniques when different code search models are used as baseline models, according to an embodiment of the present invention, (a) is a change in accuracy of search results of applying different code search techniques to a baseline model of DeepCS, (b) is a change in accuracy of search results of applying different code search techniques to a baseline model of CodeBert, (c) is a change in accuracy of search results of applying different code search techniques to a baseline model of GraphCodeBert, and (d) is a change in accuracy of search results of applying different code search techniques to a baseline model of GSMM;
fig. 7 is a diagram illustrating code searching examples of applying a product quantization-based efficient code searching method to a baseline code searching model according to an embodiment of the present invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
In this embodiment, a product quantization-based efficient code search method, as shown in fig. 1, includes the following steps:
step 1: acquiring a code segment and natural language description corresponding to the code segment; constructing a code segment set codebase and a natural language description set descbase by collecting an original code segment data set and natural language descriptions corresponding to the code segments;
Step 2: preprocessing the code segments and the corresponding natural language descriptions obtained in the step 1, and extracting various characteristic information of the code segments;
Preprocessing the code segments and the natural language description, removing repeated code segments in the data set, and deleting the method body with annotation content less than the set word number; removing special characters and redundant fields in the natural language description of each code segment; extracting function method name information, API (application programming interface) information, token mark information and various characteristic information of a program dependency graph of the code segment;
The codes in the data set of code search are function levels, the method names of the functions can roughly describe the functions of one function, and are important characteristics in understanding code semantic information; for extracting the code API information, using DeepAPI extraction method, using Eclipse JDT compiler to analyze the source code file into abstract syntax tree, and using abstract syntax tree method of traversing method to extract the API information in the code segment; the token mark vocabulary of the code consists of keywords, operators and identifiers, but new identifier names are often customized due to programming habits of different programmers, so that in order to better extract token mark information, a keyword table is established, and a StanfordCore NLP method is used for word segmentation of Java code fragments by combining with related constraint rules such as grammar structure rules, type constraint rules and the like, so that code token mark information is obtained; the code program dependency graph comprises rich data dependency information and control dependency information, the tool TinyPDG is used for extracting node information and data flow information in the code aiming at the extraction of the program dependency graph, the code is converted into graph structure representation, and a depth-first traversal mode is used for generating a code program dependency graph sequence, so that code program dependency graph characteristic information is obtained;
In fig. 2, characteristic information and natural language description information of a set of code fragments and natural language description data thereof in a data set are shown, wherein in fig. 2 (a), a set of code fragments and natural language description thereof in the data set is shown, the natural language description is "Copies bytes from a large (over 2 GB) InputStream to an outputstream.", meaning that bytes are copied from a large input stream exceeding 2GB into an output stream, the function described by the code fragments is "copy data from one input stream into one output stream", and in fig. 2 (b), the preprocessed natural language description is shown, and method name information, API information, token mark information and program dependency graph information of the code fragments are shown;
Step 3: constructing a code segment feature vector; acquiring multiple items of code feature information and natural language description information generated in the step 2, converting the multiple items of code feature information into corresponding code feature sequences, learning semantic feature information and structural feature information in different feature sequences by using a deep learning model, mapping the code feature sequences into feature vectors, and fusing the multiple feature vectors to obtain feature vectors V C of code segments to obtain a code segment feature vector set V C;
Step 3-1: converting function method name information extracted from the code segment into a method name sequence MSeq, converting API (application programming interface) information extracted from the code segment into an API sequence APISeq, converting Token mark information extracted from the code segment into a Token mark sequence TSeq, converting a program dependency graph of the code segment into a corresponding graph sequence GPDGSeq, and performing word segmentation on natural language descriptions corresponding to the code segments obtained from the step 2 by using a word segmentation tool to obtain a natural language description sequence DSeq corresponding to each code segment;
Step 3-2: respectively learning various semantic features and structural features in the code segments by using 4 current representative code search models, fusing feature vectors of different features, and converting the code segments into corresponding high-dimensional feature vectors v C;
training a code search model by using the code segment feature vectors and the corresponding natural language description vectors, calculating a loss function, and optimizing model parameters to obtain a final code segment feature vector set;
The 4 code search models used in the present embodiment include a code search algorithm DeepCS model and a GSMM model based on multi-feature fusion, and a code search algorithm CodeBert model and a GraphCodeBert model based on a pre-training model;
Step 4: quantizing the high-dimensional feature vector V C of each code segment in the code segment feature vector set V C obtained in step 3 into a low-dimensional vector q C representation; the process needs to go through two stages, namely a code segment feature vector data training stage and a code segment feature vector data quantization stage;
Step 4-1: a schematic flow chart of the code segment feature vector data training phase is shown in fig. 3. In a code segment feature vector data training stage, embedding the code segment feature vector obtained in the step 3 into Cartesian products of M disjoint partitions, and quantizing each partition into K clusters, namely dividing each high-dimensional code segment feature vector v C into M low-dimensional sub-vector segments, wherein the sub-vector segments are respectively stored in different subspaces; for one code segment feature vector v C∈RMD, it is segmented into M equal length D sub-vectors, i.e., v C=[vc1,vC2,...,vCM ], where the i-th sub-vector v Ci of v c is stored in the i-th subspace; outputting the segmented code segment feature vector to the step 4-2;
K-means clustering is needed to be carried out on the sub-vectors in each subspace during training, a data set is continuously trained, so that classification parameters are optimized, and an optimal group of clustering center points in each subspace are obtained, and the optimal clustering center points in each subspace enable the square sum of the distances from each vector in the space to the nearest clustering center to the vector to be minimum; each class center point of the cluster is represented by an ID number, the center point is also called a code point, and the cluster center in each subspace is stored to form a code point set corresponding to the code fragment data set; in this embodiment, the number K of clusters is 256, and the corresponding ID number indicates a range of 0 to 255;
Outputting the obtained clustering center in each subspace to a code segment feature vector data quantization stage in the step 4-2 and a query stage in the step 8; the implementation details of constructing the code point set in this embodiment are as follows:
(1) Initializing a clustering center: k classes of clustering are carried out in each subspace, K clustering centers are initialized, and because the high-dimensional vector is divided into M segments of low-dimensional vectors when the high-dimensional vector is segmented, the class number K of the clustering is set to 256 corresponding to the M subspaces, and 256 classes of clustering are carried out in each subspace, 256 code points are arranged in each subspace, namely 256 8 possible product quantization low-dimensional vector codes are formed in total;
(2) Assigning sub-vector data: for a certain subspace, assigning all the subvectors in the subspace to the nearest cluster center according to the cluster center determined in the step (1);
(3) Training to update data centers: after sub-vector data are distributed to the clustering centers in each subspace, a trained loss function is calculated, the loss function represents the square sum of distances from all sub-vectors to the nearest clustering center, and after the loss function is calculated, the average value of all sub-vectors distributed to a certain clustering center is used for updating the clustering center;
the loss function formula is shown below;
Wherein C j represents the j-th cluster, x is a sub-vector data point belonging to the cluster, C j is the cluster center code point, and iix-C j2 is the square of euclidean distance, and since the sub-vectors are clustered into 256 types in each subspace, the value of j is from 0 to 255;
(4) Checking convergence: after each updating of the cluster center, if the reduction amplitude of the calculation loss function is lower than a preset threshold value, or the change between the current cluster center and the cluster center before updating is smaller than the preset threshold value, or the updating iteration number reaches one of three conditions of the preset maximum iteration number, stopping updating, wherein the obtained cluster center is the optimal cluster center group in each subspace;
(5) Constructing a code point set: storing the cluster centers in each subspace to form a code point set corresponding to the code fragment data set;
step 4-2: in the code segment feature vector data quantization stage, quantizing the code segment feature vector using the code point set obtained in the code segment feature vector data training stage in step 4-1;
In this process, each sample v C in the code segment feature vector obtained in step 3 is quantized into one short vector q c of M dimensions; obtaining M-segment low-dimensional subvectors v Ci, i=1, 2 after segmentation of the code segment feature vector generated in step 4-1, M, replacing the subvector v Ci stored in the i-th subspace with the nearest class center code point number in the corresponding partition, namely for the i-th partition (M partitions are all provided, and i epsilon {1,2, M }) there is one cluster center c ik∈RD for each partition (K epsilon {1,2,., K }) and if the i-th subvector v Ci of the vector v C is c ik from the nearest cluster center, the quantized value of v Ci is c ik, denoted by symbol q ci, wherein q ci∈qc; the formula for finding the nearest class center code point of the clustering sub-vector is as follows:
the formula shows that the value of q ci is the value of c when the objective function v Ci-c||2 is minimum, wherein centre i represents the set of all class center points in the ith subspace partition, v Ci is the ith subvector of the code fragment feature vector, and c is a certain class center point in the ith subspace partition; each element in the short vector of M dimension is the serial number of the nearest cluster center point of different sub-vectors in the corresponding subspace;
In the stage, all the high-dimensional characteristic vectors of the code segments are required to be quantized into low-dimensional vector representation, and the quantized code vector dataset vector is represented as Q C,qc∈QC; outputting the quantized code vector data set to the query stage in the step 8;
step 5: preprocessing natural language query input by a user to obtain a natural language sequence Q;
Performing word segmentation on the natural language query input by the user by using a word segmentation tool, removing special characters and redundant fields contained in the query to obtain a natural language sequence Q, and outputting the natural language sequence Q to a natural language query vector construction stage in the step 6;
Step 6: constructing a natural language query vector; converting the natural language sequence Q obtained in the step 5 into a natural language query vector D Q by using a vector embedding mode by using a natural language feature extraction method, and transmitting the natural language query vector D Q to a natural language query vector dimension segmentation stage in the step 7;
Step 7: carrying out dimension segmentation on the natural language query vector; performing dimension segmentation on the natural language query vector obtained in the step 6 by using a method of performing dimension segmentation on the code segment high-dimension feature vector, namely performing vector dimension segmentation on the natural language query vector D Q, dividing the natural language query vector D Q into M segments of sub-vectors, representing the M segments of sub-vectors as [ D Q1,DQ2,...,DQM ], and outputting a natural language query sub-vector [ D Q1,DQ2,...,DQM ] to a query stage of the step 8;
step 8: finding the code segment most similar to the natural language query vector obtained in the step 7 in the code database, and returning the result; the query phase is executed in two phases, namely a query table construction phase and a natural language query phase, and the flow diagram of the query phase is shown in fig. 4, and specific implementation details are as follows.
Step 8-1: in the stage of constructing a lookup table, a lookup table is required to be constructed for the query vector, and the distance between each query sub-vector and all clustering center points in the corresponding subspace is stored in the lookup table; acquiring M segments of low-dimensional subvectors [ D Q1,DQ2,...,DQM ] of the natural language query vector generated in the natural language query vector dimension segmentation stage in the step 7 and a code point set generated in the code segment feature vector data training stage in the step 4; for each query sub-vector D Qi, i=1, 2,., M creates a lookup table for the i-th sub-vector of the natural language query vector, wherein the distance from the 256 code points in the i-th sub-space to the query sub-vector D Qi is stored, each query sub-vector needs to construct a lookup table, and all the lookup tables are spliced to obtain a lookup table storing M-th 256 vector distances; outputting the lookup table to the natural language query stage of the step 8-2;
Step 8-2: in the natural language query stage, acquiring a constructed query table from the step 8-1, and performing natural language query; when inquiring, comparing the natural language inquiry vector with all code segment characteristic vectors in a code database to obtain code segment search results, wherein the specific method comprises the following steps:
1) Obtaining a certain code segment feature vector v X in a code database, and extracting quantized representations [ q X1ID,qX2ID,...,qXM ID ] corresponding to M sub-vectors v X1,vX2,...,vXM of the vector, namely cluster center point codes nearest to the M sub-vectors;
2) The distance between the natural language query vector D Q and the code segment feature vector v X is calculated, i.e., the respective query sub-vectors D Qi, i=1, 2,.. the sum of the distances between M, when calculating the distances of the i-th query sub-vector and the quantized representation of the i-th code segment sub-vector, the corresponding distances are obtained in a lookup table, noted as dis i; calculating the distance between each section of natural language query subvector and each section of code subvector to obtain a vector distance list dis= { dis 1,dis2,...,disM };
3) Accumulating the distances in the vector distance list dis to obtain a distance dis X-Q between the natural language query vector D Q and the code segment feature vector v X;
Using the steps 1) -3) to calculate the distance between the natural language query vector D Q and the vectors corresponding to all codes in the code database, and storing the feature vectors V C-TopK of the K code fragments with the smallest distance as search results; outputting the current search result V C-TopK to a search result rearrangement stage in the step 9;
Step 9: and (3) carrying out secondary screening on the search results returned by the query stage in the search result rearrangement stage, obtaining feature vectors V C-TopK of K code fragments before ranking returned by the natural language query stage in the step (8), calculating cosine similarity between the natural language query vector and all vectors in V C-TopK, sequencing the feature vectors of the code fragments from high to low according to the cosine similarity results, and returning n code fragments before ranking as the code search results.
The flow diagram of the rearrangement phase is shown in fig. 5. The formula for calculating cosine similarity between the natural language query vector and the code segment feature vector is as follows:
Wherein code represents a high-dimensional vector representation corresponding to the code segment, desc represents a high-dimensional vector representation corresponding to the natural language query, and cos (code) represents cosine similarity between the code segment feature vector and the natural language query vector.
In order to demonstrate the superiority of the code search method provided by the invention in improving the code search speed, the embodiment designs a plurality of comparison experiments to evaluate the performance of executing code search on the same data set by using different code search algorithms. The better the code searching performance, the less the code searching time, and the higher the code searching result accuracy.
The product quantization-based code search (PQ-CS) method of the present invention is experimentally applied to several of the most advanced and representative code search methods at present to show the superiority of PQ-CS in improving code search speed. In designing the comparison experiment, in addition to comparing the code search initial performance of the above 4 code search algorithms, a locality sensitive hashing LSH and a clustering classification algorithm K-means are applied to the above 4 baseline code search models, and the code search performance is analyzed to compare with the code search performance applied to the baseline code search model using PQ-CS. The local sensitive hash LSH contrast experiment is to map the high-dimensional vector representation of the code segment to the low-dimensional hash representation by using a hash function, map the natural language query vector to a hash code by using the hash function in the search stage, and return the result of the top 10 which is found in the hash representation of the code segment and has the most similar meaning with the natural language query hash representation as a code search result by calculating one by one. The clustering classification algorithm K-means comparison experiment is to classify the high-dimensional vectors of the code segments in advance, so that each vector is distributed into the class containing the cluster center point closest to the vector, record all the cluster center point vectors, compare the query vector with all the cluster center point vectors when code searching is executed, thereby finding the closest cluster center, further calculate the cosine distance between the vector corresponding to each code segment in the class and the query vector, and return the result of the ranking of top 10 as the code searching result.
In this embodiment, on CodeSearchNet datasets, the code search effectiveness of the baseline algorithm and the baseline algorithm applying the LSH, K-means and PQ-CS technologies are respectively evaluated, and indexes of the accuracy of the search results of the 4 evaluation code search models used in the experiment are sr@1 (success rate@1), sr@5 (success rate@5), sr@10 (success rate@10) and MRR (Mean Reciprocal Rank), and different technologies are applied to the index pairs in the 4 code search baseline model experiments as shown in table 1.
From the experimental results in table 1, it is shown that on the CodeSearchNet dataset, for DeepCS, codeBert, graphCodeBert, GSMM four baseline algorithms, the code search model using the suboptimal LSH method was respectively reduced by 36.9%,37.5%,36.2%,40.2% on the sr@1 evaluation index, respectively reduced by 27.9%,26.2%,23.9%,28.9%, on the sr@5 evaluation index, respectively reduced by 23.6%,21.8%,19.7%,24.4% on the sr@10 evaluation index, respectively reduced by 31.6%,31.7%,30.0%,33.9% on the MRR evaluation index, and the accuracy drop was excessive. Compared with the original baseline code search model, the code search model using the PQ-CS method is reduced by 5.10%,6.86%,5.68%,4.26% on the SR@1 evaluation index, 3.90%,3.55%,3.21%,2.98% on the SR@5 evaluation index, 2.99%,2.41%,2.36%,2.49% on the SR@10 evaluation index, 3.95%,5.12%,4.65%,3.70% on the MRR evaluation index, and the accuracy of the code search model is similar to that of the baseline code search model, namely, excessive search accuracy of the baseline code search model is not lost. Compared with a search model using an LSH method, the code search model using the PQ-CS provided by the invention can be respectively increased by 50.29%,48.90%,47.78% on an SR@1 evaluation index, can be respectively increased by 33.43%,30.74%,27.29%,36.45% on an SR@5 evaluation index, can be respectively increased by 26.89%,24.75%,21.51%,29.01% on an SR@10 evaluation index, and can be respectively increased by 40.50%,38.98%,36.24% and 45.87% on an MRR evaluation index, compared with the code search model using the PQ-CS method, the code search model using the PQ-CS method has higher search accuracy. According to index changes of various evaluation code search accuracy rates, it can be seen that the application of the PQ-CS technology to 4 kinds of baseline code search models can enable the models to maintain the accuracy rate of the baseline code search models to be 96.05% or more on various indexes, wherein the average accuracy rate of the baseline code search models can be 96.59% on the SR@5 index.
Table 1 code search result accuracy vs. results for different techniques applied to 4-term baseline model
The impact of different code search techniques on search result accuracy when using different code search models as baseline models is illustrated in fig. 6. Fig. 6 (a), 6 (b), 6 (c) and 6 (d) show the change of search result accuracy by applying different code search techniques when DeepCS, codeBert, graphCodeBert, GSMM is taken as a baseline model, wherein the change of search accuracy by using the PQ-CS technique on the baseline model, the LSH technique on the baseline model and the K-means technique on the baseline model is respectively from left to right, and different evaluation indexes in 4 are used to show the change of search accuracy of the different code search techniques, the circle represents the evaluation index sr@1, the cross represents the evaluation index sr@5, the triangle represents the evaluation index sr@10, and the square represents the evaluation index MRR. It can be seen from 4 graphs that, on each evaluation index, the accuracy of the search result of the code search model applying the PQ-CS technology is similar to that of the original code search model, the connecting line of the corresponding marks in the graphs is similar to the horizontal line, and the accuracy value corresponding to the marks in the graphs of the comparative experiments applying the other two technologies is obviously reduced compared with that of the baseline model. Through the experimental results, compared with a code search model using an LSH technology and a K-means technology, the code search model using the PQ-CS technology can be seen to have greatly improved search result accuracy and similar search accuracy to a baseline model.
In order to demonstrate the superiority of the PQ-CS in improving the code search speed, the code search speed is compared with the code search speed of the baseline model by the code search model applying the PQ-CS, and the index of the estimated model search speed is the code search time. Wherein the code search time is defined as the sum of the similarity calculation time of the code segment vector and the natural language query vector and the time of ranking the search results. The experimental results are shown in Table 2, which demonstrates the comparison of search times for the baseline model and the code search model using PQ-CS.
TABLE 2 comparison of search time for baseline model and code search model applying PQ-CS
As can be seen from the experimental results of the table, the application of the PQ-CS technology to DeepCS, codeBert, graphCodeBert, GSMM baseline code search models greatly reduces the code search time, reduces the code search time by about 80.0%, and greatly improves the code search speed.
The efficient code search model can greatly improve the code search speed and shorten the search time when the accuracy of the search result is similar to that of the original code search model. As can be seen from the experimental results, the application of the PQ-CS technology in the baseline code search model can greatly shorten the code search time, reduce the code search time by about 80.0% and improve the code search speed when the model keeps the accuracy of at least 96.05% of the original code search model.
An example of code search in which PQ-CS technology is applied to a GSMM code search baseline model is shown in fig. 7. The user query sentence "removes whitespace from strings" in the example is derived from a classical query sentence commonly used for Code Search in the Deep Code Search, the query sentence meaning is "delete space characters in character strings", code Search is performed, and a first-ranking Search result is obtained, the returned Code segment is shown in the figure, the function of the Code is to delete space characters in character strings, the semantic similarity with the query sentence "removes whitespace from strings" is achieved, and the effectiveness of applying the PQ-CS technology to the Search result in the Code Search model is proved.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions, which are defined by the scope of the appended claims.

Claims (10)

1.一种基于乘积量化的高效代码搜索方法,其特征在于:包括以下步骤:1. An efficient code search method based on product quantization, characterized in that it comprises the following steps: 步骤1:获取代码片段和代码片段对应的自然语言描述;通过采集原始代码片段数据集和代码片段对应的自然语言描述,构建代码片段集合和自然语言描述集合;Step 1: Obtain code snippets and natural language descriptions corresponding to the code snippets; by collecting the original code snippet dataset and the natural language descriptions corresponding to the code snippets, a code snippet set and a natural language description set are constructed; 步骤2:对获取的代码片段和对应的自然语言描述,进行预处理,并提取代码片段的多种特征信息;Step 2: Preprocess the acquired code snippets and corresponding natural language descriptions, and extract various feature information of the code snippets; 步骤3:构建代码片段特征向量;获取步骤2中生成的多项代码特征信息和自然语言描述信息,并将多项代码特征信息转换为对应的代码特征序列,使用深度学习模型学习不同特征序列中的语义特征信息和结构特征信息,将代码特征序列映射为特征向量,将多个特征向量融合得到代码片段的特征向量vC,进而得到代码片段特征向量集合VCStep 3: Construct a code snippet feature vector; obtain multiple pieces of code feature information and natural language description information generated in step 2, and convert the multiple pieces of code feature information into corresponding code feature sequences, use a deep learning model to learn the semantic feature information and structural feature information in different feature sequences, map the code feature sequences into feature vectors, fuse multiple feature vectors to obtain a feature vector v C of the code snippet, and then obtain a code snippet feature vector set V C ; 步骤4:将步骤3中获得的代码片段特征向量集合VC中每个代码片段的高维特征向量vC量化成低维向量qC表示;Step 4: quantize the high-dimensional feature vector v C of each code snippet in the code snippet feature vector set V C obtained in step 3 into a low-dimensional vector q C ; 步骤5:针对用户输入的自然语言查询进行预处理,得到自然语言序列Q;Step 5: Preprocess the natural language query input by the user to obtain a natural language sequence Q; 步骤6:自然语言查询向量构建;使用自然语言特征提取方法,将步骤5中得到的自然语言序列Q使用向量嵌入的方式转换为自然语言查询向量DQStep 6: Natural language query vector construction: Using the natural language feature extraction method, the natural language sequence Q obtained in step 5 is converted into a natural language query vector D Q using vector embedding; 步骤7:自然语言查询向量进行维度分段;使用对代码片段高维特征向量维度分段的方式对步骤6中得到的自然语言查询向量进行维度分段,即将自然语言查询向量DQ进行向量维度分段,分为M段子向量,表示为[DQ1,DQ2,…,DQM];Step 7: Dimension segmentation of the natural language query vector: Dimension segmentation of the natural language query vector obtained in step 6 is performed in the same way as dimension segmentation of the high-dimensional feature vector of the code snippet, that is, the natural language query vector D Q is dimensionally segmented into M sub-vectors, represented as [D Q1 ,D Q2 ,…,D QM ]; 步骤8:在代码数据库中查找到与从步骤7中获得的自然语言查询向量最为相似的代码片段,并将结果返回;Step 8: Find the code snippet that is most similar to the natural language query vector obtained in step 7 in the code database and return the result; 步骤9:执行搜索结果重排阶段,对查询阶段返回的搜索结果进行二次筛选;获取在步骤8中的自然语言查询阶段返回的排名前K个代码片段的特征向量,并计算自然语言查询向量与K个代码片段特征向量的余弦相似度,将代码片段特征向量按照余弦相似度结果进行从高到低排序,返回排名前n个代码片段作为代码搜索的结果。Step 9: Execute the search result re-ranking phase, and perform secondary screening on the search results returned in the query phase; obtain the feature vectors of the top K code snippets returned in the natural language query phase in step 8, and calculate the cosine similarity between the natural language query vector and the K code snippet feature vectors, sort the code snippet feature vectors from high to low according to the cosine similarity results, and return the top n code snippets as the code search results. 2.根据权利要求1所述的一种基于乘积量化的高效代码搜索方法,其特征在于:所述步骤2对代码片段和自然语言描述进行预处理,去除数据集中重复的代码片段,并将注释内容少于设定单词数的方法体删除;提取代码片段的函数方法名信息、API信息、Token令牌标记信息和程序依赖图多种特征信息。2. According to claim 1, an efficient code search method based on product quantization is characterized in that: step 2 preprocesses the code snippets and natural language descriptions, removes repeated code snippets in the data set, and deletes the method bodies whose comments are less than the set number of words; extracts the function method name information, API information, Token token tag information and program dependency graph of the code snippet. 3.根据权利要求2所述的一种基于乘积量化的高效代码搜索方法,其特征在于:所述步骤2采用驼峰命名规则将代码片段中的函数方法名拆分成多个单词的组合;将源代码文件解析成抽象语法树,并使用遍历方法体的抽象语法树方法提取代码片段中API信息;为了提取令牌标记信息,建立关键词表,并使用StanfordCore NLP方法与相关约束规则相结合来对代码片段分词,从而得到代码令牌标记信息;提取代码中的节点信息和数据流信息,将代码转换为图结构表示,并使用深度优先遍历的方式生成代码的程序依赖图序列,从而得到代码的程序依赖图特征信息。3. According to claim 2, an efficient code search method based on product quantization is characterized in that: the step 2 uses the camel case naming rule to split the function method name in the code snippet into a combination of multiple words; the source code file is parsed into an abstract syntax tree, and the API information in the code snippet is extracted using the abstract syntax tree method of traversing the method body; in order to extract token marking information, a keyword table is established, and the Stanford Core NLP method is combined with relevant constraint rules to segment the code snippet, thereby obtaining code token marking information; node information and data flow information in the code are extracted, the code is converted into a graph structure representation, and a program dependency graph sequence of the code is generated using a depth-first traversal method, thereby obtaining the program dependency graph feature information of the code. 4.根据权利要求3所述的一种基于乘积量化的高效代码搜索方法,其特征在于:所述步骤3的具体方法为:4. The efficient code search method based on product quantization according to claim 3 is characterized in that: the specific method of step 3 is: 步骤3-1:将代码片段中提取的函数方法名信息转换为方法名序列MSeq,将代码片段中提取的API信息转换为API序列APISeq、将代码片段中提取的Token令牌标记信息转换为Token令牌标记序列TSeq,将代码片段的程序依赖图转换为对应的图序列GPDGSeq,使用分词工具将从步骤2中获得的代码片段对应的自然语言描述进行分词处理,并去除查询中包含的特殊字符和冗余字段,得到每一个代码片段对应的自然语言描述序列DSeq;Step 3-1: Convert the function method name information extracted from the code snippet into the method name sequence MSeq, convert the API information extracted from the code snippet into the API sequence APISeq, convert the Token token information extracted from the code snippet into the Token token sequence TSeq, convert the program dependency graph of the code snippet into the corresponding graph sequence GPDGSeq, use the word segmentation tool to perform word segmentation on the natural language description corresponding to the code snippet obtained in step 2, and remove the special characters and redundant fields contained in the query to obtain the natural language description sequence DSeq corresponding to each code snippet; 步骤3-2:使用目前4种代码搜索模型分别学习代码片段中的各种语义特征和结构特征,融合不同特征的特征向量,将代码片段转化对应的高维特征向量vCStep 3-2: Use the current four code search models to learn various semantic features and structural features in the code snippet, fuse the feature vectors of different features, and transform the code snippet into the corresponding high-dimensional feature vector v C ; 在该步骤中需要使用代码片段特征向量和其对应的自然语言描述向量进行代码搜索模型的训练,计算损失函数,并优化模型参数,从而得到最终代码片段特征向量集合。In this step, the code snippet feature vector and its corresponding natural language description vector need to be used to train the code search model, calculate the loss function, and optimize the model parameters to obtain the final code snippet feature vector set. 5.根据权利要求4所述的一种基于乘积量化的高效代码搜索方法,其特征在于:所述步骤4包括两个阶段,分别是代码片段特征向量数据训练阶段和代码片段特征向量数据量化阶段;5. The efficient code search method based on product quantization according to claim 4 is characterized in that: the step 4 includes two stages, namely, a code fragment feature vector data training stage and a code fragment feature vector data quantization stage; 在代码片段特征向量数据训练阶段中,将步骤3中获得的每个代码片段特征向量vC均分为M个低维度的子向量段,分别存储在不同的子空间中;即对于一个代码片段特征向量vC∈RMD,其被分段为M个相同长度D的子向量,即vC=[vC1,vC2,…,vCM],其中,vC的第i个子向量vCi存储在第i个子空间中,i=1,2,…,M;训练时对每一个子空间中的子向量进行K-means聚类,通过不断训练数据集,以优化分类参数,得到每个子空间中最优的一组聚类中心点;聚类的类别中心点用一个ID号码表示,中心点也称作码点,各子空间中的聚类中心集合即为码点集;In the code snippet feature vector data training phase, each code snippet feature vector v C obtained in step 3 is evenly divided into M low-dimensional sub-vector segments, which are stored in different subspaces; that is, for a code snippet feature vector v C ∈ R MD , it is segmented into M sub-vectors of the same length D, that is, v C = [v C1 ,v C2 ,…,v CM ], where the i-th sub-vector v Ci of v C is stored in the i-th subspace, i = 1, 2,…, M; during training, K-means clustering is performed on the sub-vectors in each subspace, and the classification parameters are optimized by continuously training the data set to obtain the optimal set of cluster center points in each subspace; the cluster category center point is represented by an ID number, and the center point is also called a code point, and the set of cluster centers in each subspace is the code point set; 在代码片段特征向量数据量化阶段,使用代码片段特征向量数据训练阶段获得的码点集将每一个代码片段特征向量vC均量化成一个M维度的短向量qcIn the code snippet feature vector data quantization stage, each code snippet feature vector v C is quantized into an M-dimensional short vector q c using the code point set obtained in the code snippet feature vector data training stage; 获取代码片段特征向量数据训练阶段产生的代码片段特征向量分段后的M段低维度子向量vCi,量化过程是将存储在第i个子空间中的子向量vCi用其对应的分区中距离最近的类中心码点编号代替,低维度子向量vCi量化后用符号qci表示,vCi∈vC,qci∈qcObtain the M low-dimensional subvectors v Ci of the code snippet feature vector segmented in the code snippet feature vector data training phase. The quantization process is to replace the subvector v Ci stored in the i-th subspace with the nearest class center code point number in its corresponding partition. The low-dimensional subvector v Ci is quantized and represented by the symbol q ci , v Ci ∈v C , q ci ∈q c ; 在代码片段特征向量数据量化阶段需要将所有的代码高维向量均量化为低维向量表示,量化后的代码向量数据集表示为QC,qc∈QCIn the code snippet feature vector data quantization stage, all high-dimensional code vectors need to be quantized into low-dimensional vector representations. The quantized code vector data set is represented as Q C , q c ∈Q C . 6.根据权利要求5所述的一种基于乘积量化的高效代码搜索方法,其特征在于:所述代码片段特征向量数据训练阶段构建码点集的过程如下:6. The efficient code search method based on product quantization according to claim 5, characterized in that: the process of constructing the code point set in the code fragment feature vector data training phase is as follows: (1)初始化聚类中心:在各子空间中进行聚类,初始化K个聚类中心;(1) Initialize cluster centers: Perform clustering in each subspace and initialize K cluster centers; (2)分配子向量数据:针对于某一子空间,根据(1)中确定的聚类中心,将该子空间中所有的子向量分配给最近的聚类中心;(2) Assigning subvector data: For a certain subspace, according to the cluster center determined in (1), all subvectors in the subspace are assigned to the nearest cluster center; (3)训练以更新数据中心:计算损失函数,使用分配给某聚类中心的所有子向量的平均值更新聚类中心;(3) Training to update the data center: Calculate the loss function and update the cluster center using the average value of all subvectors assigned to a cluster center; (4)检查收敛情况:更新聚类中心后,若满足损失函数的下降幅度低于预设的阈值,或者聚类中心变化小于预设的阈值,或者更新迭代次数达到预设的最大迭代次数三个条件之一,则停止更新,得到的聚类中心即为每个子空间中最优的一组聚类中心;(4) Checking convergence: After updating the cluster centers, if the decrease in the loss function is lower than the preset threshold, or the change in the cluster centers is less than the preset threshold, or the number of update iterations reaches the preset maximum number of iterations, then stop updating. The obtained cluster centers are the optimal set of cluster centers in each subspace. (5)构建码点集:每个子空间中的聚类中心集合即为代码片段数据集对应的码点集。(5) Constructing code point sets: The set of cluster centers in each subspace is the code point set corresponding to the code snippet dataset. 7.根据权利要求6所述的一种基于乘积量化的高效代码搜索方法,其特征在于:所述步骤5使用分词工具将用户输入的自然语言查询进行分词处理,并去除查询中包含的特殊字符和冗余字段,得到自然语言序列Q。7. According to claim 6, an efficient code search method based on product quantization is characterized in that: step 5 uses a word segmentation tool to segment the natural language query input by the user, and removes special characters and redundant fields contained in the query to obtain a natural language sequence Q. 8.根据权利要求7所述的一种基于乘积量化的高效代码搜索方法,其特征在于:所述步骤8分两个阶段执行,分别是构建查询表阶段和自然语言查询阶段,具体为:8. The method of efficient code search based on product quantization according to claim 7 is characterized in that: step 8 is performed in two stages, namely, a query table construction stage and a natural language query stage, specifically: 步骤8-1:执行构建查询表阶段,为查询向量构建查询表,查询表中存放着各查询子向量与其对应子空间中所有聚类中心点的距离;Step 8-1: Execute the query table construction phase to construct a query table for the query vector. The query table stores the distances between each query subvector and all cluster center points in its corresponding subspace. 步骤8-2:执行自然语言查询阶段,获取步骤8-1构建的查询表,进行自然语言查询;查询时将自然语言查询向量与代码数据库中所有的代码片段特征向量相比较,获得代码片段搜索结果。Step 8-2: Execute the natural language query phase, obtain the query table constructed in step 8-1, and perform a natural language query; during the query, compare the natural language query vector with all the code snippet feature vectors in the code database to obtain the code snippet search results. 9.根据权利要求8所述的一种基于乘积量化的高效代码搜索方法,其特征在于:所述步骤8-1的具体方法为:9. The efficient code search method based on product quantization according to claim 8, characterized in that: the specific method of step 8-1 is: 获取步骤7生成的自然语言查询向量的M段低维子向量[DQ1,DQ2,…,DQM],以及步骤4生成的码点集;对每一个查询子向量DQi创建一个查询表,针对于自然语言查询向量的第i个子向量的查询表,其中存放的是第i个子空间中每个码点到查询子向量DQi的距离,每一个查询子向量均需要构建查询表,将所有查询表拼接,得到一个存放向量距离的查询表。Obtain M low-dimensional subvectors [D Q1 ,D Q2 ,…,D QM ] of the natural language query vector generated in step 7, and the code point set generated in step 4; create a query table for each query subvector D Qi , and a query table for the i-th subvector of the natural language query vector, which stores the distance from each code point in the i-th subspace to the query subvector D Qi . A query table needs to be constructed for each query subvector, and all query tables are concatenated to obtain a query table that stores vector distances. 10.根据权利要求9所述的一种基于乘积量化的高效代码搜索方法,其特征在于:所述步骤8-2的具体方法为:10. The efficient code search method based on product quantization according to claim 9, characterized in that: the specific method of step 8-2 is: 1)获取代码数据库中某一个代码片段特征向量vX,并提取该向量的M个子向量vX1,vX2,…,vXM对应的量化表示[qX1ID,qX2ID,…,qXMID],即距离M个子向量最近的聚类中心点编码;1) Obtain a code snippet feature vector v X in the code database, and extract the quantized representations [q X1 ID,q X2 ID,…,q XM ID] corresponding to the M sub-vectors v X1 ,v X2 ,…,v XM of the vector, that is, the cluster center code closest to the M sub-vectors; 2)计算自然语言查询向量DQ与代码片段特征向量vX之间距离,即计算各个查询子向量DQi与代码片段量化表示qXiID之间的距离和,当计算第i个查询子向量与第i个代码片段子向量量化表示的距离时,相应距离在查询表中获取,记为disi;计算每段自然语言查询子向量与每段代码子向量的距离,得到向量距离列表dis={dis1,dis2,…,disM};2) Calculate the distance between the natural language query vector D Q and the code snippet feature vector v X , that is, calculate the sum of the distances between each query subvector D Qi and the code snippet quantized representation q Xi ID. When calculating the distance between the i-th query subvector and the i-th code snippet subvector quantized representation, the corresponding distance is obtained in the query table and recorded as dis i ; calculate the distance between each natural language query subvector and each code subvector to obtain a vector distance list dis = {dis 1 , dis 2 , …, dis M }; 3)将向量距离列表dis中的距离累加,得到自然语言查询向量DQ与代码片段特征向量vX的距离disX-Q3) Accumulate the distances in the vector distance list dis to obtain the distance dis XQ between the natural language query vector D Q and the code snippet feature vector v X ; 使用上述步骤1)-3)将自然语言查询向量DQ与代码数据库中所有代码对应的向量均进行距离的计算,将距离最小的排名前K个代码片段的特征向量VC-TopK作为搜索结果存储下来。Using the above steps 1)-3), the distance between the natural language query vector D Q and the vectors corresponding to all codes in the code database is calculated, and the feature vectors V C-TopK of the top K code snippets with the smallest distance are stored as search results.
CN202410052881.5A 2024-01-15 2024-01-15 An efficient code search method based on product quantization Pending CN117951251A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410052881.5A CN117951251A (en) 2024-01-15 2024-01-15 An efficient code search method based on product quantization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410052881.5A CN117951251A (en) 2024-01-15 2024-01-15 An efficient code search method based on product quantization

Publications (1)

Publication Number Publication Date
CN117951251A true CN117951251A (en) 2024-04-30

Family

ID=90791730

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410052881.5A Pending CN117951251A (en) 2024-01-15 2024-01-15 An efficient code search method based on product quantization

Country Status (1)

Country Link
CN (1) CN117951251A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119201131A (en) * 2024-11-08 2024-12-27 金华八达集团有限公司 A method and system for iterative use of code snippet extraction based on parallelization
CN119829746A (en) * 2025-03-17 2025-04-15 浪潮通用软件有限公司 Code searching method, equipment and medium based on large model agent
CN119961381A (en) * 2025-04-09 2025-05-09 广东工业大学 Code search method, system and device based on large language model classification and prototype integration

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119201131A (en) * 2024-11-08 2024-12-27 金华八达集团有限公司 A method and system for iterative use of code snippet extraction based on parallelization
CN119829746A (en) * 2025-03-17 2025-04-15 浪潮通用软件有限公司 Code searching method, equipment and medium based on large model agent
CN119961381A (en) * 2025-04-09 2025-05-09 广东工业大学 Code search method, system and device based on large language model classification and prototype integration

Similar Documents

Publication Publication Date Title
CN110275936B (en) Similar legal case retrieval method based on self-coding neural network
CN111104510B (en) A Text Classification Training Sample Expansion Method Based on Word Embedding
CN105045875B (en) Personalized search and device
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN117951251A (en) An efficient code search method based on product quantization
US20210350125A1 (en) System for searching natural language documents
CN112306494A (en) Code classification and clustering method based on convolution and cyclic neural network
US20220004545A1 (en) Method of searching patent documents
CN119336900B (en) Retrieval optimization method based on hierarchical expert routing model and CoT reasoning
CN110633365A (en) A hierarchical multi-label text classification method and system based on word vectors
CN111538639B (en) Log analysis method
CN118278365A (en) Automatic generation method and device for scientific literature review
CN119739867A (en) Medical knowledge graph retrieval system and method based on potential relationship reasoning
CN112597285B (en) Man-machine interaction method and system based on knowledge graph
CN117453861A (en) Code search recommendation method and system based on contrastive learning and pre-training technology
CN119829764A (en) Text dividing method and device and electronic equipment
CN114610744A (en) Data query method and device and computer readable storage medium
CN114676346A (en) News event processing method, device, computer equipment and storage medium
CN115204519B (en) A method for predicting the quality level of domain patents by integrating knowledge information
CN112988952A (en) Multi-level-length text vector retrieval method and device and electronic equipment
CN117992572A (en) A code search system and method based on pre-training model
CN111930953A (en) Text attribute feature identification, classification and structure analysis method and device
CN120429311B (en) A document semantic search method and system based on elastic search
CN114610880B (en) Text classification method, system, electronic equipment and storage medium
CN120632110B (en) Method, device, equipment and readable storage medium for processing archive task

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination