WO2025245186A1

WO2025245186A1 - Querying unstructured databases

Info

Publication number: WO2025245186A1
Application number: PCT/US2025/030314
Authority: WO
Inventors: Bo Dai; Hanjun Dai; Dale Eric SCHUURMANS; Mengjiao YANG; Azade Nova; Phitchaya Mangpo PHOTHILIMTHANA; Pengcheng YIN; Charles Aloysius Sutton; Xingchen WAN; Yixin Wang
Original assignee: GDM Holding LLC
Current assignee: GDM Holding LLC
Priority date: 2024-05-22
Filing date: 2025-05-21
Publication date: 2025-11-27
Anticipated expiration: 2026-11-22

Abstract

A method of executing a database query on a database comprising a plurality of database data items, each database data item comprising one or more unstructured data values, the database query comprising a query condition associated with at least one unstructured data value of the one or more unstructured data values. The method comprises selecting a plurality of sample data items. Each sample data item is selected based upon a database data item embedding of at least one unstructured data value of the respective database data item. The plurality of sample data items is a proper subset of the plurality of database data items. The query condition is executed on each of the plurality of sample data items to generate respective query condition outputs for each of the plurality of sample data items. One or more database query outputs is generated responsive to the database query based upon the generated query condition outputs.

Description

Attorney Docket No.45288-0476WO1 QUERYING UNSTRUCTURED DATABASES BACKGROUND [0001] This specification relates to querying databases comprising unstructured data. Databases comprise real world data which often exists in unstructured formats such as images and text. Such databases may be queried using a number of techniques. However, querying databases in such a way can often be computationally expensive. SUMMARY [0002] This specification describes systems and methods implemented as computer programs on one or more computers in one or more locations that enables databases comprising unstructured data to be queried in a computationally efficient manner. Implementations of the described techniques use data embedding and data sampling using the data embeddings; some implementations use data clustering, either implicitly or explicitly. [0003] According to a first aspect there is provided a method of executing a database query on a database comprising a plurality of database data items, e.g. querying a relational database for relevant records. Each database data item comprises one or more unstructured data values. The database query comprises a query condition associated with at least one unstructured data value of the one or more unstructured data values. The method comprises selecting a plurality of sample data items. Each sample data item is selected based upon a database data item embedding of at least one unstructured data value of the respective database data item. The plurality of sample data items is a proper subset of the plurality of database data items. The method further comprises executing the query condition on each of the plurality of sample data items to generate respective query condition outputs for each of the plurality of sample data items. The method further comprises generating one or more database query outputs responsive to the database query based upon the generated query condition outputs. [0004] There is also provided a system comprising one or more computers and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective method of the first aspect. 1 69705877-1 Attorney Docket No.45288-0476WO1 [0005] There is also provided one or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the operations of the respective method of the first aspect. [0006] There is also provided a computer-readable medium storing a plurality of database data items, each database data item comprising one or more unstructured data values and at least one embedding of at least one unstructured data value of the respective database data item, wherein the embeddings are adapted for selection of a plurality of sample data items from the plurality of database data items for execution of a database query comprising a query condition associated with at least one unstructured data value of the one or more unstructured data values. In some implementations, the computer-readable medium is adapted for the selection of the plurality of sample data items from the plurality of database data items for execution of the database query comprising the query condition associated with the at least one unstructured data value of the one or more unstructured data values. In some implementations, the computer-readable medium is hardware storing the database of the first aspect. [0007] Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. [0008] Structured Query Language (SQL) is a specialized programming language designed for managing and manipulating relational database management systems. SQL is used to handle structured data in a database, where there are clear relations defined between different data variables. That is, it enables users to perform tasks such as querying the database. SQL operates through declarative statements, allowing users to specify what data they want to manipulate or retrieve. SQL is a powerful tool with wide applications in analyzing structured data. However, databases often include records comprising unstructured data such as images and text which are difficult to query using standard techniques (e.g. SQL). Thus, it is desirable to provide an efficient mechanism for querying unstructured data in a database. Described herein is an Unstructured Query Language (UQL) which provides additional functionality over SQL for querying a database for database data items comprising unstructured data values. [0009] Implementations of the described techniques provide unstructured data querying and analysis in a computationally efficient way. That is, implementations limit the number of executions of a query condition e.g. to a number of data items sampled from a larger population of database data items in a database. By using embeddings of unstructured data items in combination with sampling techniques, unstructured data querying and analysis can 2 69705877-1 Attorney Docket No.45288-0476WO1 be performed in a corresponding way to SQL for structured data in a computationally efficient way. In this way, the time and/or computational resource to complete a database query on unstructured data can be significantly reduced and even made possible where it is effectively not possible without such techniques due to the computational resource required to carry out the query without the described techniques. Implementations also provide a data structure comprising an embedding that provides functional data that allows searching to be efficiently performed. [0010] Large-scale models, for example the first machine learning model, implemented as neural networks can produce impressive results on a range of tasks, including query condition execution for example. However, implementations of some of these models, particularly Transformer-based models, can have more than a billion parameters and can require substantial computing resources, power, and time to process a network input to generate the network output. Sometimes such models can have can more than 10 billion or more than 100 billion parameters. If such models were used at scale, for example if such models were used to execute the query condition on each database data item in a database such as the database described herein such queries may be highly computational expensive or even effectively not possible to execute due to the computational resource required to do so. [0011] An additional consideration is the need to optimize the computing load between a front-end component, for example a client device, and a back-end component, for example a server. This problem is known as load balancing. The techniques described herein address this problem by providing database data items comprising embedding(s) adapted for selecting a sample of the database data items. That is, the computational load incurred by the front-end components is reduced by generating embeddings that are adapted for the selection prior, for example by the back-end components, to such selection, or query execution on those selected sample data items as described herein. Thus, the described techniques enable a beneficial distribution of computing load between a local, mobile computing device and a back-end server in a network. [0012] The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims. BRIEF DESCRIPTION OF THE DRAWINGS 3 69705877-1 Attorney Docket No.45288-0476WO1 [0013] FIG.1 shows an example system for executing a database query on a database comprising a plurality of database data items. [0014] FIG.2 shows a specific example of a database comprising a plurality of database data items and a database query comprising a query condition. [0015] FIG.3 shows an embedding space comprising a plurality of clusters each comprising a plurality of database data item embeddings. [0016] FIG.4 is a flow diagram of an example method for executing a database query on a database comprising a plurality of database data items. [0017] FIG.5A shows an example tokenizer pattern for unstructured query language queries. [0018] FIG.5B shows example unstructured query language grammar for unstructured query language queries. [0019] FIG.6A shows a first example algorithm for executing a database query on a database. [0020] FIG.6B shows a second example algorithm for executing a database query on a database. [0021] FIG.7A shows a first set of experimental results of executing a database query on a database in accordance with the techniques described herein. [0022] FIG.7B shows a second set of experimental results of executing a database query on a database in accordance with the techniques described herein. [0023] Like reference numbers and designations in the various drawings indicate like elements. DETAILED DESCRIPTION [0024] FIG.1 depicts a system 1 for executing a database query 150 on a database 100 comprising a plurality of database data items 110, 112, …, 114. A method of executing the database query 150 on the database 150, e.g. using the system 1, and more specific detail regarding the elements generally described in this part is presented below with reference to proceeding figures. [0025] Each database data item 110, 112, 114 (e.g. database record) of the database 100 comprises one or more unstructured data values 120, 122, …, 124. For example, with reference to FIG.2, a database record may comprise one or more data fields such as “movie”, “rating”, “review_text”, and “cover_image”. In this example, fields “movie” and “rating” may be considered structured data. That is, the data values in these fields belong to a predefined taxonomy, i.e. data values which are organized according to a specific system or hierarchy. Structured data may include numbers, string labels with a predefined vocabulary (e.g. categorical labels), datetime values, among other forms of data. Further in this example, 4 69705877-1 Attorney Docket No.45288-0476WO1 fields “review_text” and “cover_image” may be considered unstructured data. That is, the data values in these fields do not belong to a predefined taxonomy. Unstructured data may include text, images, videos, and audio among other forms of data. Further in this example, the database record may comprise structured data values “Interstellar” and “7” for the respective data fields “movie” and “rating”, and may further comprise unstructured data values “This movie was out of this world!” and “<JPG>” (i.e. image data) for the respective data fields “review_text” and “cover_image”. Generally, the database data item 110, 112, 114 represents a predefined unit of data in the database 100. [0026] The database query 150 comprises a query condition 152 associated with at least one unstructured data value of the one or more unstructured data values 120, 122, 124. For example, the database query 150 may be an Unstructured Query Language (UQL) query. A tokenizer pattern and UQL grammar for the UQL query is described below with reference to FIG.5A and FIG.5B respectively. In a specific example, the UQL query may include ‘ SELECT COUNT(*) as count FROM imdb_movie_reviews WHERE “the review sentiment is negative” ’. In this example, the query condition 152 may be ‘ WHERE “the review sentiment is negative” ’. The query condition 152 may be associated with a “review_text” field of the database data items 110, 112, 114. [0027] The system 1 further comprises, for each respective database data item 110, 112, …, 114 of the database 100, a respective database data item embedding 130, 132, …, 134 of at least one unstructured data value of the respective unstructured data values 120, 122, …, 124. As described in more detail below, the database data item embeddings 130, 132, 134 may be used to perform a selection of a plurality of sample data items 140. The plurality of sample data items 140 selected may be a proper subset 140 of the plurality of database data items 110, 112, …, 114, i.e. the proper subset 140 includes some, but not all, of the plurality of database data items 110, 112, …, 114 (or data items derived therefrom). For example, as depicted, the plurality of sample data items 140 selected includes database data item 2112. Whilst not depicted in FIG.1, the plurality of sample data items 140 may also include database data item 1110, but not database data item N 114 (i.e. in this example, the plurality of sample data items 140 is also a proper subset). In some specific examples, the selection 140 of the plurality of sample data items 140 may comprise obtaining a query condition embedding 154 (i.e. an embedding of the query condition 152) and selecting the plurality of sample data items 140 based upon a relationship 190 between the query condition embedding 154 and the database data item embeddings 130, 132, …, 134. In more specific examples, the relationship 190 between the query condition embedding 154 and the database data item 5 69705877-1 Attorney Docket No.45288-0476WO1 embeddings 130, 132, …, 134 may be defined by a second machine learning model 192 (e.g. an unsupervised machine learning model, such as a machine learning model specifically adapted to perform clustering using the database data item embeddings 130, 132, 134). In some examples, the relationship 190 may indicate a plurality of clusters (not depicted) to which different embeddings may belong, the plurality of clusters being determined using the second machine learning model 192 (e.g. a K-means clustering model) and the database data item embeddings 130, 132, …, 134. Clustering is described in more detail below with reference to FIG.3. [0028] To generate one or more database query outputs 180 (e.g. aggregated, such as a count responsive to the query 150; or non-aggregated, such as one or more database data items per se) responsive to the database query 150, the query condition 152 may be executed on each of the plurality of sample data items 140 to generate respective query conditions outputs 170 for each of the plurality of sample data items 140, e.g. including generating the query condition outputs 170 for sample data item 2112. The execution may occur in any suitable way and may, as depicted in FIG.1, occur using execution logic 160a. The execution logic 160a may be any hardware and/or software configured to perform the execution described in this part. The execution logic 160a may be part of the logic for operating and querying the database 100 (not depicted) or may not be part of it (i.e. if the execution is performed remotely from the database 100). In some specific examples, the execution using the execution logic 160a may comprise providing at least one unstructured data value 122 for each of the plurality of sample data items (e.g. sample data item 2112) as input to a first machine learning model 162 to generate the query condition outputs 170. The database query outputs 180 responsive to the database query 150 may be generated based upon the query condition outputs 170. The generation of the database query outputs 180 may, like the execution of the query condition 150, occur in any suitable way and may occur using execution logic 160b (i.e. analogous to, or the same as, the other execution logic 160a). The first machine learning model 162, the second machine learning model 192, the query condition outputs 170, and the database query outputs 180, along with other key elements described above in this part will be described in further detail below with reference to proceeding figures. [0029] FIG.2 shows a specific example 2 of the database 100 comprising the plurality of database data items 110, 112, …, 114 and the database query 150 comprising the query condition 152. As previously described and as presently depicted in FIG.2, a record or database data item 1110 may comprise structured data values “Interstellar” 210a and “7” 210b for the respective data fields “movie” 200a and “rating” 200b, and may further 6 69705877-1 Attorney Docket No.45288-0476WO1 comprise unstructured data values “This movie was out of this world!” 120a and “<JPG>” 120b (i.e. image data) for the respective data fields “review_text” 200c and “cover_image” 200d. In this example, the structured data values “Interstellar” 210a and “7” 210b may conform to the pre-defined taxonomy (e.g. the data field “rating” 200b may only accept data in the form of numeric values, i.e. structured data value “7” 210b, “5” 220b, …, “8” 230b). In this example, the unstructured data values “This movie was out of this world!” 120a and “<JPG>” 120b may not conform to the pre-defined taxonomy (e.g. the image data “<JPG>” is not structured in a format suitable for querying using a typical quantitative type of database access, such as a querying the database 100 using regular SQL). [0030] To further illustrate the structure of the database 100, the database data item 112, 114 of the database 100 may also comprise respective unstructured data values (i.e. “No purposeful plot” 122a and “<PNG>” 122b for database data item 2112; and “Bold, brash, exciting” 124a and “<JPG>” 124b for database data item N 114). In the specific example depicted in FIG.2, each database data item 110, 112, 114 comprises at least 2 structured data values corresponding to data fields 200 “movie” 200a and “rating” 200b and at least 2 unstructured data values corresponding to data fields 200 “review_text” 200c and “cover_image” 200d. The number of unstructured or structured data values per database data item 110, 112, 114 may vary in general. The method described herein however generally applies to database data items comprising at least one unstructured data value. A database 100 comprising database data items including at least one unstructured data value may be referred to as an “unstructured database”. [0031] The database query 150 in the specific example depicted with reference to FIG.2 is ‘ SELECT COUNT(*) as count FROM imdb_movie_reviews WHERE “the review sentiment is negative” ’ and the query condition 152 is ’ WHERE “the review sentiment is negative” ’. Other examples of the query condition 152 are envisaged such as ‘ GROUP BY “the reason why the review is positive” AS reason ’. The database query 150 is executed on the database (e.g. and on particular database data items 110, 112, 114 of the database 100), as indicated by the directed line connecting the database query 150 and the database 100, in any suitable way (e.g. using the execution logic 160a). [0032] FIG.3 shows an embedding space 3 comprising a plurality of clusters 300a, 300b, 300c, 300d each comprising a plurality of database data item embeddings (e.g. 130, 132, 134). As described above, each database data item embedding 130, 132, 134 is an embedding of at least one unstructured data value (e.g. 120a and 120b for database data item 1110; 122a and 122b for database data item 2112; 124a and 124b for database data item N 114). Each database data 7 69705877-1 Attorney Docket No.45288-0476WO1 item 110, 112, 144 may be associated with a cluster of the plurality of clusters 300a – 300d. For example, as depicted in FIG.3, the database data item embedding 130 for database data item 1110 belongs to a fourth cluster 300d. In this example, the database data item 1110 may be associated with the fourth cluster 300d. It follows that, in this example, the database data item 2112 may be associated with the first cluster 300a and the database data item 3114 may be associated with the second cluster 300b, i.e. because the database data item embeddings 132, 134 respectively belong to the first and second clusters 300a, 300b. The query condition embedding 154 previously described may also belong to one of the clusters 300a – 300d (e.g. if the query condition embedding 154 is an embedding in the same vector or embedding space 3 as the database data item embeddings 130, 132, …, 134). The query condition 150 may thus also be associated with a cluster. The embeddings may be n-dimensional, e.g. the embeddings each include numeric values representing a position of the embedding in one of a plurality of n-dimensions 320a – 320n of a vector or embedding space 3. For convenience and practicality, only two dimensions are depicted in FIG.3, i.e. dim1 and dimN. The clusters 300a, 300b, 300c, 300d are depicted as disjoint (i.e. the clusters do not share one or more of the same embeddings). As will be described in more detail below, the plurality of sample data items 140 may be selected based upon the clusters 300a, 300b, 300c, 300d, e.g. based upon the associations between the clusters 300a, 300b, 300c, 300d and respective database data items 110, 112, …, 114 of the database 100. [0033] FIG.4 is a flow diagram of an example method for executing the database query 150 on the database 100 comprising a plurality of database data items 110, 112, …, 114. For convenience, the proceeding method is described with reference to FIG.1 – FIG.3. [0034] At step 400, the method may comprise receiving the plurality of database data items 110, 112, …, 114 previously described. At step 410, the method may comprise receiving the database query 150 previously described. [0035] At step 420, the method may comprise selecting the plurality of sample data items 140, each sample data item 112 being selected based upon the database data item embedding 130, 132, 134 of at least one unstructured data value 120, 122, 124 of the respective database data item 110, 112, 114. For example, the sample data item 2112 may be selected for the plurality of sample data items 140 based upon the database data item embedding 132. The selection may use any suitable method for selecting sample data items 140 using an embedding, for example by using importance sampling. The embedding may be a vector representation of the at least one unstructured data value. For example, unstructured data in the form “This movie was out of this world!” 120a may be provided to a machine learning 8 69705877-1 Attorney Docket No.45288-0476WO1 model (e.g. a machine learning model the same as, or other than, the first and second machine learning models 162, 172), such as a language model, to output a vector representation for that string of text. With reference to FIG.2 and the unstructured data value 120a, it will be apparent that the vector representation of “This movie was out of this world!” in this specific example may be the database data item embedding 130 for the database data item 1110. In another example, the embedding may be for multiple unstructured data values of the database data item, such as an embedding of “This movie was out of this world!” 120a and “<JPG>” 120b. In this example, the embedding may be generated using a multi-modal machine learning model. The embedding may, for example, embed all unstructured data values of the respective database item 110, 112, 114 (e.g.120a, 120b for database data item 1110), or may embed one or more unstructured data values on which queries are to be performed (e.g. only embed “This movie was out of this world!” 120a and not “<JPG>” 120b, such as in the case where the query condition 152 is to be executed only on data values corresponding to the “review_text” data field 200c. It is envisaged that the embedding may be generated in any number of ways apparent to a person skilled in the art. [0036] The plurality of sample data items 140 is a proper subset of the plurality of database data items 110, 112, …, 114 in the database 100. That is, the plurality of sample data items 140 is a set of data items including some, but not all, of the database data items 110, 112, …, 114 of the database 100. For example, for a database 100 comprising 100 records, the plurality of sample data items 140 may include 10 of those 100 records. [0037] At step 430, the method further comprises executing the query condition 152 on each of the plurality of sample data items 140 to generate respective query condition outputs 170 for each of the plurality of sample data items 140. For example, the query condition 152 ‘ WHERE “the review sentiment is negative” ’ described above may be executed on a first sample data item 110 including [ “Interstellar”, “7”, “This movie was out of this world!”, <JPG> ] and a second sample data item 112 including [ “Glitter”, “5”, “No purposeful plot”, <PNG> ], e.g. if the plurality of sample data items 140 depicted in FIG.1 also included the database data item 1110. In this example, a query condition output for the first sample data item 110 may be 0 indicating a non-match, and a query condition output (i.e. one of the query condition outputs 170) for the second sample data item 112 may be 1 indicating a match, i.e. that the review sentiment of the second sample data 112 item is negative. In other examples, the query condition output may be a continuous value that may be subject to a threshold in order to determine whether the database data item (e.g.112) is a match for the query condition 152. In yet other examples, the query condition 152 may be an image. In some 9 69705877-1 Attorney Docket No.45288-0476WO1 examples, the first machine learning model 162 (e.g. a language model) may be used to execute the query condition 152. That is, a machine learning model as described herein may be used to generate the query condition outputs 170. [0038] At step 440, the method further comprises generating the one or more database query outputs 180 responsive to the database query 150 based upon the generated query condition outputs 170. For example, the query condition outputs 170 may be [1, 0, 1, 1, 0, 1] for database data items of the database 100 corresponding to indices [0, 1, 2, 3, 4, 5] indicating that database data items of the database 100 corresponding to indices [0, 2, 3, 5] have satisfied or conform to the query condition 152. In this example, the database query outputs 180 may be the database records for database data items of the database 100 corresponding to indices [0, 2, 3, 5] (e.g. including database data items 110, 112, 114 per se). In another example, the database query outputs 180 may be a count of 4 indicating the number of records that satisfy the query condition 152. It will be readily appreciated that the database query outputs 180 are envisaged as any number of aggregated and/or non-aggregated outputs or values, as described below. [0039] In some implementations, each database data item 110, 112, …, 114 is associated with a cluster of a plurality of clusters 300a, 300b, 300c, 300d. For example, a first database data item may be assigned to a first cluster “0” 300a, a second database data item may be assigned to a second cluster “1” 300b, and a third database data item may be assigned to the first cluster “0” 300a. In this example, the first and third database item may be considered to be part of the same cluster, or group, whilst the second database item may not. The clusters may be disjoint, i.e. the clusters include no database data items in common. For example, clusters 300a – 300d of FIG.3 are all depicted as being disjoint. The cluster associated with each database data item 110, 112, …, 114 may be determined based upon the database data item embedding 130, 132, …, 134 of the at least one unstructured data value of the database data item 110, 112, …, 114. For example, the database data item embedding 130, 132, 134 may be a vector representing the semantic meaning of the at least one unstructured data value. The vector may be compared with vectors representing the other database data items in order to determine corresponding clusters 300a – 300d for each database data item 110, 112, 114. The skilled person would readily appreciate a number of applicable techniques, e.g. K- means clustering, to perform clustering of the embeddings. [0040] As previously described, the plurality of sample data items 140 may be selected based upon the clusters 300a – 300d. For example, for each cluster 300a, 300b, 300c, 300d, a proper subset of the database data items associated with a given cluster may be selected. That is, if 10 69705877-1 Attorney Docket No.45288-0476WO1 the first cluster “0” 300a is associated with database data items corresponding to indices [0, 3, 4, 6, 7, 9], and the second cluster “1” 300b is associated with database data items corresponding to indices [1, 2, 5, 8], then the plurality of sample data items may be database data items corresponding to indices [1, 3, 7, 8]. It will be readily appreciated that the clusters 300a – 300d may be sampled in any suitable way, for example using importance sampling as mentioned above, or random sampling. [0041] In some implementations, executing the query condition 152 optionally comprises providing at least one unstructured data value for each of the plurality of sample data items 140 as input to the first machine learning model 162 to generate the query condition outputs 170 – i.e. at step 432 of FIG.4. For example, the database data item 112 may be the first sample data item as described above, [ “Glitter”, “5”, “No purposeful plot”, <PNG> ]. That is, the unstructured data value “No purposeful plot” may be provided as input to the first machine learning model 162. In some implementations, the first machine learning model 162 is a multi-modal machine learning model configured to receive as input at least two types of data selected from a group consisting of: text data, image data, video data, and audio data. That is, both unstructured data values “No purposeful plot” 122a and “<PNG>” 122b (i.e. image data) may be provided as input to a machine learning model that is configured to receive both text and image data. In some implementations, the unstructured data values (e.g. 122a and 122b) comprise text data, image data, video data, or audio data. [0042] In some implementations, the database query 150 comprises at least two query conditions 152. That is, the query conditions 152 may be ‘ SELECT “the reason why” ’ and ‘ WHERE “the review sentiment is negative” ‘, i.e. the two types of query condition 152 in this example are a select condition and a where condition. Group-by conditions are also contemplated, e.g. ‘ GROUP BY “the reason why the review is positive” AS reason ’. The query condition(s) 152, in general, may be any suitable condition for the database query 150 (e.g. any condition for generating the database query outputs 180). [0043] In some implementations, executing the query condition 152 is based upon the types of query condition 152 of the database query 150. A database query 150 may include a number of conditions, or clauses. The order and execution of these clauses may affect computational efficiency of execution. Thus, it may be advantageous to execute query conditions 152 based upon the types of query condition of the database query 150. For example, executing a where condition prior to a select condition has been shown in experiments to improve computational efficiency. Thus, a specification of the order of 11 69705877-1 Attorney Docket No.45288-0476WO1 execution of the types of query condition of the database query may be provided to improve query execution (e.g. it may form part of the execution logic 160a, 160b). [0044] In some implementations, generating the one or more database query outputs 180 responsive to the database query 150 based upon the generated query condition outputs 170 comprises applying a weight to each query condition output. [0045] In some implementations, the weight applied to each query condition output is based upon a size of the cluster associated with the respective sample data item. For example, the first cluster “0” 300a may be associated with 10 sample data items, whereas the second cluster “1” 300b may be associated with 100 sample data items (not depicted). In this example, the weight for sample data items associated with the first cluster “0” 300a may be an order of magnitude 10 times less than the weight for sample data items associated with the second cluster “1” 300b. It will be readily appreciated that the weights for different clusters may differ based upon any suitable approach. In some implementations, the weight applied to each query condition output is based upon a number of sample database data items associated with the same cluster as the sample data item associated with the corresponding query condition output. In some implementations, the weight applied to each query condition output is normalised, for example by way of a normalised weight described below with respect to an estimator for an expectation. [0046] In some implementations, the one or more database query outputs 180 comprises an aggregated value for the plurality of database data items 110, 112, …, 114. In some implementations, the aggregated value is indicative of a count operation, a sum operation, or an average operation of the plurality of database data items 110, 112, …, 114. The aggregated value may therefore provide a value indicating a quantity associated with the database query 150 rather than providing a set of the database data items 110, 112, …, 114 that satisfy the database query 150 as output. [0047] In some implementations, generating the one or more database query outputs 180 comprises scaling the respective query condition outputs 170. For example, the database query outputs 180 may be [0, 1, 0, 1, 1, 0] for six respective database data items 110, 112, …, 114 (e.g. comprising a query condition output 170 for each). Each of these query condition outputs 170 may be scaled. For example, scaling may include multiplying the query condition outputs 170 each by a set value, e.g.10, resulting in database query outputs 180 of [0, 10, 0, 10, 10, 0]. Alternatively, the query condition outputs 170 may be scaled independently, e.g.5 for the first three query condition outputs 170, and 10 for the final three query condition outputs 170. The scaling may, for example, be based upon the number of samples of the 12 69705877-1 Attorney Docket No.45288-0476WO1 plurality of sample data items 140 relative to the size of the database 100 or a number of samples of the plurality of sample data items 140 relative to the size of the respective cluster. In this example, the query condition outputs 170 generated for the sample data item 2112 may be scaled by 10 because (i) the first cluster 300a is associated with 10 different database data items (i.e. via the 10 distinct database data item embeddings depicted as being part of the cluster 300a of FIG.3) and (ii) only one of the database data items 110, 112, …, 114 associated with the first cluster 300a is selected for the plurality of sample data items 140 (i.e. the database data item 2112, as depicted in FIG.1). In this way, representative database query outputs 180 may be acquired. [0048] In some implementations, generating the one or more database query outputs 180 responsive to the database query 150 based upon the generated query condition outputs 170 comprises applying an estimator for an expectation having the form: ^_{^^∊^^…ே^^^^^T୧, ^^^^^^^^^^ ≃ |S|} ^^ ^_^ _{∑ ^^^T୧, ^^^^^^^^^} _{^∊ௌ ^^^} where, ^^_{^∊^^…ே^} is a database data item i (e.g. data item 1110), S is the plurality of sample data items 140, |S| is a size (e.g. number) of the plurality of sample data items 140, |^^_^^^^| is the size of the cluster (e.g. number of database data items associated with the cluster) and I^c^{^}j^{^} ൌ c^{^}i^{^}^ is an indicator function returning 1 if the condition c^j^ ൌ c^i^ is true and 0 if the condition is false. It will be appreciated that the estimator for an expectation above provides a normalised weighted value for the query condition output 170 associated with a sample. [0049] In some implementations, the method optionally further comprises obtaining an embedding 154 of the query condition 152 (referred to herein as “query condition embedding”) – i.e. at step 422 of FIG.4. The plurality of sample data items 140 may optionally be selected based upon the relationship 190 between the embedding 154 of the query condition 152 and the database data item embeddings 110, 112, …, 114 – i.e. at step 424 of FIG.4. For example, an embedding representing the query condition 152, e.g. an embedding representing “the review sentiment is negative”, may be compared with the database data item embedding 130, i.e. an embedding representing the unstructured data value 120a “This movie was out of this world!”. It will be readily appreciated than such a relationship 190 may be expressed in any number of ways, such as an inner product or cosine similarity for the respective embeddings (i.e.130, 132, …, 134 and 154). In other examples, a 13 69705877-1 Attorney Docket No.45288-0476WO1 linear classifier may express the relationship 190 between input variables, i.e. the embeddings, and such a classifier may be used to select the sample data items 140 by providing the embeddings as input to the classifier to obtain an output. A linear classifier is a type of machine learning model that makes predictions based on a linear combination of input features and classifies data by drawing a linear decision boundary (i.e. hyperplane) that separates different classes within a dataset. Additionally or alternatively, the relationship 190 may be defined by a logistic regression function. Logistic regression is a specific type of linear classifier that may be used for binary classification tasks. In contrast to linear classifiers that output a continuous score, logistic regression applies a logistic function to the linear score to produce a probability value between 0 and 1. This probability reflects the likelihood of an input belonging to a given class. In yet other examples, a combination of techniques may be used, e.g. providing an inner product of the embeddings as input to a linear classifier. [0050] In some implementations, the method further comprises updating the relationship 190 between the embedding 154 of the query condition 152 and the database data item embeddings 130, 132, …, 134 based on the generated query condition outputs 170. As described, the relationship 190 may be expressed in any number of ways (e.g. a linear classifier). Thus, as query condition outputs 170 are generated, such outputs 170 may be provided as feedback data to the model (i.e. for updating). For example, if the query condition outputs 170 are further “review_text” for a sample data item of the plurality of sample data items 140, such text data may be embedded and provided as feedback to the linear classifier. Alternatively, if using inner product or cosine similarity for example, an average value for those metrics may be updated. In such implementations, a further plurality of sample data items (i.e. further to the plurality of sample data items 140) may be selected based upon the updated relationship 190. Selection may be as described above in relation to the relationship 190 prior to updating. In such implementations, the one or more database query outputs 180 are generated based upon the selected further plurality of sample data items (not depicted). Generating may be as described above in relation to generating the database query outputs 180 based on the plurality of sample data items 140 prior to updating the relationship 190. It is envisaged that such updating and selection is an iterative process and may undergo any number of iterations (i.e. selection, execution, and/or updating), although some steps may be omitted in some iterations. For example, the updating may be performed in only some iterations. 14 69705877-1 Attorney Docket No.45288-0476WO1 [0051] In some implementations, selecting a further plurality of sample data items comprises iteratively selecting a further plurality of sample data items until a stopping condition is met. For example, the plurality of sample data items 140 may initially include 100 sample data items. After a first iteration, a further plurality of sample data items may be selected. As a result, the further plurality of sample data items may increase to include 200 sample data items. The process may iterate until a stopping condition is reached when the iterative process may stop. [0052] In some implementations, the relationship 190 approximates the query condition 152 for the sample data items. That is, the relationship 190 may provide a computationally efficient way of estimating whether the plurality of sample data items 140 will satisfy the query condition 152 of the database query 150 upon execution. This provides a number of advantages as described herein. [0053] In some implementations, the stopping condition is based upon a predetermined computational resource. For example, the computational resource may be a number of tokens provided as input to a machine learning model (e.g. the first machine learning model 162) used to execute the query condition 152 on the plurality of sample data items 140. In another example, the computational resource may be a number of tokens generated as output by the machine learning model. As described above, executing the query condition 152 on each of the sample data items 140 to generate respective query condition outputs 170 may be based upon, i.e. using a machine learning model such as the first machine learning model 162. For example, a machine learning model such as a large language model may be used to generate the query condition outputs 170. Generating the query condition outputs 170 from a machine learning model may incur a quantifiable computational cost (e.g. a number of input tokens provided as input to the machine learning model). Thus, the stopping condition may be based upon a predetermined computational resource, for example a predetermined number of input or output tokens. That is, once a threshold number of tokens (e.g.10,000) is exceeded for the machine learning model, the stopping condition may be reached and the iterative process may stop. Alternatively, the predetermined computational resource may be a threshold amount of processing cycles. In alternative embodiments the stopping condition may, for example, be based upon a predetermined number samples to be processed or a predetermined number of output results to be provided. The general principle of the stopping condition may be to provide a mechanism to limit a computational cost incurred by using the machine learning model. 15 69705877-1 Attorney Docket No.45288-0476WO1 [0054] In some implementations, the relationship 190 between the embedding 154 of the query condition 152 and the database data item embeddings 130, 132, …, 134 is defined by the second machine learning model 192. Updating the relationship 190 may comprise training the second machine learning model 192 based on the generated query condition outputs 170. As described above, the relationship 190 may be expressed by, for example, a linear classifier. Alternatively, other machine learning models are envisaged (e.g. neural networks). Generally, any computational model that may express the relationship 190 between the embedding 154 of the query condition 152 and the database data item embeddings 130, 132, …, 134 is envisaged (e.g. a Gaussian distribution initially having the respective embedding’s inner product as its mean value). [0055] In some implementations, selecting the plurality of sample data items 140 based upon the relationship 190 and executing the query condition 152 on the selected plurality of sample data items 140 uses fewer computational resources than executing the query condition 152 on database data items 110, 112, …, 114. For example, determining the relationship 190 (e.g. the second machine learning model 192) for a particular data item may consume fewer computational resources than the execution of the query condition 152 for the same data item (e.g. using the first machine learning model 162). In some implementations, the number of parameters of a model (e.g. the second machine learning model 192) associated with determining the relationship 190 may be smaller than a number of parameters of a model (e.g. the first machine learning model 162) associated executing the query condition 152. [0056] In some implementations, the second machine learning model 192 is a pre-trained machine learning model trained on the unstructured data values 120, 122, …, 124 of the database data items 110, 112, …, 114 of the database 100. That is, the second machine learning model 192 may have been trained prior to use in the method described above and below. The second machine learning model 192 may be trained or used using a different computing resource to the computing resource used to execute the database query 150 (for example a server and client respectively). [0057] In some implementations, the plurality of sample data items 140 are selected based upon an argmax operation defined based upon the relationship 190 between the embedding 154 of the query condition 152 and the database data item embeddings 130, 132, …, 134. Arguments of the maxima (argmax) is a mathematical function used to determine the argument(s) (or input value) to a function (e.g. the relationship 190) that yields the maximum value for that given function. For example, if the relationship 190 between the embedding 154 of the query condition 152 and the database data item embeddings 130, 132, …, 134 is 69705877-1 Attorney Docket No.45288-0476WO1 expressed as a cosine similarity value (e.g.0.7) for the two embeddings, the plurality of sample data items 140 may be selected for those sample data items that maximize the cosine similarity metric across a given sample. In general, the plurality of sample data items 140 may be selected based upon the following equation: ^_{^௧ ൌ ^^^^^^^^^^^^^^∊^^…ே^ \ௌ ^ ^^^^^^ ^ ^௧,^} ^∊ௌ_^ where ^^_௧ is a batch of is an approximation of _{^^^T୧, ^^^^^^^^^ as} _{used to allow a degree of} exploration with regard to the sample data items returned as ^^_௧. It follows from the above description that S (i.e. the plurality of sample data items 140) may be updated according to _{^^ ← ^^ ∪ ^^௧. That is, during each iteration as described above the set of samples may be} updated. [0058] In some implementations, the database data item embedding 130, 132, 134 is an embedding of each of the one or more unstructured data values of the respective data item 110, 112, 114. For example, a database data item may comprise an unstructured data value for each of “review_text”, “reason”, and “cover_image”. In this example, such unstructured data may be “Very entertaining!”, “Fun to watch…”, and “<JPG>” (i.e. representing image data). In this example, the database data item embedding may be an embedding, such as a vector, representing the semantic meaning of “Very entertaining!”, “Fun to watch…”, and the image representing the cover image. [0059] In some implementations, the method further comprises outputting the one or more database query outputs 180. For example, the one or more database query outputs 180 may be transmitted to a client device (not depicted) for output or may be displayed. For example, the data may be transmitted over the internet or other suitable network to a client device of a user of the database 100. In some implementations, the client device may be used to output the database query outputs 180 to the user, for example, using a display and/or graphical user interface. In some implementations, the database query outputs 180 may be provided as output to the user as part of a chat application, search request, or recommendation system. For example, the database query outputs 180 may be integrated into a message provided to the user from a chatbot. In other examples, the database query outputs 180 may be integrated into search results provided to the user. In other examples, the database query outputs 180 may be used by a recommendation system to recommend items, e.g. videos, to the user. It will be appreciated that the database query 150 may also be received by the database 100 17 69705877-1 Attorney Docket No.45288-0476WO1 from a user, for example over the internet. In other examples, the database query 150 may be received from an application or system (e.g. a server fulfilling a request on behalf of a user). [0060] In some implementations, the query condition 152 may be generated based upon input from a user of the database 100. For example, the database query 150 may be received from a user that specified they would like to retrieve records where “the review sentiment is negative”. In general, the user may specify the whole or part of the database query 150 comprising the query condition 152. It will be readily appreciated that the database query 150 may be received from a user of the database 100 by any suitable means such as a client computer having a graphical user interface, a web browser, or an app through which a user can interact with the database 100. The query condition 152 may be generated based upon the user inputting text via a keyboard. Other such means for generation of the query condition 152 are envisaged (e.g. for a query condition 152 comprising image data, the user may provide the image data as input via a file upload function of a computer). [0061] In some implementations the database data item embeddings 130, 132, …, 134 are generated in a first computing environment. Selecting the plurality of sample data items 140, executing the query condition 152 on each of the plurality of sample data items 140 and generating the one or more database query outputs 180 may be performed in a second computing environment. The second computing environment may be computing resource constrained relative to the first computing environment. Generating the database data item embeddings 130, 132, …, 134 may be computationally expensive and generating them in a different computing environment to the computing environment in which queries are executed allows computational cost to be effectively moved from the query to a pre- processing step such that query execution is able to be performed more quickly and/or in a computationally limited environment. For example, the first computing environment may be a server and the second computing environment may be a client device with limited computational resources relative to the server. [0062] FIG.5A shows an example tokenizer or lexer pattern for unstructured query language queries (i.e. the database query 150 comprising the query condition 152 associated with at least one unstructured data value). Part of the execution of the database query 150 on the database 100 may include receiving the database query 150 (e.g. received as a text string input for execution by a user or automated system) and processing the database query 150 to identify one or more tokens. The tokens can include reserved tokens (i.e. tokens that are reserved for specific execution operations such as conditional or grouping operations on data). In general, the tokens identified determine the execution of the database query 150, i.e. how the execution 18 69705877-1 Attorney Docket No.45288-0476WO1 of the database query 150 is performed on the database 100, such as when executed using the execution logic 160a, 160b or other execution logic not depicted in FIG.1. For example, in the case a “WHERE” token is identified, execution of the database query 150 on the database 100 may include only returning database data items (i.e. via the database query outputs 180) where those database data items include unstructured data values matching the “WHERE” condition. The tokens identified from the database query 150 may, as part of the execution of the database query 150, be provided as input to a parser which analyzes the grammatical structure of the database query 150 with respect to the identified tokens, thus informing the execution of the database query 150. [0063] FIG.5B shows example unstructured query language grammar for unstructured query language queries. As previously described, as part of the execution of the database query 150, identified tokens may be provided as input to a parser which analyzes the grammatical structure of the database query 150 with respect to the identified tokens. This informs the execution of the database query 150. To ensure that unstructured query language queries are executed correctly, a set of valid UQL grammar as used in experiments is depicted in FIG.5B. The parser (not depicted) may, as part of the execution of the database query 150, determine whether the tokens identified in the database query 150 conform to the UQL grammar depicted in FIG.5B. For example, according to the UQL grammar provided in FIG.5B, a valid “uql_query” is defined by either (i) a “select_clause” followed by a “from_clause”; or (ii) a “select_clause” followed by a “from_clause” followed by a “optional_clause_combo”. Each element of the UQL grammar (e.g. “optional_clause_combo”) may be defined in terms of itself or in terms of other elements of the UQL grammar. For example, “optional_clause_combo” is defined by either (i) an “optional clause_combo” (i.e. itself) followed by an “optional_clause”; or (ii) an “optional_clause”. [0064] FIG.6A shows a first example algorithm 600 for executing the database query 150 on the database 100. FIG.6B shows a second example algorithm 602 for executing the database query 150 on the database 100. The first example algorithm 600 is for generated an aggregated type of the database query outputs 180 (e.g. where the database query outputs 180 includes a count for the number of database data items 110, 112, …, 114 of the database 100 conforming to the query condition 152). The second example algorithm 602 is for generating a non- aggregated type of the database query outputs 180 (e.g. where the database query outputs 180 include the database data items 110, 112, …, 114 per se of the database 100 conforming to the query condition 152). In both the first and second example algorithms 600, 602, an embedding of at least one of the unstructured data values 120, 122, …, 124 for each of the database data 19 69705877-1 Attorney Docket No.45288-0476WO1 items 110, 112, …, 114 is generated. In the first example algorithm 600, the embeddings are clustered into K disjoint clusters or “groups” (e.g.300a – 300d) and sampling is performed to generate the samples, i.e. the plurality of sample data items 140. As described above in more detail, an estimator function is then applied to generate the aggregated type of the database query outputs 180. In the second example algorithm 602, ^^^^^^ is maintained as an _{approximation of ^^^T୧, ^^^^^^^^^ and is initialized as a uniform distribution parameterized by 0} and 1. In some examples, ^^^{^}^^^{^} is the relationship 190 between the database data item embeddings 130, 132, …, 134 and the query condition embedding 154. At each iteration (i.e. each step t), the plurality of sample data items 140 is updated by determining ^^_௧ in accordance with the formula and as described in more detail above. At the next step of the second example algorithm 602, ^^^^^^ may be updated and either (i) the plurality of sample data items 140 (i.e. as iteratively updated) is returned as the non-aggregated type of the database query outputs 180 or (ii) the plurality of sample data items 140 continue to be updated (e.g. to return more database data items responsive to the database query 150). [0065] FIG.7A shows a first set 700 of experimental results of executing the database query 150 on the database 100 in accordance with the techniques described herein. FIG.7B shows a second set 702 of experimental results of executing the database query 150 on the database 100 in accordance with the techniques described herein. The first set 700 of experimental results were results for the methods of executing the database query 150 applying conditional aggregation (i.e. generating the aggregated type of the database query outputs 180). The second set 702 of the experimental results were results for the methods of executing the database query 150 applying semantic retrieval (i.e. generating the non-aggregated type of the database query outputs 180). In both the first set 700 and the second set 702, results for the method of executing the database query 150 of the database 100 described herein (i.e. results 704 and 706 for the first and second set 700, 702 respectively) improve upon the performance of prior methods of executing a database query on a database, i.e. as indicated in the other results 708, 710 for the first and second sets 700, 702 respectively. The results 704, 706 indicate the relative error and average monetary cost of executing each database query 150 on different benchmark evaluations (i.e. IMDB, ABCD, AirDialog, and Clevr). In both the first and second sets 700, 702 of results, the techniques described herein generally achieved a reduced error rate whilst maintaining the same monetary cost per database query 150 when compared with other methods of executing a database query on a database. A reduced error rate minimizes wasted processing cycles and computational resources that may otherwise be spent returning incorrect 20 69705877-1 Attorney Docket No.45288-0476WO1 database query outputs 180. The reduced error rate, as experimentally proven, may thus lead to improvements in computational efficiency, faster query execution times, and improved use of computational resources. [0066] The machine learning models as described herein may be neural networks. For example, the machine learning models may comprise a neural network having one or more (self-)attention layers, such as a Transformer neural network. The neural networks may be any of a variety of Transformer-based neural network architectures for example. Examples of such architectures include those described in J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. Training compute-optimal large language models, arXiv preprint arXiv:2203.15556, 2022; J.W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, H. F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, E. Rutherford, T. Hennigan, J. Menick, A. Cassirer, R. Powell, G. van den Driessche, L. A. Hendricks, M. Rauh, P. Huang, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Uesato, J. Mellor, I. Higgins, A. Creswell, N. McAleese, A.Wu, E. Elsen, S. M. Jayakumar, E. Buchatskaya, D. Budden, E. Sutherland, K. Simonyan, M. Paganini, L. Sifre, L. Martens, X. L. Li, A. Kuncoro, A. Nematzadeh, E. Gribovskaya, D. Donato, A. Lazaridou, A. Mensch, J. Lespiau, M. Tsimpoukelli, N. Grigorev, D. Fritz, T. Sottiaux, M. Pajarskas, T. Pohlen, Z. Gong, D. Toyama, C. de Masson d’Autume, Y. Li, T. Terzi, V. Mikulik, I. Babuschkin, A. Clark, D. de Las Casas, A. Guy, C. Jones, J. Bradbury, M. Johnson, B. A. Hechtman, L. Weidinger, I. Gabriel, W. S. Isaac, E. Lockhart, S. Osindero, L. Rimell, C. Dyer, O. Vinyals, K. Ayoub, J. Stanway, L. Bennett, D. Hassabis, K. Kavukcuoglu, and G. Irving. Scaling language models: Methods, analysis & insights from training gopher. CoRR, abs/2112.11446, 2021; Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019; Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. Towards a human-like open-domain chatbot. CoRR, abs/2001.09977, 2020; and Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020. [0067] Generally, however, the Transformer-based neural network includes a sequence of attention blocks, and, during the processing of a given input sequence, each attention block in the sequence receives a respective input hidden state for each input token in the given input 21 69705877-1 Attorney Docket No.45288-0476WO1 sequence. The attention block then updates each of the hidden states at least in part by applying self-attention to generate a respective output hidden state for each of the input tokens. The input hidden states for the first attention block are embeddings of the input tokens in the input sequence and the input hidden states for each subsequent attention block are the output hidden states generated by the preceding attention block. It will be readily appreciated that such neural networks having a Transformer-based architecture may be used to generate the embeddings 130, 132, …, 134, 154 as described herein, for example, by sampling the input hidden states for a given block. [0068] In some implementations, the machine learning models described herein are pre- trained, e.g., trained on a particular modeling task. For example, the machine learning models described herein, such as the first and second machine learning models 162, 192, may be language models, vision models, multi-modal models, or any other suitable type of machine learning model that has been trained prior to inference and is suitable for processing the database data items described herein. [0069] To illustrate, a system may pre-train a language model on a language modeling task, e.g., a task that requires predicting, given a current sequence of text tokens, the next token that follows the current sequence in the training data. As a particular example, the language model can be pre-trained on a maximum-likelihood objective on a large dataset of text, e.g., text that is publically available from the Internet or another text corpus. It will be readily appreciated that the machine learning models described herein may further be fine-tuned to a particular task. [0070] In some implementations, the machine learning models described herein are supervised or unsupervised machine learning models (e.g. the first machine learning model 162 may be a supervised machine learning model whereas the second machine learning model 192 may be an unsupervised machine learning model). [0071] A description of self-attention, as may be employed by some of the machine learning models described herein, now follows. [0072] A self-attention block, as referred to above, is a neural network layer that includes an attention mechanism that operates over the self-attention block input (or an input derived from the layer input) to generate the self-attention block output. A self-attention mechanism may be causally masked so that any given position in an input sequence does not attend over (e.g. use data from) any positions after the given position in the input sequence. There are many different possible attention mechanisms. Some examples of self-attention layers including attention mechanisms, are described in Vaswani et al. “Attention is all you need”, 22 69705877-1 Attorney Docket No.45288-0476WO1 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA; Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019; Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. Towards a human-like open-domain chatbot. CoRR, abs/2001.09977, 2020; and Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020. [0073] Generally, an attention mechanism maps a query and a set of key-value pairs to an output, where the query, keys, and values are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function, e.g. a dot product or scaled dot product, of the query with the corresponding key. [0074] Generally, a self-attention mechanism is configured to relate different positions in the same sequence to determine a transformed version of the sequence as an output. For example the attention layer input may comprise a vector for each element of the input sequence. These vectors provide an input to the self-attention mechanism and are used by the self- attention mechanism to determine a new representation of the same sequence for the attention layer output, which similarly comprises a vector for each element of the input sequence. An output of the self-attention mechanism may be used as the attention layer output, or it may be processed by one or more of feed-forward layers, skip connections, or normalization operations to provide the attention layer output. [0075] In some implementations the attention mechanism is configured to apply each of a query transformation e.g. defined by a matrix ^^^ொ, a key transformation e.g. defined by a matrix ^^^{^}, and a value transformation e.g. defined by a matrix ^^^{^}, to the attention layer _{input which is the input data X to the attention layer, to derive a query matrix ^^ ൌ ^^^^ொ that} _{includes a respective query for each vector in the input sequence, key matrix ^^ ൌ ^^^^^ that} _{includes a respective key for each vector in the input sequence, and value matrix ^^ ൌ ^^^^^} that includes a respective value for each vector in the input sequence, which are used determine an attended sequence for the output. For example, the attention mechanism may be a dot product attention mechanism applied by applying each query vector to each key 23 69705877-1 Attorney Docket No.45288-0476WO1 vector to determine respective weights for each value vector, then combining the value vectors using the respective weights to determine the self-attention layer output for each element of the input sequence. The self-attention layer output may be scaled by a scaling factor e.g. by the square root of the dimensions of the queries and keys, to implement scaled dot product attention. Thus, for example, an output of the attention mechanism may be d_{etermined as ^^^^^^^^^^^^^^ ^} ^^^^^{^} √^ௗ ^V where d is a dimension of the key (and value) vector. In mechanism be comprise an “additive attention” function using a feed-forward network with a hidden layer. The output of the attention mechanism may be further processed by one or more fully-connected, feed forward neural network layers. [0076] The attention mechanism may implement multi-head attention, that is, it may apply multiple different attention mechanisms in parallel. The outputs of these may then be combined, e.g. concatenated, with a learned linear transformation applied to reduce to the original dimensionality if necessary. [0077] As described above, the described methods, systems and data structures can be used to enable efficient query of databases using unstructured data. As described herein, the unstructured data may take various forms, including images, video, audio and text. It will be appreciated that by enabling search using unstructured data, a wide range of analysis and data processing tasks are enabled across data sets. [0078] In this specification, the term "configured" is used in relation to computing systems and environments, as well as computer program components. A computing system or environment is considered "configured" to perform specific operations or actions when it possesses the necessary software, firmware, hardware, or a combination thereof, enabling it to carry out those operations or actions during operation. For instance, configuring a system might involve installing a software library with specific algorithms, updating firmware with new instructions for handling data, or adding a hardware component for enhanced processing capabilities. Similarly, one or more computer programs are "configured" to perform particular operations or actions when they contain instructions that, upon execution by a computing device or hardware, cause the device to perform those intended operations or actions. [0079] The embodiments and functional operations described in this specification can be implemented in various forms, including digital electronic circuitry, software, firmware, computer hardware (encompassing the disclosed structures and their structural equivalents), or any combination thereof. The subject matter can be realized as one or more computer programs, 24 69705877-1 Attorney Docket No.45288-0476WO1 essentially modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by or to control the operation of a computing device or hardware. The storage medium can be a storage device such as a hard drive or solid-state drive (SSD), a storage medium, a random or serial access memory device, or a combination of these. Additionally or alternatively, the program instructions can be encoded on a transmitted signal, such as a machine-generated electrical, optical, or electromagnetic signal, designed to carry information for transmission to a receiving device or system for execution by a computing device or hardware. Furthermore, implementations may leverage emerging technologies like quantum computing or neuromorphic computing for specific applications, and may be deployed in distributed or cloud-based environments where components reside on different machines or within a cloud infrastructure. [0080] The term "computing device or hardware" refers to the physical components involved in data processing and encompasses all types of devices and machines used for this purpose. Examples include processors or processing units, computers, multiple processors or computers working together, graphics processing units (GPUs), tensor processing units (TPUs), and specialized processing hardware such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). In addition to hardware, a computing device or hardware may also include code that creates an execution environment for computer programs. This code can take the form of processor firmware, a protocol stack, a database management system, an operating system, or a combination of these elements. Embodiments may particularly benefit from utilizing the parallel processing capabilities of GPUs, in a General-Purpose computing on Graphics Processing Units (GPGPU) context, where code specifically designed for GPU execution, often called kernels or shaders, is employed. Similarly, TPUs excel at running optimized tensor operations crucial for many machine learning algorithms. By leveraging these accelerators and their specialized programming models, the system can achieve significant speedups and efficiency gains for tasks involving artificial intelligence and machine learning, particularly in areas such as computer vision, natural language processing, and robotics. [0081] The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution 25 69705877-1 Attorney Docket No.45288-0476WO1 environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. [0082] A computer program, also referred to as software, an application, a module, a script, code, or simply a program, can be written in any programming language, including compiled or interpreted languages, and declarative or procedural languages. It can be deployed in various forms, such as a standalone program, a module, a component, a subroutine, or any other unit suitable for use within a computing environment. A program may or may not correspond to a single file in a file system and can be stored in various ways. This includes being embedded within a file containing other programs or data (e.g., scripts within a markup language document), residing in a dedicated file, or distributed across multiple coordinated files (e.g., files storing modules, subprograms, or code segments). A computer program can be executed on a single computer or across multiple computers, whether located at a single site or distributed across multiple sites and interconnected through a data communication network. The specific implementation of the computer programs may involve a combination of traditional programming languages and specialized languages or libraries designed for GPU programming or TPU utilization, depending on the chosen hardware platform and desired performance characteristics. [0083] In this specification, the term “database” is used broadly to refer to any collection of data: the data does not need to be structured in any particular way, or structured at all, and it can be stored on storage devices in one or more locations. Thus, for example, the database described herein can include multiple collections of data, each of which may be organized and accessed differently. [0084] In this specification, the term "engine" broadly refers to a software-based system, subsystem, or process designed to perform one or more specific functions. An engine is typically implemented as one or more software modules or components installed on one or more computers, which can be located at a single site or distributed across multiple locations. In some instances, one or more dedicated computers may be used for a particular engine, while in other cases, multiple engines may operate concurrently on the same one or more computers. Examples of engine functions within the context of AI and machine learning could include data pre-processing and cleaning, feature engineering and extraction, model training and optimization, inference and prediction generation, and post-processing of results. The specific design and implementation of engines will depend on the overall architecture and the 26 69705877-1 Attorney Docket No.45288-0476WO1 distribution of computational tasks across various hardware components, including CPUs, GPUs, TPUs, and other specialized processors. [0085] The processes and logic flows described in this specification can be executed by one or more programmable computers running one or more computer programs to perform functions by operating on input data and generating output. Additionally, graphics processing units (GPUs) and tensor processing units (TPUs) can be utilized to enable concurrent execution of aspects of these processes and logic flows, significantly accelerating performance. This approach offers significant advantages for computationally intensive tasks often found in AI and machine learning applications, such as matrix multiplications, convolutions, and other operations that exhibit a high degree of parallelism. By leveraging the parallel processing capabilities of GPUs and TPUs, significant speedups and efficiency gains compared to relying solely on CPUs can be achieved. Alternatively or in combination with programmable computers and specialized processors, these processes and logic flows can also be implemented using specialized processing hardware, such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs), for even greater performance or energy efficiency in specific use cases. [0086] Computers capable of executing a computer program can be based on general-purpose microprocessors, special-purpose microprocessors, or a combination of both. They can also utilize any other type of central processing unit (CPU). Additionally, graphics processing units (GPUs), tensor processing units (TPUs), and other machine learning accelerators can be employed to enhance performance, particularly for tasks involving artificial intelligence and machine learning. These accelerators often work in conjunction with CPUs, handling specialized computations while the CPU manages overall system operations and other tasks. Typically, a CPU receives instructions and data from read-only memory (ROM), random access memory (RAM), or both. The essential elements of a computer include a CPU for executing instructions and one or more memory devices for storing instructions and data. The specific configuration of processing units and memory will depend on factors like the complexity of the AI model, the volume of data being processed, and the desired performance and latency requirements. Embodiments can be implemented on a wide range of computing platforms, from small embedded devices with limited resources to large-scale data center systems with high-performance computing capabilities. The system may include storage devices like hard drives, SSDs, or flash memory for persistent data storage. [0087] Computer-readable media suitable for storing computer program instructions and data encompass all forms of non-volatile memory, media, and memory devices. Examples include 27 69705877-1 Attorney Docket No.45288-0476WO1 semiconductor memory devices such as read-only memory (ROM), solid-state drives (SSDs), and flash memory devices; hard disk drives (HDDs); optical media; and optical discs such as CDs, DVDs, and Blu-ray discs. The specific type of computer-readable media used will depend on factors such as the size of the data, access speed requirements, cost considerations, and the desired level of portability or permanence. [0088] To facilitate user interaction, embodiments of the subject matter described in this specification can be implemented on a computing device equipped with a display device, such as a liquid crystal display (LCD) or an organic light-emitting diode (OLED) display, for presenting information to the user. Input can be provided by the user through various means, including a keyboard), touchscreens, voice commands, gesture recognition, or other input modalities depending on the specific device and application. Additional input methods can include acoustic, speech, or tactile input, while feedback to the user can take the form of visual, auditory, or tactile feedback. Furthermore, computers can interact with users by exchanging documents with a user's device or application. This can involve sending web content or data in response to requests or sending and receiving text messages or other forms of messages through mobile devices or messaging platforms. The selection of input and output modalities will depend on the specific application and the desired form of user interaction. [0089] Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, e.g., inference, workloads. [0090] Machine learning models can be implemented and deployed using machine learning frameworks, such as TensorFlow or JAX. These frameworks offer comprehensive tools and libraries that facilitate the development, training, and deployment of machine learning models. [0091] Embodiments of the subject matter described in this specification can be implemented within a computing system comprising one or more components, depending on the specific application and requirements. These may include a back-end component, such as a back-end server or cloud-based infrastructure; an optional middleware component, such as a middleware server or application programming interface (API), to facilitate communication and data exchange; and a front-end component, such as a client device with a user interface, a web browser, or an app, through which a user can interact with the implemented subject matter. For instance, the described functionality could be implemented solely on a client device (e.g., for on-device machine learning) or deployed as a combination of front-end and back-end components for more complex applications. These components, when present, can be 28 69705877-1 Attorney Docket No.45288-0476WO1 interconnected using any form or medium of digital data communication, such as a communication network like a local area network (LAN) or a wide area network (WAN) including the Internet. The specific system architecture and choice of components will depend on factors such as the scale of the application, the need for real-time processing, data security requirements, and the desired user experience. [0092] The computing system can include clients and servers that may be geographically separated and interact through a communication network. The specific type of network, such as a local area network (LAN), a wide area network (WAN), or the Internet, will depend on the reach and scale of the application. The client-server relationship is established through computer programs running on the respective computers and designed to communicate with each other using appropriate protocols. These protocols may include HTTP, TCP/IP, or other specialized protocols depending on the nature of the data being exchanged and the security requirements of the system. In certain embodiments, a server transmits data or instructions to a user's device, such as a computer, smartphone, or tablet, acting as a client. The client device can then process the received information, display results to the user, and potentially send data or feedback back to the server for further processing or storage. This allows for dynamic interactions between the user and the system, enabling a wide range of applications and functionalities. [0093] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination. [0094] Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and 29 69705877-1 Attorney Docket No.45288-0476WO1 components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products. [0095] Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. [0096] What is claimed is: 30 69705877-1

Claims

Attorney Docket No.45288-0476WO1 CLAIMS 1. A method of executing a database query on a database comprising a plurality of database data items, each database data item comprising one or more unstructured data values, the database query comprising a query condition associated with at least one unstructured data value of the one or more unstructured data values, the method comprising: selecting a plurality of sample data items, wherein each sample data item is selected based upon a database data item embedding of at least one unstructured data value of the respective database data item, wherein the plurality of sample data items is a proper subset of the plurality of database data items; executing the query condition on each of the plurality of sample data items to generate respective query condition outputs for each of the plurality of sample data items; and generating one or more database query outputs responsive to the database query based upon the generated query condition outputs. 2. The method according to claim 1, wherein each database data item is associated with a cluster of a plurality of clusters, the cluster associated with each database data item being determined based upon the database data item embedding of the at least one unstructured data value of the database data item, wherein the plurality of sample data items are selected based upon the clusters. 3. The method according to any preceding claim, wherein executing the query condition comprises providing at least one unstructured data value for each of the plurality of sample data items as input to a first machine learning model to generate the query condition outputs. 4. The method according to claim 3, wherein the first machine learning model is a multi- modal machine learning model configured to receive as input at least two types of data selected from a group consisting of: text data, image data, video data, and audio data. 5. The method according to any preceding claim, wherein the unstructured data values comprise text data, image data, video data, or audio data. 31 69705877-1 Attorney Docket No.45288-0476WO1 6. The method according to any preceding claim, wherein the database query comprises at least two query conditions, each query condition having an associated query condition type, wherein executing the query condition is based upon the types of query condition of the database query. 7. The method according to any preceding claim, wherein generating one or more database query outputs responsive to the database query based upon the generated query condition outputs comprises applying a weight to each query condition output. 8. The method according to claim 7 when dependent upon claim 2, wherein the weight applied to each query condition output is based upon a size of the cluster associated with the respective sample data item. 9. The method according to claim 7 or 8, wherein the weight applied to each query condition output is based upon a number of sample data items. 10. The method according to any one of claims 7 to 9, wherein the weight applied to each query condition output is normalised. 11. The method according to any preceding claim, wherein the one or more database query outputs comprises an aggregated value for the plurality of database data items. 12. The method according to claim 11, wherein the aggregated value is indicative of a count operation, a sum operation, or an average operation of the plurality of database data items. 13. The method according to any preceding claim, wherein generating one or more database query outputs comprises scaling the respective query condition outputs. 14. The method according to any preceding claim, wherein generating one or more database query outputs responsive to the database query based upon the generated query condition outputs comprises applying an estimator for an expectation having the form: ^_{^^^^^^^^^ ≃ |S|} ^^ ^_^ _{^^ ^^^ ^^^^^^^^^} 69705877-1 Attorney Docket No.45288-0476WO1 where, ^^_{^∊^^…ே^} is the estimator, ^^^^^^^^ is the query condition, T_i is a database data item i, S is a plurality of sample data items, |S| is a size of the plurality of sample data items, |^^_^^^^| is the the cluster, and I^c^j^ ൌ c^i^^ is an indicator function returning 1 if the condition c^j^ ൌ c^i^ is true and 0 if the condition is false. 15. The method according to any preceding claim, further comprising: obtaining an embedding of the query condition; wherein the plurality of sample data items are selected based upon a relationship between the embedding of the query condition and the database data item embeddings. 16. The method according to claim 15, further comprising: updating the relationship between the embedding of the query condition and the database data item embeddings based on the generated query condition outputs; selecting a further plurality of sample data items based upon the updated relationship; and generating the one or more database query outputs based upon the selected further plurality of sample data items. 17. The method according to claim 16, wherein selecting a further plurality of sample data items comprises iteratively selecting a further plurality of sample data items until a stopping condition is met. 18. The method according to any one of claims 15 to 17, wherein the relationship approximates the execution of the query condition for the sample data items. 19. The method according to claims 17 or 18, wherein the stopping condition is based upon a predetermined computational resource. 20. The method according to any one of claims 16 to 19, wherein the relationship between the embedding of the query condition and the database data item embeddings is defined by a second machine learning model, and wherein updating the relationship comprises training the second machine learning model based on the generated query condition outputs. 33 69705877-1 Attorney Docket No.45288-0476WO1 21. The method according to claims 15 to 20, wherein selecting the plurality of sample data items based upon the relationship uses fewer computational resources than executing the query condition on the plurality of sample data items. 22. The method according to claim 20 or 21, wherein the second machine learning model is a pre-trained machine learning model trained on the unstructured data values of the database data items of the database. 23. The method according to any one of claims 15 to 22, wherein the plurality of sample data items are selected based upon an argmax operation defined based upon the relationship between the embedding of the query condition and the database data item embeddings. 24. The method according to any preceding claim, wherein the database data item embedding is an embedding of each of the one or more unstructured data values of the respective database data item. 25. The method according to any preceding claim, further comprising outputting the one or more database query outputs. 26. The method according to any preceding claim, wherein the database data item embeddings are generated in a first computing environment, wherein selecting a plurality of sample data items, executing the query condition on each of the plurality of sample data items and generating one or more database query outputs is performed in a second computing environment, and wherein the second computing environment is computing resource constrained relative to the first computing environment. 27. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective method of any one of claims 1-26. 34 69705877-1 Attorney Docket No.45288-0476WO1 28. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the operations of the respective method of any one of claims 1-26. 29. A computer-readable medium storing a plurality of database data items, each database data item comprising one or more unstructured data values and at least one embedding of at least one unstructured data value of the respective database data item, wherein the embeddings are adapted for selection of a plurality of sample data items from the plurality of database data items for execution of a database query comprising a query condition associated with at least one unstructured data value of the one or more unstructured data values. 35 69705877-1