WO2025235713A1 - Systems and methods for federated data harmonization

Info

Publication number: WO2025235713A1
Application number: PCT/US2025/028319
Authority: WO
Inventors: Yuval Baror; Ittai DAYAN; Antoni Abella VENDRELL; Richard Han; Adrish SANNYASI; Daniel Feller
Original assignee: Rhino Federated Computing Inc
Current assignee: Rhino Federated Computing Inc
Priority date: 2024-05-08
Filing date: 2025-05-08
Publication date: 2025-11-13
Anticipated expiration: 2026-11-08

Abstract

Systems and methods for federated data harmonization can include a client agent residing on an edge server at a first site and being configured to access a first dataset; and a server accessible by the client agent and comprising instructions which, when executed by one or more processors, cause the server to perform a process. The process can be operable to receive an input format selection and an output format selection from a user device; receive a syntactic mapping definition from the user device; receive, from the user device, a selection of input datasets accessible by the edge server; and transmit the syntactic mapping definition and the selected input datasets to the client agent. The client agent, via the edge server, can execute data transformation code to apply the syntactic mapping definition and one or more semantic transformations to the selected input datasets to generate at least one output dataset.

Description

TITLE

SYSTEMS AND METHODS FOR FEDERATED DATA HARMONIZATION

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to U.S. Provisional Application No. 63/644,218, filed May 8, 2024, which is herein incorporated by reference in its entirety.

BACKGROUND OF THE DISCLOSURE

[0002] In the field of data management, particularly in healthcare, data harmonization is a process that involves the integration of diverse datasets. This process is often complex due to the heterogeneity of data sources, which can include electronic health records (EHRs) and data warehouses, among others. These data sources often store data in different formats and structures, making it challenging to integrate and analyze the data collectively.

[0003] Data extraction, transformation, and loading (ETL) is often a part of data harmonization. ETL processes involve extracting data from source systems, transforming the data into a suitable format for analysis, and loading the transformed data into a target data system. ETL processes are often used in conjunction with semantic and syntactic mapping to transform and load data into a target common data model. Semantic mapping is a technique used in data harmonization to map values from a source dataset to a target vocabulary and involves identifying equivalent or similar concepts between the source and target vocabularies. Syntactic mapping, also known as schema mapping, involves mapping table and column names from a source to a target data model. In some embodiments, syntactic mapping can include a set of rules to map input field(s) to output field(s) along with an optional list of transformations to apply during this mapping from inputs to outputs.

[0004] However, new interoperability standards for hospital and medical data, such as Fast Healthcare Interoperability Resources (FHIR) and the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), have proven difficult to implement effectively. While many hospitals are implementing these standards, implementations differ from institution to institution because each hospital tends to implement them to meet its own unique needs and information. Moreover, most ways to achieve both syntactic and semantic interoperability are currently highly labor-intensive, costly, and non-scalable, which is undesirable.

SUMMARY OF THE DISCLOSURE

[0005] According to one aspect of the present disclosure, a system for federated data harmonization can include a client agent residing on an edge server at a first site and being configured to access a first dataset; and a server accessible by the client agent and comprising instructions which, when executed by one or more processors, cause the server to perform a process. The process can be operable to receive an input format selection and an output format selection from a user device; receive a syntactic mapping definition from the user device; receive, from the user device, a selection of input datasets accessible by the edge server; and transmit the syntactic mapping definition and the selected input datasets to the client agent. The client agent, via the edge server, can execute data transformation code to apply the syntactic mapping definition and one or more semantic transformations to the selected input datasets to generate an output dataset.

[0006] In some embodiments, the server can be further operable to package the data transformation code object into a software container image. In some embodiments, receiving the input format selection can include receiving a custom textual format or a selection of a pre-defined format. In some embodiments, receiving the output format selection can include receiving a custom textual format or a selection of a pre-defined format. In some embodiments, receiving the syntactic mapping definition from the user device can include analyzing the input format selection and output format selection with a machine learning model; generating one or more recommended syntactic mapping definitions; causing the one or more recommended syntactic mapping definitions to be displayed on the user device; and receiving a selection of one of the one or more recommended syntactic mapping definitions from the user device.

[0007] In some embodiments, the server can be further operable to receive at least one semantic mapping definition comprising input data and a target vocabulary; communicate the at least one semantic mapping definition to the client agent; trigger an artificial intelligence (AI) model to be executed by the client agent to generate recommendations for the semantic mapping of the input data to target values in the target vocabulary; and generate a confidence value for each recommendation. In some embodiments, the server can be further operable to provide secure access to a second user device to view the generated semantic mapping recommendations and associated confidence values. In some embodiments, providing secure access to the second user device to view the generated semantic mapping recommendations can include establishing one or more encrypted channels between the second user device and the client agent. In some embodiments, providing secure access to the second user device to view the generated semantic mapping recommendations can enable the second user device to perform actions on the semantic mapping values. In some embodiments, executing the data transformation code can include applying the recommended semantic mappings to the specified input data to generate the output data.

[0008] In some embodiments, the client agent can be further operable to generate a predefined number of recommended semantic mapping definitions; cause the predefined number of recommended semantic mapping definitions to be displayed on the user device; and receive a selection of at least one of the predefined number of recommended semantic mapping definitions from the user device. In some embodiments, the server can be further operable to access a plurality of edge servers, the plurality of edge servers comprising the edge server; compile semantic mapping data from the plurality of edge servers; access the AI model stored in the server; and trigger a federated fine-tuning process of the AI model using semantic mapping data from the plurality of edge servers. In some embodiments, the federated fine-tuning process can include generating a different model for each edge server or semantic mapping on an edge server.

[0009] In some embodiments, the client agent can be further operable to access a structured data store; and communicate output datasets to the structured data store to be stored in the data store. In some embodiments, the server can be further operable to receive a definition of a custom vocabulary from the user device, including a set of values; and instruct the client agent to trigger the AI model to generate the semantic mapping recommendations limited to target values within the custom vocabulary. In some embodiments, the server can be further operable to execute a pre-processing service in the cloud that comprises one or more of a vector lookup database or a text-based filtering service for a specific vocabulary; and allow the client agent to access the pre-processing service as part of the generation of semantic mapping recommendations, wherein the pre-processing service augments the input to the AI model generating the mapping recommendations.

[0010] In some embodiments, the server can be further operable to receive a semantic mapping privacy configuration as part of the semantic mapping definition from the user device; and allow sharing parts of semantic mappings, as allowed by the privacy configuration, across client agents or across different data transformation efforts. In some embodiments, the server can be further operable to receive an existing semantic mapping and additional input data; update the semantic mapping to provide recommended mappings for values in the additional input data that were not present in the existing semantic mapping; and update the semantic mapping to adjust recommended mappings for values that were already present in the existing semantic mapping. In some embodiments, the server can be further operable to analyze the semantic mappings for different datasets; and provide analytics and quality metrics regarding the semantic mappings. In some embodiments, the server can be further operable to utilize one or more of statistical analysis, structured rules, or AI models to provide recommendations for improvements to the semantic mappings.

BRIEF DESCRIPTION OF THE FIGURES

[0011] FIG. 1 is a block diagram of an example system for federated data harmonization according to some embodiments of the present disclosure.

[0012] FIG. 2 is an example process for federated data harmonization according to some embodiments of the present disclosure.

[0013] FIG. 3 is an example process for federated fine-tuning according to some embodiments of the present disclosure.

[0014] FIG. 4 is an example server device that can be used within the system of FIG. 1 according to an embodiment of the present disclosure.

[0015] FIG. 5 is an example computing device that can be used within the system of FIG. 1 according to an embodiment of the present disclosure.

DESCRIPTION

[0016] The following detailed description is merely exemplary in nature and is not intended to limit the invention or the applications of its use.

[0017] Embodiments of the present disclosure relate to systems and methods for federated data harmonization, for example in the context of healthcare data. In particular, the disclosed systems and methods can combine the functionalities of a federated learning and computing architecture and transformer-based models (e.g., large language models (LLMs)). The disclosed system can generate data mappings at scale via distributed LLMs. Moreover, the system can leverage mapping data from within the federated architecture to fine-tune the LLM. In addition, the federated nature of the system can enable cross-functional collaboration in a secure manner.

[0018] In some embodiments, the disclosed systems and methods can provide several benefits. For instance, disclosed systems can enable efficient and accurate conversion of healthcare data from one format to another, facilitating interoperability between different healthcare systems. The disclosed systems can also allow for the preservation of context about data mapping decisions, thereby making the process reusable for future data transformations. Furthermore, the system can support the transformation of data to various target data models, such as OMOP or FHIR, thereby enhancing its versatility and applicability in different healthcare settings.

[0019] In some embodiments, syntactic mapping can include transforming the schema or format of data. For example, syntactic mapping can include taking Column 1 from Table A and mapping it to Column 37 in output Table B. In another example, syntactic mapping can include applying some logic while mapping is being performed, such as extracting the year from a date, truncating the first three characters of a field, concatenating a field, etc. In some embodiments, in addition to single input field to single output field mappings, the disclosed syntactic mappings can include mapping from multiple fields to a single field and/or from multiple fields to multiple fields.
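The rule-based syntactic mapping described in this paragraph can be illustrated with a minimal Python sketch. The rule structure (`MappingRule`), the function names, and the example field names (`dob`, `mrn`, `year_of_birth`, and so on) are hypothetical and are not taken from the disclosure; the sketch only shows one way single-field and multi-field rules with optional transformations might be represented.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, str]


class MappingRule:
    """Hypothetical rule: one or more input fields feed one output field."""

    def __init__(self, inputs: Sequence[str], output: str,
                 transform: Callable[..., str] = lambda value: value) -> None:
        self.inputs = list(inputs)
        self.output = output
        self.transform = transform


def extract_year(date_text: str) -> str:
    """Extract the year from an ISO-style date such as '1984-03-07'."""
    return date_text.split("-")[0]


def drop_prefix(value: str) -> str:
    """Truncate the first three characters of a field."""
    return value[3:]


def join_name(first: str, last: str) -> str:
    """Concatenate two input fields into a single output field."""
    return f"{first} {last}"


def apply_syntactic_mapping(row: Row, rules: List[MappingRule]) -> Row:
    """Build one output row by applying every rule to one input row."""
    out: Row = {}
    for rule in rules:
        args = [row[name] for name in rule.inputs]
        out[rule.output] = rule.transform(*args)
    return out


# Assumed example rules: two single-field mappings and one
# multiple-fields-to-single-field mapping.
RULES = [
    MappingRule(["dob"], "year_of_birth", extract_year),
    MappingRule(["mrn"], "person_source_value", drop_prefix),
    MappingRule(["first", "last"], "full_name", join_name),
]

example = apply_syntactic_mapping(
    {"dob": "1984-03-07", "mrn": "ID-00412", "first": "Ada", "last": "Lovelace"},
    RULES,
)
```

A multiple-fields-to-multiple-fields mapping could be expressed in the same way by listing several rules that share input fields.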

[0020] In some embodiments, semantic mapping can include transforming a single field from one encoding format to another encoding format. For example, semantic mapping can transform an input that is free text to a standard vocabulary, a structured format, etc.

[0021] FIG. 1 is a block diagram of an example system 100 for interactive distributed computing, according to some embodiments of the present disclosure. In some embodiments, the system 100 can operate within or in conjunction with (i.e., some components may overlap) the systems described in U.S. Application Nos. 18/180,710, titled “Systems and Methods for Using Federated Learning in Healthcare Model Development;” 18/180,713, titled “Systems and Methods for Using Distributed Computing in Healthcare Model Development;” and 18/633,013, titled “Systems and Methods for Interactive Distributed Computing,” all of which are herein incorporated by reference in their entireties. The system 100 includes a server 102 and an edge server 120, which are communicably coupled via an encrypted network tunnel 118. Although there is only one edge server 120 shown in FIG. 1, any number of edge servers 120 is possible. In some embodiments, the edge server 120 can be installed on-premises at each of one or more sites or other similar sites. In such embodiments, the edge server 120 is communicably coupled to the site data system 132, which can be a site’s internal computing system and/or network. In some embodiments, the edge server 120 can operate in the cloud, such as in a virtual private cloud being used by the associated institution. By virtue of its connection to the site’s data system 132, the edge server 120 can have access to the institution’s (e.g., research institution, hospital, etc.) personal data (with or without PII).

[0022] In some embodiments, the encrypted network tunnel 118 can include an ad-hoc Transport Layer Security (TLS) tunnel that encrypts all data sent over the connection. In some embodiments, the encrypted network tunnel 118 can be terminated upon termination of the software container code (i.e., either by a user or via keep-alive functionality). In some embodiments, client agents 128 are unable to communicate with each other via the encrypted network tunnel 118.

[0023] In some embodiments, the server 102 can be a cloud server and can include multiple services, each handling a specific subset of functionality. In some embodiments, the services can be included in a single monolith and may share a single database. In other embodiments, the services may rely on separate databases depending on their specific requirements and interdependences. In addition, the server 102 can be hosted on AWS, although this is merely exemplary in nature.

[0024] The server 102 includes a cloud database 104, an authentication and authorization service 106, a compute orchestration service 108, a proxy 112, a container registry 116, a secure access module 138, a review module 140, and an artificial intelligence (AI) model 136. The AI model 136 can be stored within the container registry 116 and the review module 140 can be a subsystem within the secure access module 138. In some embodiments, the cloud database 104 can be a Postgres database. The cloud database 104 is configured to store structured data that does not include any personal data, e.g., personally identifiable information (PII) or protected health information (PHI). In some embodiments, the cloud database 104 can include an AWS Aurora instance. The authentication and authorization service 106 is configured to perform various authentication and authorization procedures in order to allow users to access external data for computations, such as the data contained at a site data system 132. For example, the authentication and authorization service 106 can perform a process that authenticates the user, such as via a login process, AWS account validation, or another Single Sign On (SSO) account validation process, and validates authorization of the user to execute the interactive distributed code on the selected dataset, e.g., based on role-based access control (RBAC). The compute orchestration service 108 is configured to handle orchestration of distributed computing using container orchestration services such as Kubernetes. The server 102 also includes a web-based user interface (not shown) that functions as a gateway through which users interact with the system 100. In some embodiments, the web-based user interface can include an AWS EC2 server running nginx, and user interaction can be performed in JavaScript with a web framework like React, Vue, Angular, or other JavaScript frameworks. The server 102 also includes a REST API (not shown) that allows users to interact programmatically with the system 100. In some embodiments, a Software Development Kit (SDK) can be provided (e.g., in Python) to make programmatic interaction with server 102 easier.

[0025] In some embodiments, the cloud database 104 can have various entities created within it as part of the interactive computation procedures described herein, such as Activity, Code/Model, CodeRuns, and Dataset(s). A CodeRun can represent an interactive computing run. In addition, the server 102 can perform checks of the cloud database 104 to determine if a model is already running for a given site or group of users. In some embodiments, the server 102 can be configured to maintain the “liveliness” of a current interactive computing session and to provide details about the code (e.g., whether it is running). A record can be kept in the cloud database 104 to indicate that the session is active. In some embodiments, a Kubernetes cron job can be used to check the status of all running code sessions and terminate the expired ones (e.g., after fifteen minutes of being idle).
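The periodic expiry check mentioned at the end of this paragraph can be sketched as a small Python function that such a scheduled job might call. The function name, the shape of the session record (a mapping from a session id to its last-activity time), and the sorted return value are assumptions made for illustration; only the fifteen-minute idle limit comes from the example in the text.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# The "fifteen minutes of being idle" example from the text.
IDLE_LIMIT = timedelta(minutes=15)


def find_expired_sessions(last_active: Dict[str, datetime],
                          now: datetime,
                          idle_limit: timedelta = IDLE_LIMIT) -> List[str]:
    """Return the ids of sessions idle for at least idle_limit, sorted by id.

    last_active is a hypothetical stand-in for the session records kept in
    the cloud database; a real job would then terminate each returned session.
    """
    return sorted(
        session_id
        for session_id, seen in last_active.items()
        if now - seen >= idle_limit
    )
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the check deterministic and easy to test.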

[0026] In some embodiments, the server 102 can utilize the container registry 116 and a push command to provide a mechanism with which to upload software containers (e.g., a data transformation code object) to the cloud environment in a way that minimizes the data that is uploaded. This can be achieved by analyzing the different layers within the software container and only uploading layers that differ from the version in the cloud. In some embodiments, container input data can be deleted when the container finishes running. In some embodiments, container output data can be deleted after the container finishes running and any output dataset has been imported into the system. In some embodiments, container images can be purged after a time period, such as thirty days. In some embodiments, containers may not have access to any other files on the host operating system. In some embodiments, containers may not have access to communicate with other containers (e.g., databases). In some embodiments, containers may not be allowed to communicate with any external service over the Internet. In some embodiments, logs collected from the container can be cleaned before being sent back to the cloud, such as by redacting sensitive data, truncating log lines, and/or limiting the number of log lines sent back to the cloud. In some embodiments, logs collected from the container can be stored locally and not sent to the cloud at all. In some embodiments, there can be limitations on resources (e.g., CPU, GPU, memory, disk space, etc.) to limit resource abuse.
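The three log-cleaning steps named in this paragraph (redaction, truncation, and limiting the number of lines) can be sketched in Python as follows. The regular expression used to recognize sensitive data, the default limits, and the choice to keep the most recent lines are all assumptions for illustration; the disclosure does not specify how sensitive data is detected.

```python
import re
from typing import List

# Assumed stand-in for "sensitive data": a token shaped like a US Social
# Security number. A real deployment would need a far broader rule set.
SENSITIVE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def clean_logs(lines: List[str], max_line_len: int = 200,
               max_lines: int = 1000) -> List[str]:
    """Redact, truncate, and cap container log lines before they leave the site."""
    if max_lines <= 0:
        return []
    # Redact first so that truncation cannot cut a sensitive token in half
    # and leave part of it unredacted.
    redacted = [SENSITIVE.sub("[REDACTED]", line) for line in lines]
    truncated = [line[:max_line_len] for line in redacted]
    # Keep only the most recent lines (an assumed policy).
    return truncated[-max_lines:]
```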

[0027] The containers hosted in the container registry 116 and container orchestration services can be used for distributed computing. They can be used to transmit a container to a client agent (e.g., client agent 128), where it can be run on that site’s data. In some embodiments, this can be used to facilitate execution of AI models, e.g., transformer-based models and LLMs, federated fine-tuning of those models, and execution of data transformation code objects. In some embodiments, the container registry 116 may contain pre-built containers for common tasks like converting between common data types (e.g., DICOM to png), common tools (e.g., Jupyter Notebook), the latest LLMs for semantic or syntactic mapping, or general-purpose tasks. In some embodiments, the server 102 can be communicably coupled to a user device 114 via a proxy 112 such that this proxy can perform authentication and authorization using the authentication and authorization service 106. In some embodiments, the proxy can be an NGINX ingress controller that exposes an encrypted endpoint (e.g., user device 114) to the server 102 via a network load balancer (not shown in FIG. 1). In some embodiments, the network load balancer can be exposed publicly so that any user can connect to it. In some embodiments, a proxy pass record can be added to the proxy configuration for each client agent 128.

[0028] In some embodiments, the secure access module 138 and the review module 140 can operate in conjunction to provide secure access for a third party to access and review semantic mapping definitions generated at the edge server 120. As described herein, secure access can be a way for various users (e.g., a clinician at another hospital or research institution) to view certain semantic mapping definitions, data inputs, and code that resides outside the users’ network in a cloud-based UI in a secure manner. For example, a user generating ETL code via the disclosed system may not have explicit experience in a certain area of semantic mapping. The user therefore can utilize secure access to enable a clinician knowledgeable in the relevant area to review the generated mappings, data, and code objects and provide insights to the user. In some embodiments, the server 102 will connect to the necessary client agent 128 as a pseudo database so as not to save any dataset data out of site. In some embodiments, providing secure access can include initiating an encrypted channel (e.g., https) between a user device of the clinician and the relevant client agent 128. The encrypted channels can act as a proxy/passthrough that allows only verified requests to move between the user device and the client agent 128. In addition, in some embodiments, the review module 140 can be configured to receive feedback from the clinician via their own user device.

[0029] Moreover, the AI model 136 can be configured to analyze certain input and output format selections from the user device 114 with a machine learning model and generate one or more recommended syntactic mapping definitions, which can then be displayed on the user device 114.

[0030] Server device 102 may include any combination of one or more of web servers, mainframe computers, general-purpose computers, personal computers, or other types of computing devices. Server device 102 may represent distributed servers that are remotely located and communicate over a communications network, or over a dedicated network such as a local area network (LAN). Server device 102 may also include one or more back-end servers for carrying out one or more aspects of the present disclosure. In some embodiments, server device 102 may be the same as or similar to server device 400 described below in the context of FIG. 4. In some embodiments, server 102 can include a primary server and multiple nested secondary servers for additional deployments of server 102. This can enable greater scalability and deployability, as well as the ability to deploy asset-based severity scoring systems at a specific premises if requested by a user. In some embodiments, server device 102 may run a container orchestration service (e.g., Kubernetes) to manage the different services being run on it.

[0031] In some embodiments, the server 102 is communicably coupled to the edge server 120 via an encrypted network tunnel (e.g., https) that allows only verified requests to move between the server 102 and the edge server 120. In some embodiments, the server 102 can communicate with the client agent 128 via an asynchronous message queue mechanism or via a synchronous communication mechanism.

[0032] In some embodiments, the edge server 120 includes an edge file system 122, a software container 124, a client agent 128, and an edge database 130. In addition, the client agent 128 is connected to the site’s data system 132. In some embodiments, a network policy on the software container 124 blocks all outgoing internet access. Moreover, the edge server 120 can include an LLM module 134. It is important to note that the LLM module 134 is designated with a dashed line, meaning the module can be executed locally at the edge server 120 but may not necessarily be stored at the edge server 120.

[0033] In some embodiments, the LLM module 134 can include various transformer-based and classification-based models including, but not limited to, LLMs. In embodiments in which the LLM module 134 comprises an LLM, the LLM module 134 can include an LLM such as LLaMa-2, LLaMa-3, Gemma, Mistral, Mixtral, BART, and others. In some embodiments, an LLM can include various transformer-based models trained on vast corpora of data that utilize an underlying neural network. The LLM module 134 can receive an input, such as a user query and documentation and material that has been identified as being relevant to the query. The LLM module 134 is configured to analyze an input format selection, an output format selection, and a first dataset (e.g., a dataset accessible by the client agent 128) and generate a semantic mapping definition based on the analyzing. In some embodiments, the LLM module 134 can generate a predefined number of recommended semantic mapping definitions, which can be displayed on the user device 114.

[0034] The client agent 128 can also access the server 102 in a cloud environment for orchestration, which can include the cloud environment requesting that the agent 128 perform specific actions (e.g., initiate a code run, make specific datasets available to the container, analyze the outputs from the container, etc.). The client agent 128 can then perform the requested action and provide a response to the cloud. In some embodiments, the client agent 128 can include software installed in one or more of the following ways: (1) a cloud-based site-provisioned server in a virtual private cloud (VPC); (2) an on-site site-provisioned virtual machine (VM); or (3) an on-site server. In some embodiments, the minimum technical specifications of the client agent 128 can be pre-defined by the entity managing the cloud environment. In some embodiments, the client agent 128 can include a set of software containers with different components to be run and a management/orchestration layer (e.g., Kubernetes) for the containers.

[0035] The client agent 128 can be further communicably coupled to an edge database 130 and an edge file system 122, which may store copies of the data for the interactive distributed code to access. In some embodiments, the client agent 128 can import datasets from the site’s data systems to which it was provided access.

[0036] The system also can include a user device 114 that allows a user (e.g., project leader or researcher) to interface with the server 102 and edge server 120. A user device 114 can include one or more computing devices capable of receiving user input and/or communicating with the server 102. In some embodiments, a user device 114 can be representative of a computer system, such as a desktop or laptop computer. Alternatively, a user device 114 can be a device having computer functionality, such as a personal digital assistant (PDA), a mobile telephone, a smartphone, or other suitable device. In some embodiments, a user device 114 can be the same as or similar to the device 500 described below with respect to FIG. 5. In some embodiments, the system 100 can include any number of user devices 114.

[0037] FIG. 2 is an example process 200 for federated data harmonization according to some embodiments of the present disclosure. It is important to note that various blocks in process 200 can be optional and that they need not all be performed in every iteration of process 200. At blocks 202 and 204, the server 102, via the proxy 112, can receive an input format selection and an output format selection from the user device 114. In some embodiments, the input and output format selections can include custom selections, such as free text or other manually defined input formats and the output formats into which the input formats will be transformed. For example, a set of predefined vocabularies can be used, such as OMOP, ICD9, ICD10, SNOMED, etc.

[0038] In some embodiments, process 200 can then proceed to either block 206 or 208 or, alternatively, both blocks 206 and 208 can be performed. At block 206, the server 102 can receive a syntactic mapping definition from the user device 114. At block 208, the AI model 136 can analyze the input format selection and output format selection with a machine learning model and generate one or more recommended syntactic mapping definitions. In some embodiments, the one or more recommended syntactic mapping definitions can then be displayed on the user device 114 and the user, via the user device 114, can select one or more of such recommended syntactic mapping definitions. In some embodiments, a syntactic mapping can include a textual description, a source data format (e.g., custom, OMOP), and a target data format (e.g., OMOP, FHIR R4, FHIR CorelL).

[0039] In addition, process 200 can, after one or more of blocks 206 and 208 have been performed, proceed to one or more of blocks 210 and 212. At block 210, the server 102 can receive a semantic mapping definition from the user device 114. At block 212, the server 102 can analyze the syntactic mapping definition (either the received or generated definition) to generate a semantic mapping definition. At block 214, the LLM module 134 can generate one or more proposed semantic mapping definitions. For example, the LLM module 134 can execute a transformer-based model, such as an LLM, to analyze the input format selection, the output format selection, and the selected dataset, such as data contained within the site data system 132. The LLM module 134 can then generate a semantic mapping definition based on this analysis. In some embodiments, the LLM module 134 can generate a proposed list of semantic mapping definitions, such as a predefined number of options (e.g., five). In some embodiments, a semantic mapping definition can include a source value, the number of times the source value has appeared in the field, a predefined number of recommended target values (e.g., target concept names), and the confidence level of the recommendation for each of the recommended target values. In some embodiments, the generated semantic mapping definitions are stored on the client agent 128. In some embodiments, in addition to pre-filtering using vector databases and text filtering, various artificial intelligence components (e.g., embedding modules) can be used to perform such pre-filtering steps. For example, the results of the pre-filtering can be fed into a classification AI model that can provide the ranked recommendations and the confidence in each recommendation.
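The contents of a semantic mapping definition listed in this paragraph (source value, occurrence count, a predefined number of recommended target values, and a confidence for each) can be sketched as a Python data structure. The scoring function below is only a stand-in: it uses string similarity from the standard library's `difflib` in place of the LLM or classification model, which the disclosure does not specify in code. The names `SemanticMappingEntry`, `similarity`, and `recommend` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


@dataclass
class SemanticMappingEntry:
    source_value: str
    occurrences: int                           # times the value appears in the field
    recommendations: List[Tuple[str, float]]   # (target concept name, confidence)


def similarity(source: str, target: str) -> float:
    """Stand-in score in [0, 1]; NOT the AI model described in the text."""
    return SequenceMatcher(None, source.lower(), target.lower()).ratio()


def recommend(values: Sequence[str], vocabulary: Sequence[str],
              top_k: int = 5) -> List[SemanticMappingEntry]:
    """Produce one entry per distinct source value, most frequent first."""
    entries = []
    for source, count in Counter(values).most_common():
        scored = sorted(
            ((target, round(similarity(source, target), 3)) for target in vocabulary),
            key=lambda pair: pair[1],
            reverse=True,
        )
        entries.append(SemanticMappingEntry(source, count, scored[:top_k]))
    return entries
```

For instance, `recommend(["male", "male", "F"], ["MALE", "FEMALE"], top_k=2)` yields two entries, the first of which describes the source value `male` with an occurrence count of 2.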

[0040] At block 216, the secure access module 138 can provide secure viewing access to the semantic mapping definition generated via the LLM module 134 for a third party. In some embodiments, providing secure access can include connecting to the necessary client agent 128 as a pseudo database so as not to save any data out of site. In some embodiments, providing secure access can include initiating an encrypted channel (e.g., HTTPS) between a user device of the third party that will be reviewing the generated semantic mapping definition and the relevant client agent 128 (i.e., where the data is actually stored). The encrypted channel can act as a proxy/passthrough that allows only verified requests to move between that user device and the client agent 128. In addition, in some embodiments, the review module 140 can be configured to receive feedback from the reviewer (e.g., a clinician) via their own user device. In some embodiments, the secure access module 138 can allow the reviewer to filter values based on various fields. For example, numeric fields can have range filters, and other fields can have case-insensitive string searches. In addition, the review module 140 can allow the reviewer to sort values and input text directly to modify or add semantic mapping definitions. The reviewer can also have the option to apply semantic mappings to the relevant dataset and therefore review the transformation.
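
As a minimal sketch of the reviewer-side filtering and sorting described above, the following Python functions apply a range filter to a numeric field, a string search to a text field, and a sort on any field. The function names, row layout, and sample values are hypothetical, and the sketch assumes the string search is case-insensitive.

```python
from typing import Dict, List, Optional, Union

Row = Dict[str, Union[str, int, float]]


def filter_by_range(rows: List[Row], field_name: str,
                    minimum: Optional[float] = None,
                    maximum: Optional[float] = None) -> List[Row]:
    """Keep rows whose numeric field lies within [minimum, maximum]."""
    kept = []
    for row in rows:
        value = row[field_name]
        if minimum is not None and value < minimum:
            continue
        if maximum is not None and value > maximum:
            continue
        kept.append(row)
    return kept


def filter_by_text(rows: List[Row], field_name: str, query: str) -> List[Row]:
    """Keep rows whose field contains the query, ignoring letter case."""
    needle = query.casefold()
    return [row for row in rows if needle in str(row[field_name]).casefold()]


def sort_rows(rows: List[Row], field_name: str,
              descending: bool = False) -> List[Row]:
    """Return a new list sorted on the given field."""
    return sorted(rows, key=lambda row: row[field_name], reverse=descending)


# Illustrative, made-up mapping rows.
rows = [
    {"source_value": "HTN", "count": 412, "confidence": 0.91},
    {"source_value": "htn, essential", "count": 37, "confidence": 0.72},
    {"source_value": "DM2", "count": 150, "confidence": 0.40},
]
confident = filter_by_range(rows, "confidence", minimum=0.7)
matches = filter_by_text(rows, "source_value", "Htn")
by_count = sort_rows(rows, "count", descending=True)
```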

[0041] At block 218, the server 102 can receive a request to transform data from the user device 114. In some embodiments, this can include a request to trigger the data transformation on input datasets that the relevant client agent 128 has access to. At block 220, the client agent 128 performs the requested data transformation. In some embodiments, the server 102 can compile a code object based on the semantic mapping definitions and the syntactic mapping definition. In some embodiments, the server can package the code object into a software container image, which can be transmitted to respective edge servers 120 for execution. The client agent 128 at the edge server 120 can receive the software container image and execute the code object to perform the data transformation. In some embodiments, the output datasets (i.e., the datasets that have undergone data transformation) can be stored at the edge server 120 or communicated to an external clinical system within the same site.
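
For illustration only, the data transformation performed at the edge could resemble the following Python sketch. The disclosure does not specify how the mapping definitions are represented; here the syntactic mapping is assumed to be a source-field to target-field rename table and the semantic mapping is assumed to be a per-field value lookup. All names and sample values are hypothetical.

```python
from typing import Dict, List

Record = Dict[str, str]


def transform(records: List[Record],
              syntactic_mapping: Dict[str, str],
              semantic_mapping: Dict[str, Dict[str, str]]) -> List[Record]:
    """Apply a field rename (syntactic) and a value lookup (semantic).

    Fields absent from syntactic_mapping are dropped from the output.
    Values with no semantic mapping are passed through unchanged.
    """
    output = []
    for record in records:
        transformed: Record = {}
        for source_field, target_field in syntactic_mapping.items():
            if source_field not in record:
                continue
            value = record[source_field]
            value_map = semantic_mapping.get(source_field, {})
            transformed[target_field] = value_map.get(value, value)
        output.append(transformed)
    return output


# Illustrative, made-up input records and mappings.
input_records = [
    {"sex": "F", "dx": "HTN", "internal_id": "a1"},
    {"sex": "M", "dx": "DM2", "internal_id": "b2"},
]
syntactic = {"sex": "gender", "dx": "condition"}
semantic = {
    "sex": {"F": "FEMALE", "M": "MALE"},
    "dx": {"HTN": "Hypertensive disorder"},
}
output_records = transform(input_records, syntactic, semantic)
```

In this sketch the unmapped value "DM2" is carried over as-is; an implementation could instead flag such values for review.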

[0042] In some embodiments, the secure access procedures at block 216 can be performed in an alternate or additional manner. For example, in some embodiments, the secure access module 138 can also provide secure access to a third party to view and provide feedback on the generated code object.

[0043] FIG. 3 is an example process 300 for federated fine-tuning according to some embodiments of the present disclosure. At block 302, the server 102 can access a plurality of edge servers 120, each being associated with a separate site, having access to separate datasets (i.e., respective site data systems 132), and including a respective client agent 128. Moreover, each of the edge servers 120 can store semantic mapping definitions and other semantic mapping data that has been previously generated by their respective LLM modules 134. At block 304, the server 102 can compile the semantic mapping data from each of the plurality of edge servers. In some embodiments, the server 102 can compile the semantic mapping data via one or more containers. At block 306, the server 102 can trigger federated fine-tuning of the AI model 136 utilizing the semantic mappings stored on the edge servers 120 via the compute orchestration service 108. In some embodiments, within the context of system 100, the server 102 can operate as a federated server and the various client agents 128 can operate as federated learning clients. In some embodiments, the fine-tuned AI model is stored within Container Registry 116. In some embodiments, there is a local fine-tuning phase at each edge server 120 utilizing semantic mapping data and datasets stored on the Edge File System 122 and the Edge DB 130. In some embodiments, process 300 can include an additional step of model localization, where the server 102 sends a fine-tuned AI model to one or more edge servers 120 for localization at each server. Moreover, the resulting localized models could subsequently be sent back to the cloud to be stored there.
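
The disclosure does not name a particular aggregation rule for the federated fine-tuning. As one common choice, assumed here purely for illustration, the federated server could combine locally fine-tuned parameters with a weighted average (often called federated averaging), where each site is weighted by its number of local training examples. The function name and the flat-list parameter layout below are hypothetical.

```python
from typing import List, Sequence, Tuple


def federated_average(
    site_updates: Sequence[Tuple[List[float], int]]
) -> List[float]:
    """Weighted average of per-site parameter vectors.

    Each element of site_updates is (parameters, example_count), where
    parameters were fine-tuned locally at one edge server. Only these
    parameters, not the underlying site data, are passed to this function.
    """
    if not site_updates:
        raise ValueError("at least one site update is required")
    total_examples = sum(count for _, count in site_updates)
    if total_examples <= 0:
        raise ValueError("total example count must be positive")
    size = len(site_updates[0][0])
    averaged = [0.0] * size
    for parameters, count in site_updates:
        if len(parameters) != size:
            raise ValueError("all sites must report the same parameter shape")
        for index, value in enumerate(parameters):
            averaged[index] += value * count / total_examples
    return averaged


# Two hypothetical sites: the second has three times as many examples.
global_parameters = federated_average([
    ([1.0, 2.0], 1),
    ([3.0, 4.0], 3),
])
```

With these made-up inputs the second site contributes three quarters of each averaged parameter.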

[0044] FIG. 4 is a diagram of an example server device 400 that can be used within system 100 of FIG. 1. Server device 400 can implement various features and processes as described herein. Server device 400 can be implemented on any electronic device that runs software applications derived from compiled instructions, including without limitation personal computers, servers, smart phones, media players, electronic tablets, game consoles, email devices, etc. In some implementations, server device 400 can include one or more processors 402, volatile memory 404, non-volatile memory 406, and one or more peripherals 408. These components can be interconnected by one or more computer buses 410.

[0045] Processor(s) 402 can use any known processor technology, including but not limited to graphics processors and multi-core processors. Suitable processors for the execution of a program of instructions can include, by way of example, both general and special purpose microprocessors, and the sole processor or one of multiple processors or cores, of any kind of computer. Bus 410 can be any known internal or external bus technology, including but not limited to ISA, EISA, PCI, PCI Express, USB, Serial ATA, or FireWire. Volatile memory 404 can include, for example, SDRAM. Processor 402 can receive instructions and data from a read-only memory or a random access memory or both. Essential elements of a computer can include a processor for executing instructions and one or more memories for storing instructions and data.

[0046] Non-volatile memory 406 can include by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. Non-volatile memory 406 can store various computer instructions including operating system instructions 412, communication instructions 414, application instructions 416, and application data 417. Operating system instructions 412 can include instructions for implementing an operating system (e.g., Mac OS®, Windows®, or Linux). The operating system can be multi-user, multiprocessing, multitasking, multithreading, real-time, and the like. Communication instructions 414 can include network communications instructions, for example, software for implementing communication protocols, such as TCP/IP, HTTP, Ethernet, telephony, etc. Application instructions 416 can include instructions for various applications. Application data 417 can include data corresponding to the applications.

[0047] Peripherals 408 can be included within server device 400 or operatively coupled to communicate with server device 400. Peripherals 408 can include, for example, network subsystem 418, input controller 420, and disk controller 422. Network subsystem 418 can include, for example, an Ethernet or WiFi adapter. Input controller 420 can be any known input device technology, including but not limited to a keyboard (including a virtual keyboard), mouse, track ball, and touch-sensitive pad or display. Disk controller 422 can include one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks.

[0048] FIG. 5 is an example computing device that can be used within the system 100 of FIG. 1, according to an embodiment of the present disclosure. The illustrative user device 500 can include a memory interface 502, one or more data processors, image processors, central processing units 504, and/or secure processing units 505, and peripherals subsystem 506. Memory interface 502, one or more central processing units 504 and/or secure processing units 505, and/or peripherals subsystem 506 can be separate components or can be integrated in one or more integrated circuits. The various components in user device 500 can be coupled by one or more communication buses or signal lines.

[0049] Sensors, devices, and subsystems can be coupled to peripherals subsystem 506 to facilitate multiple functionalities. For example, motion sensor 510, light sensor 512, and proximity sensor 514 can be coupled to peripherals subsystem 506 to facilitate orientation, lighting, and proximity functions. Other sensors 516 can also be connected to peripherals subsystem 506, such as a global navigation satellite system (GNSS) (e.g., GPS receiver), a temperature sensor, a biometric sensor, magnetometer, or other sensing device, to facilitate related functionalities.

[0050] Camera subsystem 520 and optical sensor 522, e.g., a charge-coupled device (CCD) or a complementary metal-oxide semiconductor (CMOS) optical sensor, can be utilized to facilitate camera functions, such as recording photographs and video clips. Camera subsystem 520 and optical sensor 522 can be used to collect images of a user to be used during authentication of a user, e.g., by performing facial recognition analysis.

[0051] Communication functions can be facilitated through one or more wired and/or wireless communication subsystems 524, which can include radio frequency receivers and transmitters and/or optical (e.g., infrared) receivers and transmitters. For example, the Bluetooth (e.g., Bluetooth low energy (BTLE)) and/or WiFi communications described herein can be handled by wireless communication subsystems 524. The specific design and implementation of communication subsystems 524 can depend on the communication network(s) over which the user device 500 is intended to operate. For example, user device 500 can include communication subsystems 524 designed to operate over a GSM network, a GPRS network, an EDGE network, a WiFi or WiMax network, and a Bluetooth™ network. For example, wireless communication subsystems 524 can include hosting protocols such that device 500 can be configured as a base station for other wireless devices and/or to provide a WiFi service.

[0052] Audio subsystem 526 can be coupled to speaker 528 and microphone 530 to facilitate voice-enabled functions, such as speaker recognition, voice replication, digital recording, and telephony functions. Audio subsystem 526 can be configured to facilitate processing voice commands, voice-printing, and voice authentication, for example.

[0053] I/O subsystem 540 can include a touch-surface controller 542 and/or other input controller(s) 544. Touch-surface controller 542 can be coupled to a touch-surface 546. Touch-surface 546 and touch-surface controller 542 can, for example, detect contact and movement or break thereof using any of a plurality of touch sensitivity technologies, including but not limited to capacitive, resistive, infrared, and surface acoustic wave technologies, as well as other proximity sensor arrays or other elements for determining one or more points of contact with touch-surface 546.

[0054] The other input controller(s) 544 can be coupled to other input/control devices 548, such as one or more buttons, rocker switches, thumb-wheel, infrared port, USB port, and/or a pointer device such as a stylus. The one or more buttons (not shown) can include an up/down button for volume control of speaker 528 and/or microphone 530.

[0055] In some implementations, a pressing of the button for a first duration can disengage a lock of touch-surface 546; and a pressing of the button for a second duration that is longer than the first duration can turn power to user device 500 on or off. Pressing the button for a third duration can activate a voice control, or voice command, module that enables the user to speak commands into microphone 530 to cause the device to execute the spoken command. The user can customize a functionality of one or more of the buttons. Touch-surface 546 can, for example, also be used to implement virtual or soft buttons and/or a keyboard.

[0056] In some implementations, user device 500 can present recorded audio and/or video files, such as MP3, AAC, and MPEG files. In some implementations, user device 500 can include the functionality of an MP3 player, such as an iPod™. User device 500 can, therefore, include a 36-pin connector and/or 8-pin connector that is compatible with the iPod. Other input/output and control devices can also be used.

[0057] Memory interface 502 can be coupled to memory 550. Memory 550 can include high-speed random access memory and/or non-volatile memory, such as one or more magnetic disk storage devices, one or more optical storage devices, and/or flash memory (e.g., NAND, NOR). Memory 550 can store an operating system 552, such as Darwin, RTXC, LINUX, UNIX, OS X, Windows, or an embedded operating system such as VxWorks.

[0058] Operating system 552 can include instructions for handling basic system services and for performing hardware dependent tasks. In some implementations, operating system 552 can be a kernel (e.g., UNIX kernel). In some implementations, operating system 552 can include instructions for performing voice authentication.

[0059] Memory 550 can also store communication instructions 554 to facilitate communicating with one or more additional devices, one or more computers and/or one or more servers. Memory 550 can include graphical user interface instructions 556 to facilitate graphic user interface processing; sensor processing instructions 558 to facilitate sensor-related processing and functions; phone instructions 560 to facilitate phone-related processes and functions; electronic messaging instructions 562 to facilitate electronic messaging-related processes and functions; web browsing instructions 564 to facilitate web browsing-related processes and functions; media processing instructions 566 to facilitate media processing-related functions and processes; GNSS/Navigation instructions 568 to facilitate GNSS and navigation-related processes and instructions; and/or camera instructions 570 to facilitate camera-related processes and functions.

[0060] Memory 550 can store application (or “app”) instructions and data 572, such as instructions for the apps described above in the context of FIGS. 1-3. Memory 550 can also store other software instructions 574 for various other software applications in place on device 500.

[0061] The described features can be implemented in one or more computer programs that can be executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. A computer program is a set of instructions that can be used, directly or indirectly, in a computer to perform a certain activity or bring about a certain result. A computer program can be written in any form of programming language (e.g., Objective-C, Java), including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

[0062] Suitable processors for the execution of a program of instructions can include, by way of example, both general and special purpose microprocessors, and the sole processor or one of multiple processors or cores, of any kind of computer. Generally, a processor can receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer may include a processor for executing instructions and one or more memories for storing instructions and data. Generally, a computer may also include, or be operatively coupled to communicate with, one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks. Storage devices suitable for tangibly embodying computer program instructions and data may include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory may be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).

[0063] To provide for interaction with a user, the features may be implemented on a computer having a display device such as an LED or LCD monitor for displaying information to the user and a keyboard and a pointing device such as a mouse or a trackball by which the user may provide input to the computer.

[0064] The features may be implemented in a computer system that includes a back-end component, such as a data server, or that includes a middleware component, such as an application server or an Internet server, or that includes a front-end component, such as a client computer having a graphical user interface or an Internet browser, or any combination thereof. The components of the system may be connected by any form or medium of digital data communication such as a communication network. Examples of communication networks include, e.g., a telephone network, a LAN, a WAN, and the computers and networks forming the Internet.

[0065] The computer system may include clients and servers. A client and server may generally be remote from each other and may typically interact through a network. The relationship of client and server may arise by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

[0066] One or more features or steps of the disclosed embodiments may be implemented using an API. An API may define one or more parameters that are passed between a calling application and other software code (e.g., an operating system, library routine, function) that provides a service, that provides data, or that performs an operation or a computation.

[0067] The API may be implemented as one or more calls in program code that send or receive one or more parameters through a parameter list or other structure based on a call convention defined in an API specification document. A parameter may be a constant, a key, a data structure, an object, an object class, a variable, a data type, a pointer, an array, a list, or another call. API calls and parameters may be implemented in any programming language. The programming language may define the vocabulary and calling convention that a programmer will employ to access functions supporting the API.

[0068] In some implementations, an API call may report to an application the capabilities of a device running the application, such as input capability, output capability, processing capability, power capability, communications capability, etc.

[0069] While various embodiments have been described above, it should be understood that they have been presented by way of example and not limitation. It will be apparent to persons skilled in the relevant art(s) that various changes in form and detail may be made therein without departing from the spirit and scope. In fact, after reading the above description, it will be apparent to one skilled in the relevant art(s) how to implement alternative embodiments. For example, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other implementations are within the scope of the following claims.

[0070] In addition, it should be understood that any figures which highlight the functionality and advantages are presented for example purposes only. The disclosed methodology and system are each sufficiently flexible and configurable such that they may be utilized in ways other than that shown.

[0071] Although the term “at least one” may often be used in the specification, claims and drawings, the terms “a”, “an”, “the”, “said”, etc. also signify “at least one” or “the at least one” in the specification, claims and drawings.

Finally, it is the applicant's intent that only claims that include the express language "means for" or "step for" be interpreted under 35 U.S.C. 112(f). Claims that do not expressly include the phrase "means for" or "step for" are not to be interpreted under 35 U.S.C. 112(f).

Claims

1. A system for federated data harmonization comprising: a client agent residing on an edge server at a first site and being configured to access a first dataset; and a server accessible by the client agent and comprising instructions which, when executed by one or more processors, cause the server to perform a process operable to: receive an input format selection and an output format selection from a user device; receive a syntactic mapping definition from the user device; receive, from the user device, a selection of input datasets accessible by the edge server; and transmit the syntactic mapping definition and the selected input datasets to the client agent; wherein the client agent, via the edge server: executes data transformation code to apply the syntactic mapping definition and one or more semantic transformations to the selected input datasets to generate at least one output dataset.

2. The system of claim 1, wherein the server is further operable to package the data transformation code object into a software container image.

3. The system of claim 1, wherein receiving the input format selection comprises receiving a custom textual format or a selection of a pre-defined format.

4. The system of claim 1, wherein receiving the output format selection comprises receiving a custom textual format or a selection of a pre-defined format.

5. The system of claim 1, wherein receiving the syntactic mapping definition from the user device comprises: analyzing the input format selection and output format selection with a machine learning model; generating one or more recommended syntactic mapping definitions; causing the one or more recommended syntactic mapping definitions to be displayed on the user device; and receiving a selection of one of the one or more recommended syntactic mapping definitions from the user device.

6. The system of claim 1, wherein the client agent is further operable to: receive at least one semantic mapping definition comprising input data and a target vocabulary; communicate the at least one semantic mapping definition to the client agent; and trigger an artificial intelligence (AI) model to be executed by the client agent to: generate recommendations for the semantic mapping of the input data to target values in the target vocabulary; and generate a confidence value for each recommendation.

7. The system of claim 6, wherein the server is further operable to provide secure access to a second user device to view the generated semantic mapping recommendations and associated confidence values.

8. The system of claim 7, wherein providing secure access to the second user device to view the generated semantic mapping recommendations comprises establishing one or more encrypted channels between the second user device and the client agent.

9. The system of claim 7, wherein providing secure access to the second user device to view the generated semantic mapping recommendations enables the second user device to perform actions on the semantic mapping values.

10. The system of claim 6, wherein executing the data transformation code comprises applying the recommended to the specified input data to generate the at least one output dataset.

11. The system of claim 1, wherein the client agent is further operable to: generate a predefined number of recommended semantic mapping definitions; cause the predefined number of recommended semantic mapping definitions to be displayed on the user device; and receive a selection of at least one of the predefined number of recommended semantic mapping definitions from the user device.

12. The system of claim 6, wherein the server is further operable to: access a plurality of edge servers, the plurality of edge servers comprising the edge server; compile semantic mapping data from the plurality of edge servers; access the AI model stored in the server; and trigger a federated fine-tuning process of the AI model using semantic mapping data from the plurality of edge servers.

13. The system of claim 12, wherein the federated fine-tuning process comprises generating a different model for each edge server or semantic mapping on an edge server.

14. The system of claim 1, wherein the client agent is further operable to: access a structured data store; and communicate output datasets to the structured data store to be stored in the data store.

15. The system of claim 6, wherein the server is further operable to: receive a definition of a custom vocabulary from the user device, including a set of values; and instruct the client agent to trigger the AI model to generate the semantic mapping recommendations limiting to target values within the custom vocabulary.

16. The system of claim 6, wherein the server is further operable to: execute a pre-processing service in the cloud that comprises one or more of a vector lookup database, a text-based filtering service for a specific vocabulary, or an artificial intelligence model configured to perform pre-filtering techniques; and allow the client agent to access the pre-processing service as part of the generation of semantic mapping recommendations, wherein the pre-processing service augments the input to the AI model generating the mapping recommendations.

17. The system of claim 6, wherein the server is further operable to: receive a semantic mapping privacy configuration as part of the semantic mapping definition from the user device; and allow sharing parts of semantic mappings, as allowed by the privacy configuration, across client agents or across different data transformation efforts.

18. The system of claim 9, wherein the server is further operable to: receive an existing semantic mapping and additional input data; update the semantic mapping to provide recommended mappings for values in the additional input data that were not present in the existing semantic mapping; and update the semantic mapping to adjust recommended mappings for values that were already present in the existing semantic mapping.

19. The system of claim 6, wherein the server is further operable to: analyze the semantic mappings for different datasets; and provide analytics and quality metrics regarding the semantic mappings.

20. The system of claim 19, wherein the server is further operable to: utilize one or more of statistical analysis, structured rules, or AI models to provide recommendations for improvements to the semantic mappings.