CN118941395A

CN118941395A - An automatic generation and management system for data assets based on data operation technology

Info

Publication number: CN118941395A
Application number: CN202411426510.5A
Authority: CN
Inventors: 陈刚; 王旭飞; 王明浩; 赵凯
Original assignee: Sinocbd Inc
Current assignee: Sinocbd Inc
Priority date: 2024-10-14
Filing date: 2024-10-14
Publication date: 2024-11-12
Anticipated expiration: 2044-10-14
Also published as: CN118941395B

Abstract

The invention discloses a data asset automatic generation and management system based on a data operation technology, which comprises 1) completing construction of a data stream based on metadata to form a logic data resource set facing an application scene, and carrying out quality inspection on the logic data resource by configuring a quality inspection rule; 2) Determining the type and classification strategy of the logic data resources by combining the format information and the sampling mode of the metadata, and carrying out final classification identification of the data; 3) For the logic data resources which are aggregated and subjected to quality inspection and classification, a data product with asset value is constructed by configuring specific product information to a data query interface of the logic data resources; 4) Performing value evaluation on the data product through cost evaluation, income evaluation and transaction evaluation, and performing asset configuration by taking the data resource as a unit; 5) And (3) periodically packaging and generating data resources, and carrying out global statistics and management on the currently effective data assets.

Description

Automatic data asset generation and management system based on data operation technology

Technical Field

The invention relates to the field of digital economy, in particular to an automatic data asset generation and management system based on a data operation technology.

Background

Data capitalization refers to the process of gradually converting data from raw data to data assets. From the process of data asset formation, data asset formation is a value creation activity around data, including various links and processes such as data acquisition, processing, management, development, and transaction, and the final objective is to promote the conversion of raw data into data assets, and to excite and release the potential of data value.

The prior data capitalization mainly adopts a data value evaluation and capitalization mode in the vertical field, mainly aims at the vertical fields such as medical treatment, consumption, finance and the like, mainly focuses on management, valuation and circulation of data assets, establishes a general standard of a data asset classification and valuation method required by the vertical field through demand investigation and expert review, provides corresponding software platform facilities, and provides unified management and circulation functions of the data assets for related subjects. Such software platform facilities typically employ blockchains or other validation means by which businesses, organizations and individuals within the industry can comb their own data and upload it to the software platform facilities, relying on the platform to conduct subsequent data asset circulation transactions.

The existing vertical field data value evaluation and capitalization method can not completely solve the difficulty and pain of capitalization of data elements, and the existing problems are as follows:

1) Lack of versatility. By refining and summarizing the general conditions in the field, the method can meet most common scenes, but is difficult to be compatible with edge conditions, and for enterprises and organizations with special service requirements or different subclasses, the types of data, the composition conditions of data items and feature sampling of the enterprises and organizations may have irreconcilable deviation from the general standard, and the enterprises and organizations cannot be fully applicable to the unified standard;

2) Data preparation is difficult. The existing method generally relies on manual work or forms data assets meeting the standard through data center platform aggregation and processing to upload to a software platform, the process is not only strongly dependent on data processing capacity and facilities, but also needs to invest a great deal of time and manpower, related personnel need to regularly comb newly generated data resources and upload, and enterprise managers or staff are difficult to ensure the comprehensiveness and instantaneity of data due to the existence of a great deal of tribal knowledge and data islands;

3) And the expansion and optimization are difficult. Along with industry development, technical progress and continuous development of the value of the data elements, research and formulation of unified standards of data assets in various vertical fields by individuals or groups not only need to consume a great deal of cost, but also have alignment requirements on the cognition degree of standard updating contents of all participants, otherwise, data asset contents and value evaluation deviations caused by misunderstanding are easy to generate.

Accordingly, there is a need for an improvement over the prior art described above to overcome the above-described deficiencies.

Disclosure of Invention

The invention aims to provide a data asset automatic generation and management system based on a data operation technology, which can be used for completing full-chain evolution and management from original data to data assets, and reduces time, manpower and fund investment required by data asset formation through artificial intelligence-based process construction and scheduling. In addition, the vertical domain large model and the industry mechanism knowledge graph constructed in the invention can also solve the problems of efficiently and accurately discovering, constructing and managing the data asset when the knowledge about the data asset is not known enough by enterprises, thereby helping the enterprises to provide full-link automatic operation assistance and decision suggestion. Through the large model capable of being automatically fine-tuned and the intelligent perfect knowledge graph, relevant knowledge and mechanisms can be guaranteed to be continuously optimized and perfect along with the use of users, the edge situation is properly solved, and the environment change is adapted.

The technical aim of the invention is realized by the following technical scheme:

A data asset automatic generation and management system based on data operation technology comprises the following functional modules:

1) Data stream arrangement and quality control: according to the data requirements and the application scene, building the data stream based on metadata, forming a logic data resource set facing the application scene, and carrying out quality inspection on the logic data resource by configuring a quality inspection rule;

2) Data classification and classification identification: determining the type and classification and grading strategy of the logic data resources by combining the format information and the sampling mode of the metadata; selecting a data list range to be followed according to industries and regions where the data classification and classification recognition are performed;

3) A circulation transaction: for the logic data resources which are aggregated and subjected to quality inspection and classification, a data product with asset value is constructed by configuring specific product information to a data query interface of the logic data resources;

4) Value evaluation: evaluating the data assets in three modes of cost evaluation, income evaluation and transaction evaluation, and carrying out asset configuration by taking the data resources as units;

5) Asset generation and entry: and automatically and periodically packaging and generating data resources, and carrying out global statistics and management on the currently effective data assets.

Further, the method for arranging and controlling the quality of the data stream comprises the following steps:

1.1 Metadata ingest: by configuring protocol type, connection position and authentication information, adding user files to be managed, database data, information system data and real-time information of the edge equipment of the Internet of things, automatically connecting and scanning all readable contents in a target position by a system, and recording corresponding storage positions and explicit characteristics for each readable content;

1.2 Semantic information capture: carrying out semantic analysis on the currently acquired content by utilizing the large model subjected to task fine adjustment; the task fine tuning adopts a transfer learning method, and is subjected to supervised learning adjustment aiming at data types and semantic requirements on the basis of large-scale general corpus;

1.3 Intelligent assisted drag data stream construction: when a user builds a data stream, the user is helped to quickly complete the building of the data stream through multi-level intelligent recommendation and interactive building functions;

1.4 Data resource construction and quality management: after the data stream is constructed, the system generates a logic data resource set by the configuration and connection of the data stream and presents the logic data resource set in a user interface.

Further, the data classification and classification recognition method comprises the following steps:

2.1 Determining a corresponding classification hierarchical recognition strategy and an object according to the format of the logic data resource;

2.2 After the classification and hierarchical recognition strategy and the object are definitely classified, selecting a data list range to be followed;

2.3 Classification and classification recognition is performed by the vertical domain large model agent subjected to specific task fine tuning.

Further, the method of the circulation transaction is as follows:

3.1 For the data resources with finished quality inspection rules and classification, identification and configuration, configuring specific product information through a data query interface, and constructing a data product with asset value;

3.2 A real-time recommendation algorithm is adopted, and the most suitable data products are recommended to the user according to the historical behaviors, the preferences and the current requirements of the user;

3.3 Recording the description information of all the data products, carrying out matching and searching of the data products, and carrying out transaction by adopting an off-line business process or an on-line transaction mode after determining the data products;

3.4 Recording the transaction condition, including transaction occurrence time, transaction side information, transaction type information, and sampling and storing the query result, satisfying the storing of the data transaction process, and ensuring the subsequent traceability.

Further, the value evaluation method comprises the following steps:

4.1 Right-of-weight confirmation: the user first needs to confirm and declare his equity for the data asset to be generated.

4.2 Cost collection): the system allows a user to configure the cost of the data resources in detail from generation to maintenance.

4.3 Revenue measurement: and comprehensively evaluating the value contribution of the data resource to enterprise operation and overall business through the real-time statistics and analysis functions.

4.4 Transaction statistics): the system can comprehensively count the transaction conditions of the data resources on the platform, including historical transaction frequency, transaction price and future transaction expectation.

Further, the asset generation and tabulation method is as follows:

5.1 Asset configuration: asset configuration is carried out on data resources in the system according to own requirements;

5.2 Automatically packaging assets to be audited: extracting new content in a period from the original data, and packaging the new content into static and unchangeable to-be-checked data assets;

5.3 Data asset audit): analyzing the historical auditing records and the characteristics of the data assets by adopting a random forest algorithm based on Bootstrap sampling, automatically carrying out preliminary classification and scoring on the data assets to be audited, and providing auditing suggestions for users;

5.4 Unified management of data assets): the formed data assets are uniformly managed by the system as valuable non-asset capacity, and users can check the current data asset list, the total value and the amortization condition.

Further, the task fine tuning method of the large model in the step 1.2) is as follows:

a) Data preparation: preparing field data related to an expected task in a manual labeling mode, wherein the field data comprises thousands of data table structures and fine-tuning corpus of corresponding semantic information of the data table structures, and the fine-tuning corpus is used as initial data of the fine-tuning task;

b) Fine tuning of the model: the pre-training model based on the open source large model is subjected to fine adjustment by using field data, and semantic understanding performance of the model on a specified training statement is optimized by adopting a cross entropy loss function through a method of automatically dividing a training set and a verification set; for a single sample, the cross entropy loss definition is as shown in equation (1):

(1)

For cross entropy loss of the entire training set, the average of all samples is typically taken and calculated with equation (2):

(2)

c) Model optimization: through mixed precision training and gradient accumulation, the training efficiency is optimized and is cross-compared with a verification set while the model accuracy is ensured, and the effect verification is performed through a predefined loss function.

Further, the method for constructing the intelligently-assisted drag data stream in the step 1.3) is as follows:

Based on the definition and description of the application scene, the AP-GNN algorithm is used for assisting the user in carrying out quick recommendation of metadata related to the user target; the AP-GNN algorithm firstly performs association rule mining, and searches all frequent item sets by gradually increasing the length of the item sets; if the frequent item set meets the specified support threshold, the metadata association modes are indicated to occur more frequently in the historical data stream configuration;

the support degree calculation is shown in a formula (3):

(3)

Generating association rules from the frequent item set, and evaluating the reliability of the rules through confidence, wherein a confidence calculation formula is shown as a formula (4):

(4)

After obtaining reliable metadata association patterns, representing the correspondence between the patterns and specific data sources by using a graph structure; nodes in the graph structure represent metadata and data sources, and edges in the graph structure represent metadata association patterns; the weight on the edge is set as the confidence or support of the association rule; then gradually capturing local features and global structural features of the nodes by utilizing convolution operation of the graph neural network;

the convolution operation is implemented as shown in equation (5):

(5)

through the method, a user can locate the required data content in a transparent mode, and the metadata, the data operation steps and the logic output data set are connected in series through the visual drag interface, so that a data stream is built according to the need, and the data format, the data storage position and the data access mode are not required to be concerned.

Further, the method for performing quality inspection on the logic data resource by configuring quality inspection rules in the step 1) is as follows:

a) Data consistency verification: converting the data block into a hash value with a fixed length by using a hash checking algorithm, and judging whether the data are consistent or not by comparing the hash values;

Assume that the data block is The hash function isThen the data blockHash value of (a)Can be calculated using equation (6):

(6)

b) Data integrity verification: carrying out integrity check on the data through external key constraint, and avoiding partial deletion or error of the data;

c) And (3) checking data accuracy: the accuracy of the data is evaluated in real time and the potential abnormal data is identified by using a Bayesian network, and the implementation method is shown in a formula (7);

(7)

d) And (3) checking data timeliness: monitoring the update frequency of data by utilizing time sequence analysis and real-time stream processing technology, and ensuring that the real-time requirement of the service on the data is met;

e) User-defined verification: the data quality standard and the detection rule are configured by the user, and the real-time verification is carried out on the ingested data, so that whether the data meets the expected format, the value interval and the statistical information can be verified by the user-defined rule, and the specific service requirement can be met.

Further, the method for classifying, classifying and classifying the data in the step 2.3) is as follows:

a) Accurately classifying and identifying each data object by adopting a large model;

b) For the classification and grading result generated by the large model, a CE-F1 function is used for classification optimization;

the CE-F1 function fuses the functions of the cross entropy loss function and the classification evaluation index F1-score; the cross entropy loss function is used for measuring a single error coefficient of the classification result, F1-Score represents a quality weight Partial-weight of the recall result obtained by calculating a harmonic average value of the precision rate and the recall rate, and is used for measuring the performance of the classifier when the unbalanced data set is processed;

The mass weight calculation is shown in formulas (8) - (10):

(8)

(9)

(10)

Further, the cost aggregation method is as follows:

For each data resource, it is assumed to contain A plurality of movable steps, each step having a cost ofThen the total cost can be calculated using equation (11)：

(11)

Wherein the cost of each activity stepCan be calculated by equation (12):

(12)

Wherein, Representing the resource cost associated with activity i; Representing activity driving factors, representing the degree or quantity of resources consumed by the activity i;

the method for measuring and calculating the benefits comprises the following steps:

The ARI-MLR algorithm is adopted to analyze and predict the use condition of the data resources in different time periods so as to identify peak periods and valley periods, as shown in formulas (13) and (14):

(13)

(14)

the transaction statistics method comprises the following steps:

Predicting future transaction trend by adopting a time sequence prediction algorithm Prophet model, and generating a transaction evaluation result by combining records of historical data transactions, wherein the transaction evaluation result is shown in a formula (15):

(15)

Further, the method for automatically packaging the assets to be audited comprises the following steps:

and segmenting and aggregating the data by adopting a distributed data processing algorithm based on MapReduce. According to the characteristics of the data, each data block is distributed to different computing nodes for processing, and the implementation method is as shown in a formula (16):

(16)

the Reduce function formula (17) shows:

(17)

the method for auditing the data asset comprises the following steps:

Analyzing the historical auditing records and the characteristics of the data assets based on a random forest algorithm of Bootstrap sampling, automatically carrying out preliminary classification and scoring on the data assets to be audited, and providing auditing suggestions; the splitting of the random forest algorithm based on Bootstrap sampling adopts an information gain method shown in a formula (18):

(18)

further, the system also comprises a dynamic large model access and intelligent body construction and management module, and the implementation steps are as follows:

Step one: packaging model: the differences between calling parameters and return structures of interfaces of different large models are packaged in the form of a Connector, a unified use mode is provided, and switching among the different large models according to the need is supported;

Step two: generating an Agent: the method comprises the steps of generating Agent corresponding to specific type tasks or creating copies based on existing agents to perform experimental development or further fine adjustment, and aiming at tasks such as data stream construction, asset discovery, map fusion and the like;

Step three: acquiring an agent execution result: other modules can transmit input meeting the expected format through calling functions, and acquire an intelligent agent execution result; after the execution result is obtained, other modules can dynamically generate original corpus required by the fine adjustment of the intelligent body according to feedback and modification of the execution result by a user or other predefined rules, feed back the original corpus through an intelligent body feedback function, locate information during execution according to a task ID, input a large model during calling by combining with the result in the feedback, and store the result as a fine adjustment corpus;

step four: determining the fine adjustment requirement of an agent: the dynamic large model access and agent construction and management module can determine strategies such as fine tuning frequency, tendency and the like in a global configuration mode, and the module can globally coordinate fine tuning requirements of each agent according to configured time and frequency and dynamically plan fine tuning execution time of each agent;

Step five: fine tuning training agent: when the fine tuning starts, an intelligent agent uses model reasoning to clean data, eliminates wrong, inferior and repeated contents, solidifies the processed fine tuning corpus, converts the format of the fine tuning corpus based on the fine tuning interface parameter format requirement of a basic model, and finally uses the fine tuning function of the basic large model to carry out fine tuning training;

Step six: managing and multiplexing fine-tuning corpus: the stored and solidified high-quality fine tuning corpus is anonymized and is only related to corresponding tasks and agents, when the module is integrally migrated to a new basic large model, each agent can re-execute the fine tuning tasks according to the historical fine tuning records, so that the agents based on the new basic large model are trained necessarily, and the executing effect and quality are ensured.

Further, the method also comprises a knowledge graph capable of automatically and intelligently fusing growth, and the implementation steps are as follows:

Step one: constructing a basic structure of a map, wherein the basic structure comprises industry and specific links, original data description information, data resource description information and composition/inclusion relation;

step two: preparing profile map content, and ensuring basic use experience of a user when actual use data of the user cannot be fully accumulated;

step three: map-based data asset intent selection: the knowledge graph can be selected to be used or not to propose the proposal of the asset and provide operation assistance, and simultaneously, the system is allowed to collect macroscopic description information which does not relate to the specific bearing information of the data;

Step four: generating data asset recommendation based on the map: after the user selects to use the related service and agrees to collect and share, the system arranges semantic description information of the current global data, including the content in the accessed original data, which is acquired in the process of data stream arrangement and quality control, so as to judge the data type and basic condition owned by the current user, and compare the data type and basic condition with the existing knowledge Graph content of the system according to the scanning result, the user can select the industry, service type and link corresponding to the user to locate Sub-Graph, and propose the subsequent data operation and capitalization according to the comparison result;

Step four: expanding and optimizing map content based on user information: when the user builds the data resource, the related information is anonymized and transmitted to the server, and the system updates the existing nodes and relations or creates new nodes and relations according to the situation.

In summary, the invention has the following beneficial effects:

The invention obviously improves the automation degree and the convenience degree of the whole data asset process by standardizing the products and the auxiliary modules and functions, solves the defects of the traditional method in aspects of universality, usability and growth, comprehensively reduces various costs of data asset and improves the output efficiency and the subsequent multiplexing capability of the data asset process.

1) The adaptability of the edge scene is greatly improved. The method can adapt to any edge scene by automatically acquiring user information and intelligently fusing the user information into a grown knowledge graph, is compatible with special requirements and actual conditions of different users, and has obviously better support degree to various requirements than the traditional method by continuously perfecting the complementary knowledge graph;

2) The data asset generation costs are significantly reduced. By combining full life cycle management of data assets with data operation technology, the invention obviously reduces cost investment required by data carding and inventory, ensures long-term reusability by arranging and configuring data flow and asset generation, and reduces the investment required by enterprise acquisition, processing and long-term maintenance of data resources; through the operation of global business data, the management of data assets and the normal data use flow of enterprises are combined, and the discovery and the treatment of all data resources can be completed in the normal business development, so that the timeliness and the comprehensiveness of the data are ensured;

3) Has good expansibility and freedom. Based on the autonomous learning user operation mode and the regular large model intelligent agent, the invention can adapt to the data asset construction mode in the small probability scene, combines the continuous complementary optimization of the autonomous intelligent fusion growing knowledge graph, avoids the dependence of the user on the expert and team, and changes the traditional manual-dependent mode into the system mode capable of automatically and intelligently optimizing the growth, thereby ensuring that the system has the capability of optimizing along with the development of industry, the technological progress and the continuous development of the value of data elements. In addition, the intelligent body capable of automatically fine-tuning and the knowledge graph capable of automatically increasing and optimizing enable the system to have long-term learning capability, and the characteristics of the fine-tuning effect can be reserved through supporting hot plug, so that the investment required by subsequent system upgrading and optimizing can be obviously reduced while the development of related AI technology is adapted.

Drawings

FIG. 1 is a schematic diagram of an automated data asset generation and management system in accordance with the present invention.

Fig. 2 is a schematic diagram of a data asset system based on data operation technology according to the present invention.

FIG. 3 is a block diagram of the dynamic large model access and agent construction and management according to the present invention.

Fig. 4 is a knowledge graph module diagram capable of autonomous intelligent fusion growth according to the present invention.

Detailed Description

In order that the manner in which the above-recited features, advantages, objects and advantages of the invention are obtained, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.

Referring to fig. 1 and 2, the automatic data asset generation and management system based on the data operation technology provided by the invention comprises five steps of data stream arrangement and quality control, data classification and classification identification, circulation transaction, value evaluation, asset generation and tabulation.

Step one: data stream arrangement and quality control. Multisource heterogeneous data of a connected user are ingested into metadata containing basic information and semantic information based on artificial intelligence; after the ingestion is completed, a user can construct a dragging page to complete the construction of a data stream through a code-free data stream with AI auxiliary capability based on own data requirements and application scenes; the built data flow can form a logic data resource set facing the application scene, and confirms the source position, the operation flow and the output format of the logic data resource set, so that the logic data resource set can be continuously multiplexed in the aggregation and the treatment of newly added data. In addition, the system supports the allocation of quality check rules to the formed data resources, thereby ensuring that the quality of the data always meets requirements and expectations. The steps for realizing data stream arrangement and quality control are as follows:

Metadata ingestion: the user can add the contents such as user files to be managed, database data, information system data, real-time information of the edge equipment of the internet of things and the like by configuring the contents such as protocol types, connection positions, authentication information and the like, the system can automatically connect and scan all readable contents in the positions, and generate corresponding storage positions and explicit characteristics (such as path names, table/file names, table fields, text titles and the like which can be acquired without natural language processing and semantic understanding) for each piece of contents.

Semantic information capture: and carrying out semantic analysis on the currently acquired content by using the large model, and carrying out semantic analysis on the acquired content by using the large model which is subjected to fine adjustment of specific tasks by using the system. The fine tuning process adopts a transfer learning method, and on the basis of a large-scale general corpus, supervised learning fine tuning is performed aiming at the data types and semantic requirements in the system. The fine tuning process includes the steps of:

Data preparation: preparing field data related to an expected task in a manual labeling mode, wherein the field data comprises thousands of data table structures and fine-tuning corpus of corresponding semantic information of the data table structures, and the fine-tuning corpus is used as initial data of the fine-tuning task;

Fine tuning of the model: the pre-training model based on the open source large model is subjected to fine adjustment by using field data, and semantic understanding performance of the model on a specified training statement is optimized by adopting a cross entropy loss function through a method of automatically dividing a training set and a verification set; for a single sample, the cross entropy loss definition is as shown in equation (1):

(1)

where C is the number of categories of the classification, Is the actual label, if the sample belongs to class i,

Then=1 Otherwise=0. The probability that the model predictive sample belongs to class i will be calculated by the softmax function. For cross entropy loss of the entire training set, calculate with equation (2):

(2)

where N is the number of samples in the training set. AndThe actual label and the predicted probability distribution for the j-th sample, respectively. In the fine tuning process, each sample in the training set is passed through the model to generate a predictive probability distribution, and then the cross entropy loss is calculated with the actual label. And calculating the gradient of the loss function relative to the model parameters through a back propagation (Backpropagation) algorithm, and randomly gradient-descending the SGD (generalized gradient) by using an optimization algorithm to update the parameters of the model, so that the loss function is gradually reduced, and the performance of the model in semantic understanding is optimized.

Model optimization: through technologies such as mixed precision training and gradient accumulation, training efficiency is optimized while model accuracy is guaranteed, cross comparison is carried out on a verification set, and effect verification is carried out through a predefined loss function.

In the semantic analysis stage, the trimmed model can more accurately infer the semantic description of the data content and generate corresponding vector representations. The system can further process and archive the semantic information, and vector and store all the content with the semantic information based on acge _text_ embeddin model so as to realize efficient fuzzy search through 2048 high-dimensional vector distance calculation. Finally, the system encodes all the acquired information according to the unified metadata encoding specification and stores the information in a unified way, so that standardized management and efficient retrieval of metadata are realized.

Intelligent auxiliary drag data stream construction: when a user builds a data stream, the system helps the user to quickly complete the building of the data stream through multi-level intelligent recommendation and interactive building functions. Firstly, the system rapidly recommends metadata related to a user target based on the definition and description of an application scene by using a machine learning technology such as an association rule mining algorithm and cluster analysis. The system uses a self-developed AP-GNN algorithm to assist the user in making quick recommendations of metadata related to the user's goals. The AP-GNN algorithm is a novel algorithm fused with the Apriori algorithm and a Graph Neural Network (GNN), and the core idea is to extract metadata association modes from historical data stream configuration by using the Apriori algorithm, model complex mapping between the modes and data sources by using the graph neural network, and realize intelligent auxiliary drag data stream construction. The Ap-GNN algorithm first performs association rule mining to find all frequent item sets by gradually increasing the length of the item sets. If the frequent item set meets a prescribed support threshold, it is indicated that these metadata-associated patterns occur more frequently in the historical data stream configuration. The support degree calculation is shown in a formula (3):

(3)

Wherein, Representing a particular set of metadata items representing one or more field/data table combinations for computing their corresponding support，Is the frequency of occurrence of a set of metadata items in a historical data stream configuration, represents the frequency of occurrence of a particular field/data table combination,Is the total number of historical data stream configurations, representing the total number of observations by the system in the intelligent auxiliary drag data stream build, which means all configurations in the past.

(4)

Wherein, AndIs a set of two items that are to be combined,Representing a certain set of metadata itemsRepresentative andAnother metadata item or combination of associations, association ruleRepresenting "if item setIf so, item setIt is also likely that a "rule" will occur,Is a collection of itemsIs used for the support of the (a),Is a collection of itemsSum item setThe degree of support that occurs simultaneously in the same data flow configuration.

After obtaining reliable metadata association patterns, first, the correspondence between these patterns and the specific data sources is represented by a graph structure. Nodes in the graph represent metadata and data sources, and edges represent metadata association patterns. The weights on the edges are set to the confidence or support of the association rule. Then, the local and global structural features of the nodes are progressively captured using a convolution operation (Convolutional Operation) of the graph neural network. The convolution operation is implemented as shown in equation (5):

(5)

Wherein, The characteristic representation of the representation node v at the k-th layer,A set of neighbor nodes representing node v,Is the normalized constant between node v and its neighbor node u,Is the firstThe weight matrix of the layer, σ, is the ReLU activation function.

In this way, the user can locate the required data content in a transparent way, and connect nodes such as metadata, data operation steps, logic output data sets and the like in series through an intuitive drag interface, so that the data stream is built according to the need, and the bottom details such as the data format, the data storage position, the data access mode and the like are not required to be concerned. In addition, the system also provides intelligent data stream building functions in natural language form. The user can use a deep learning model agent subjected to vertical domain data fine tuning to perform one-stop data stream construction through describing the expected application scene and the field requirements of the final data. This agent enhances the understanding and generating capabilities of the model for natural language by using techniques such as generating a countermeasure network (GAN). A large number of data stream construction examples and comparison data sets of natural language descriptions are applied in the fine tuning process, so that the performance of the model under a specific task is improved. After the user inputs the description, the system generates a data stream meeting the requirements, and the data stream is displayed on an interface for verification and confirmation by the user or further modification, and the modified record can be used as a corpus training set for later fine adjustment of the system.

Data resource construction and quality management: after the data stream is built, the system will ingest the metadata according to its configuration, generate a logical data resource set, i.e., an adaptive semantic data space DIKube, and present in the user interface. The user can inquire the data in real time through the inquiry interface attached to the data resource. The query interface employs a distributed query optimization technique APACHE CALCITE to ensure that users can efficiently access the most recently aggregated data. The system will dynamically acquire and update these data in the background to ensure that they are consistent with the current business scenario. To ensure the accuracy and reliability of logical data resources, the system allows a user to configure data quality criteria and detection rules. These rules are automatically applied as the data stream runs to detect data quality in real time. The detection types supported by the system include:

Data consistency verification: in order to improve the efficiency of data consistency verification, the invention uses a hash verification algorithm, as shown in a formula (6), which can convert a data block into a hash value with a fixed length. The uniqueness of the hash value ensures the integrity and consistency of the detected data. If the data changes during transmission or storage, its hash value will also change. That is, by comparing the hash values, it is judged whether the data are identical. For example, assume that the data block is The hash function isThe hash value h of the data block D can be calculated with equation (6):

(6)

Wherein, Is a SHA-256 hash function.Is a fixed length hash value generated by the hash function. If two data blocksAndThe hash values of (a) are identical, i.eThe two data blocks are considered to be identical.

Data integrity verification: carrying out integrity check on the data through external key constraint, and avoiding partial deletion or error of the data;

And (3) checking data accuracy: the system uses the Bayesian network to unsupervised real-time assessment of the accuracy of the data and identification of potentially anomalous data. A bayesian network is a probabilistic graph-based model representing a set of random variables and their conditional dependencies. It consists of nodes (representing variables) and directed edges (representing conditional dependencies), each node having a corresponding conditional probability distribution. Through the Bayesian network, probability inference can be carried out on the accuracy of the data, so that potential abnormal data can be identified, and the implementation method is shown in a formula (7).

(7)

Wherein the method comprises the steps ofIs the posterior probability of event a occurring in the event B occurring.Is the conditional probability that event B occurs in the event a occurs.Is the a priori probability of event a.Is the edge probability of event B. In data accuracy verification, bayesian networks are used to model conditional dependencies between data variables. The Bayesian network for data accuracy verification is constructed as follows: defining target metadata range to be checked for accuracy, locating corresponding original data set by position information in metadata, and taking each attribute or feature in data set as a random variable. A parent node (i.e., a variable whose condition depends on) is defined for each random variable, and a Directed Acyclic Graph (DAG) is formed. Each node in the graph represents a variable and each edge represents a conditional dependency. For each nodeGenerating a conditional probability tableRepresenting the probability that a node will take a different value given a parent node.

And (3) checking data timeliness: monitoring the update frequency of data by utilizing time sequence analysis and real-time stream processing technology, and ensuring that the real-time requirement of the service on the data is met; .

User-defined verification: the user can configure the data quality standard and the detection rule by himself, check the ingested data in real time, and can customize the rule to check whether the data meets the expected format, the value interval, the statistical information and the like so as to meet the specific service requirement.

Step two: and (5) classifying and identifying data. After the data resource is established, the system can combine metadata format information and a sampling mode to determine the type and classification strategy of the data resource; then, the user can select the range of the data list to be followed according to the industry and the region where the user is located; after the front-end operation is finished, the system can carry out final data classification and classification recognition based on the large model.

The method for realizing the classified and hierarchical identification of the data comprises the following steps:

Classification hierarchical recognition strategy of data resources: the data format supported by the system and the corresponding identification strategy comprise the following 4 cases:

a) Classifying hierarchical objects into various fields of the structured data such as a table file, a data table in a structured database, a table stored in an information system and the like;

b) Classifying hierarchical objects as paragraphs in the unstructured data of the document class;

c) For unstructured data of the type of design drawings, industrial control programs and system monitoring data, classifying and grading objects into a whole;

d) Other unidentifiable types of data will not be classified.

Determining the type of the data list: after the classification strategy and the object are clearly classified, the user can select the range of the data list to be followed according to the situation of the user, and the data list recorded by the system comprises:

a) The general list mainly comprises personal data, sensitive personal data and definition and corresponding description of data types obviously violating relevant laws and regulations or public sequence colloquial;

b) A regional list of data standards and management requirements for the specific region state-owned enterprise;

c) The industry list mainly comprises data standards and management requirements which are promulgated by industry authorities such as finance and automobiles and are complied with by industry state-owned enterprise.

Classification and classification identification: the system relies on a vertical domain large model agent that is fine-tuned for a particular task when performing classification hierarchical recognition. The intelligent agent has the characteristics of being compatible with a plurality of open-source or closed-source large models in a hot plug mode, can automatically generate the existing corpus according to specific requirements of the large models, and has the functions of automatically searching related knowledge and accurately classifying and identifying each data object by training a large amount of field data and actual service scenes related to data classification and classification. For classification grading results generated by the large model, the system uses a self-developed CE-F1 function for classification optimization. And ensuring that the classification capacity of the model meets the service requirement by integrating the weighted results of the indexes. The CE-F1 function fuses the functions of the Cross entropy loss function (Cross-Entropy Loss) and the class assessment index F1-score. Wherein the cross entropy loss function is used for measuring a single error coefficient of the classification result, F1-Score represents a quality weight Partial-weight of the Recall result obtained by calculating a harmonic average of an accuracy rate (Precision) and a Recall rate (Recall), and is used for measuring performance of the classifier when the unbalanced data set is processed. The mass weight calculation is shown in formulas (8) - (10):

(8)

(9)

(10)

Where Partial-weight represents the weight of the F1-score portion of the CE-F1 function, the final value of which is harmonically averaged with the cross entropy loss function calculation result. TP (True Positives) represents the number of samples for which the positive example sample is predicted to be positive. FP (False Positives) represents the number of samples that the counterexample samples mispredict as positive. FN (False Negatives) represents the number of positive examples that are mispredicted as negative examples. TN (True Negatives) represents the number of counterexample samples that the counterexample samples correctly predict as counterexample samples. The recognition and evaluation process uses a knowledge graph and an adaptive learning mechanism to improve recognition accuracy and evaluation dimension, wherein the knowledge graph provides concepts and relations in the field, so that the intelligent agent can consider wider semantic association in classification; the adaptive learning mechanism then ensures that the agent can adjust its classification strategy to maintain high accuracy in the face of new or changing data. Knowledge maps provide concepts and relationships within the domain, enabling the agents to consider broader semantic associations in classification. The adaptive learning mechanism then ensures that the agent can adjust its classification strategy to ensure high accuracy in the face of new or changing data. In order to ensure that the classification and grading results not only have technical accuracy, but also meet the actual business requirements, the system allows users to manually intervene and adjust the automatically generated classification and grading results. The system provides a user-friendly auditing interface in which the user can intuitively view and modify the classification results. Through functions such as clicking, batch operation and the like, a user can quickly adjust classification labels or classification grades. The user may also customize classification rules or classification criteria according to specific business scenarios to support data management criteria or other relevant requirements that may exist within the enterprise. The system supports the continual improvement of classification agents through user feedback. Each manual adjustment is recorded and used to update the model or adjust the classification rules to improve the accuracy of the agent in future recognition tasks.

Step three: and (5) a circulation transaction. For the data resources which are aggregated and configured with quality detection and classification, the system supports the data query interface which is self-contained, and upgrades and builds the interface into a data product by configuring the information of specific product information; the data products can be uniformly recorded by the platform and can be traded to users in a specified range or all other users; the platform can record all transaction behaviors, and is convenient for subsequent tracing. The method comprises the following steps of:

Upgrading the data resource into a data product: for data resources for which quality detection rules and classification hierarchical configuration have been completed, the system supports the rapid upgrade of the corresponding data query interface to data products through additional configuration. The data products can be freely configured with query conditions, sorting basis, return quantity, initial positions and the like, so that a product owner can flexibly adjust the tradable range, and meanwhile, the use flexibility and convenience of product consumers are improved.

Data product recommendation: the system uses a self-developed real-time recommendation algorithm MetaGraphFusion that can recommend the most appropriate data products to the user based on the user's historical behavior, preferences, and current needs. The specific working principle of the algorithm MetaGraphFusion can be divided into an offline part and an online part:

Offline part: the system discovers common metadata association patterns by analyzing historical data stream configurations, and builds and optimizes complex relationships between data sources and metadata by using a Graph Neural Network (GNN). The main tasks of the offline part are data preprocessing and model training. And extracting frequently-occurring metadata item sets from the historical data stream configuration in a manner similar to an Apriori algorithm, identifying a common association mode, calculating the Support degree (Support) of each metadata item set by analyzing the historical data, generating association rules, and recording the co-occurrence relationship among the metadata. The system then constructs a graph structure of metadata and data sources based on the output of the association rule, and sets the association rule as an edge and the weight of the edge as a confidence (Confidence) by taking the metadata item set as a node in the graph, thereby forming a graph reflecting the complex relationship between metadata. Finally, the system trains the constructed graph using the graph neural network to learn the feature representation of nodes (metadata) and edges (associations). The system continuously optimizes the characteristic representation of the nodes through a graph rolling operation (Graph Convolution) so that the characteristic representation can reflect the relevance and complex relationship between the metadata and the data source.

On-line part: the system analyzes the matching relation between the selection condition of the metadata and the data source in real time according to the data stream which is being constructed by the user, detects the metadata item set selected by the current user, and maps the metadata item set to the trained GNN model to obtain the related characteristic representation. And carrying out real-time online matching by using association rules and GNN characteristic representations in the offline model, recommending a metadata combination and data source matching scheme with high confidence, inquiring the GNN model according to the metadata selected by the current user, finding other metadata item sets highly related to the GNN model, and providing recommendation through confidence score. In the process of constructing the data stream by the user, the recommendation strategy is optimized in real time, the dynamic requirement of the user is adapted, the recommendation list is dynamically updated by continuously monitoring the selection and adjustment of the user, and retraining or online fine tuning is triggered when necessary to ensure the accuracy of recommendation.

Data product transaction: all data products can be uniformly recorded with description information by the platform, so that the transaction of the data products is supported, and the transaction parties can perform matching and searching of the data products by using the functions of the system. After the data products are determined, long-term or batch purchase can be carried out by adopting an off-line business process according to the callable time or the total calling times, or batch purchase can be carried out by a platform on-line transaction mode.

Transaction information management: the system can automatically record the transaction condition, including transaction occurrence time, transaction information, transaction type information and sampling and storing evidence of the query result, thereby meeting the requirement of storing evidence of the data transaction process and ensuring the subsequent traceability.

Step four: and (5) value evaluation. The system provides three main evaluation methods of cost evaluation, income evaluation and transaction evaluation, and supports the asset configuration by taking data resources as units, wherein the configuration items comprise the following 4 aspects:

Right confirmation: the user first needs to confirm and declare the ownership of the data asset to be generated. The system ensures legitimacy and non-tamper ability of rights statements through smart contract and blockchain techniques. When a user confirms rights, he can verify his ownership or control of the data resource using a rights verification algorithm (e.g., digital signature verification and public key encryption) provided by the system. Only content in data resources that the user is legally in possession of or controlling can be upscaled to ensure the legitimacy and compliance of the upscaling.

Cost collection: the system allows a user to configure the cost of the data resources in detail from generation to maintenance. The method specifically comprises the links of data generation, acquisition, processing, collection, maintenance and the like. The system uses a self-developed Activity-based basic cost algorithm (Activity-Based Costing, ABC) for cost accounting. The proposed algorithm will automatically aggregate the costs and generate the total cost data. These cost data will be used for subsequent data asset cost evaluations to accurately reflect the actual economic investment of the data asset. The campaign basis cost approach is a refined cost calculation method that can assign indirect costs to specific campaigns related to a product or service, thereby accurately measuring the contribution of each campaign to the total cost. For data resources, the activity base cost method can help users identify costs in each link and aggregate them into overall cost data. Cost-aggregation systems developed based on active-basis cost algorithms allow users to customize the cost of each step, which typically includes the generation, collection, processing, aggregation, and maintenance of data. Based on these parameters, the system aggregates costs into total cost data by an activity-based base cost algorithm. For each data resource, it is assumed that it contains N active steps, each step having a cost of respectivelyThen the total cost can be calculated using equation (11)：

(11)

Wherein the cost of each activity stepCan be calculated by equation (12):

(12)

Wherein the method comprises the steps of Representing the resource costs associated with activity i, such as manpower, equipment, energy, etc.Representing activity driving factors, representing the extent or amount of resources consumed by activity i, such as man-hours, amount of data processed, etc.

And (5) profit measurement: the system comprehensively evaluates the value contribution of the data resources to enterprise operation and overall business through real-time statistics and analysis functions. The system counts the access condition of the user on the system interface and the use frequency of the query interface, evaluates the use condition of the data resource in different time periods and different user groups by using the self-developed ARI-MLR algorithm, and further combines the operation index of the enterprise to quantify the contribution of the data resource to the whole business of the enterprise and endow the enterprise with additional economic value. The core idea of the ARI-MLR algorithm is to analyze and predict the use condition of data resources in different time periods by using formulas (13) and (14), so as to identify important time periods such as peak time period, valley time period and the like.

(13)

Wherein the method comprises the steps ofIs the time ofRepresenting the number of transactions, accesses, or frequency of use occurring within a particular time period,Is a constant term, used to represent the long-term average level of data in the time series,Is the order of the autoregressive term, which indicates how many observations (or hysteresis values) at past times were introduced into the model to predict the current value, can help capture the continuity and autocorrelation of the data,Is an autoregressive coefficient used for representingThe effect of the access amount at each point in time on the current access amount prediction,Is the order of the moving average term, which indicates how many past time error terms were introduced in the model to predict the current value,Is a moving average coefficient representing the pastThe effect of the error term at each point in time on the current prediction,Is a white noise error term that represents an unpredictable random variation such as an increase or decrease in the amount of access to a burst. Then, the system evaluates the influence of the frequency of use of the data resources, the service relevance and the uniqueness of the data on the enterprise operation index, and the calculation method is shown in a formula (14):

(14)

Wherein the method comprises the steps of Is a dependent variable such as business operations performance,Is an intercept term., ,…,Is the regression coefficient of the respective variable,, ,…, Is an independent variable (e.g. frequency of use of data resources, business relevance, data uniqueness, etc.),Is an error term.

Transaction statistics: the system can comprehensively count the transaction conditions of the data resources on the platform, including historical transaction frequency, transaction price and future transaction expectation. The system predicts future transaction trends by using a self-developed time series prediction algorithm Prophet model, and combines records of historical data transactions to generate transaction evaluation results. Through these statistics, users can be given additional added value based on the market performance and future potential of the data resources. Propset is a flexible and powerful time series prediction algorithm, particularly suitable for time series data with obvious trends and seasonality. The Prophet model predicts by decomposing the time series into parts of trend, seasonal and holiday effects, etc., and the core idea of the Prophet model is embodied as formula (15):

(15)

Wherein the method comprises the steps of Is the time ofSuch as transaction frequency or price,Is a trend function, used to capture long-term growth or decay,Is a seasonal function reflecting periodic fluctuations in the transaction data.Is a holiday effect function for treating the influence of special time periods such as holidays.Is an error term representing random fluctuations or noise.

After the rights verification, cost aggregation, profit measurement, and transaction statistics are completed, the system automatically generates a comprehensive value calculation formula. The formula combines the cost and added value of the data resource, and supports the user to carry out packing evaluation on the data set in any range. The system uses linear programming and optimization algorithm to automatically calculate the final value according to the size, the number, the use frequency and other factors of the packed data set. The calculation formula and the result provide a standardized asset assessment tool for users, and support enterprises to reasonably price and manage data assets under different business scenes.

Step five: asset generation and entry into a table. The system support can package and generate data resources periodically, and can perform global statistics and management on currently effective data assets, and view content details of any specified data assets. The steps for realizing asset generation and table entry are as follows:

Asset configuration: users can perform asset configuration on data resources in the system according to own requirements. The system supports a user to set the generation period of the data asset by himself, the period determines the time and frequency of the data asset formed by the data asset being packaged regularly, the system uses a self-developed ARI-Prophet model to dynamically optimize the generation period of the data asset, and the ARI-Prophet model is based on the Prophet model in step 4) and the formula (13) in step 3) in step four to balance weights of the Prophet model, predict the future data growth trend, and therefore recommend the optimal asset generation period for the user. The user may choose to package every natural month, quarter, half year, or year. In addition, the user may set the expected lifecycle of the data generation asset based on the content characteristics and manner of use of the data. The periodic task schedule Cron Job scheduling mechanism is implemented to ensure that data assets are automatically generated at predetermined points in time.

Automatically packaging assets to be audited: and automatically extracting new content in a period from the original data, packaging the new content into static and unchangeable to-be-checked data assets, using DELTA LAKE as a data segmentation and snapshot implementation means in the process, introducing a distributed data processing algorithm based on MapReduce as a data segmentation and aggregation algorithm, segmenting and aggregating the data resources into the most suitable asset package in an intelligent mode according to the characteristics of the data (such as data type, data quantity, access frequency and the like). Ensuring that the contents of each package have integrity and consistency. After the data asset to be audited is formed, the system marks the data asset to be audited as a state and informs the user of subsequent operation. MapReduce is a distributed computing framework, and is mainly used for processing large-scale data and generating abstracts. The system introduces a distributed data processing algorithm based on MapReduce to segment and aggregate large-scale data efficiently, and decompose the original data into a plurality of small data blocks (i.e. "key value pairs"). According to the characteristics of the data, each data block is distributed to different computing nodes for processing, and the implementation method is shown in a formula (16).

(16)

Wherein, Is an input key for the data block ID,Is an input value for the data content,Is an intermediate key containing information such as segmented data identification,Is the median value of the segmented data block. The system being based on intermediate keysAnd grouping and reassigning Map outputs, aggregating data blocks of the same key together, processing the aggregated data blocks, and merging the aggregated data blocks into final output data. This process ensures data integrity and consistency and generates the final pending data asset, as shown in Reduce function equation (17):

(17)

Wherein the method comprises the steps of Is an intermediate key, and is provided with a key,Is a list of intermediate values output by the Map stage,Is the data to be packed and is the data to be packed,Is the final output value (i.e., the packaged data asset).

Data asset auditing: the system uses a self-developed random forest algorithm based on Bootstrap sampling to analyze the historical auditing records and the characteristics of the data assets, automatically carries out preliminary classification and scoring on the data assets to be audited, and provides auditing suggestions for users. The user can determine whether to generate the data asset to be checked and the auditing suggestions provided by the system according to the actual conditions, and the system can automatically record all auditing conditions. A random forest algorithm based on Bootstrap sampling is an integrated learning method, and classification accuracy and robustness of a model are improved by constructing a plurality of decision trees and voting the results. Each decision tree uses a different random subset (including samples and features) in the construction process, which enables random forests to effectively reduce the overfitting phenomenon and perform well when processing high-dimensional data. A plurality of subsets are randomly extracted from the historical audit record and data asset signature, each subset being used to train a decision tree. This process is called boottrap sampling, i.e. the samples are decimated with a put back. As each decision tree is built, for each node's split, some features are randomly selected (instead of using all features) to perform a search for the best split. The step reduces the correlation among decision trees and improves the generalization capability of the integrated model. Each decision tree is formed by recursively splitting selected features until stopping conditions (e.g., node purity, maximum depth, etc.) are met. The splitting of the random forest algorithm based on Bootstrap sampling adopts an information gain method shown in a formula (18):

(18)

Wherein the method comprises the steps of Is a nodeEntropy of (c) representing a data setFor determining the purity or uncertainty ofWhether the types of data in (a) are similar,Is a feature, represents an attribute or field of particular metadata, such as the type of data, source, timestamp, etc.,Is characterized byTake the value ofSubset of times according to characteristicsFrom different valuesA subset of the extraction steps is performed,Is a nodeRepresenting the total data volume in the data set.

In the auditing phase, the data asset to be audited passes through all decision trees, each of which classifies the asset. The random forest model ultimately classifies the asset by means of majority voting and generates a classification score (typically in the form of a probability, such as a probability of belonging to a certain class) for it. Through the above procedure, the system may generate a classification label (e.g., "high priority", "low risk" or "need to be reviewed") and a corresponding score (e.g., 0.85, indicating 85% confidence that the asset belongs to a certain category) for each data asset to be reviewed. The score is based on the output of a Bootstrap sampled random forest algorithm, which is a weighted average of the voting results of multiple decision trees. The user can check the data assets to be checked and the checking suggestions provided by the system, and decide whether to generate the assets according to the actual conditions. The system records each audit decision of the user, including whether to adopt the system advice, the specific time of the audit, the auditor and the like. These records will be used to update the random forest model to improve the accuracy of future audits.

Unified management of data assets: the formed data assets are uniformly managed by the system as valuable non-asset capacity, and a user can check the current data asset list, the total value, the amortization condition and the like:

according to the life cycle of the asset, the system adopts a linear amortization method to automatically amortize, and performs statistics according to a common financial format each month, so as to support the downloading exported in a form of a table.

In support of any given data asset, the user may upload a three-way assessment report to assist in its value claims.

And automatically integrating and generating various contents such as price composition, total price of the asset, quality detection result, classification condition, three-party evaluation information and the like of the data asset according to user configuration for the user to view or export.

Referring to fig. 3, the present invention further includes a dynamic large model access and agent construction and management module. The module is used as an important auxiliary module of the system, provides support for various tasks such as data stream construction, classification and classification recognition, data product matching recommendation and the like, can provide support for an ARI-Prophet model, an AP-GNN and other unsupervised/semi-supervised feature modeling algorithm, a bidirectional multidimensional filtering mechanism and other recommendation algorithms and the like used in the process steps, integrates the advantages of a large language model and other artificial intelligence algorithm models, can be suitable for specific tasks with strong pertinence and high verticality and fuzzy tasks with high freedom requirements, can meet the characteristics and requirements of numerous task types and strict output format requirements of the system, and supports long-term use and continuous optimization.

The dynamic large model access and intelligent body construction and management module has the automatic and long-term self-optimization capacity, and the implementation steps of the dynamic large model access and intelligent body construction and management module are as follows:

Step two: generating an Agent: the method is used for generating Agent corresponding to specific type tasks or creating copies based on existing agents for experimental development or further fine tuning, and is oriented to tasks such as data stream construction, asset discovery, map fusion and the like. Any intelligent agent supports two behavior modes of executing with a predefined flow and autonomously judging the flow, and the behavior mode can be dynamically designated when the intelligent agent is used. When the Agent is created, the fine tuning model is automatically created at the same time, and two main functions of calling and feedback are provided, and other functional components can call the Agent to complete the task.

Step three: acquiring an agent execution result: other modules in the system can be transmitted into the input meeting the expected format through calling the function, and the execution result of the intelligent agent is obtained. After the execution result is obtained, other modules can dynamically generate the original corpus required by the fine adjustment of the agent according to feedback and modification of the execution result by a user or other predefined rules, feed back the original corpus through an agent feedback function, locate information during execution according to a task ID, input a large model during calling by combining the information, and store the result in the feedback as a fine adjustment corpus.

Step four: determining the fine adjustment requirement of an agent: the dynamic large model access and agent construction and management module can determine strategies such as trimming frequency, tendency and the like in a global configuration mode, and the module can globally coordinate trimming requirements of each agent according to configured time and frequency and dynamically plan trimming execution time of each agent.

Step six: managing and multiplexing fine-tuning corpus: the stored and solidified fine tuning corpus is anonymized, the stored content is not related to the content such as personal information of a user, and is only related to the corresponding task and the agents, when the module is integrally migrated to a new basic large model, each agent can re-execute the fine tuning task according to the history fine tuning record, so that the agents based on the new basic large model are trained necessarily, and the executing effect and quality are ensured.

Referring to fig. 4, the invention further includes a knowledge graph capable of autonomous intelligent fusion growth, which is used as an important component of the system of the invention, supports the targeted recommendation of users based on metadata information and user asset construction conditions ingested in data stream arrangement and quality control under the condition of obtaining explicit authorization of users, and can intelligently collect, pass and fuse the use conditions of users into proper positions in the graph according to common enterprise data and asset conditions of the same industry, the same service type and links. The method for realizing the knowledge graph capable of realizing autonomous intelligent fusion growth comprises the following steps:

step one: constructing a basic structure of the map. Comprises the following 4 parts of contents:

1) Industry and concrete links: and the identifier is used as a Sub-Graph identifier for distinguishing enterprise conditions of different industries, different service types and links. Different Sub-graphies may have similar or completely different structures and components, and the overall knowledge Graph is composed of a plurality of Sub-graphies;

2) Raw data description information: as one of the main node types, for identifying the description of the original data type owned by the enterprise client in the specific Sub-Graph, each original data description information node has a weight value to describe the popularity of the original data description information node in the current Sub-Graph to the system, in addition, the original data description information node additionally bears the Schema information after fusion weighting and standardization, and the Schema information is used for auxiliary comparison when the node is matched with the current data situation of the user;

3) Data resource description information: as another type of node type, the node is used for identifying the description information of the specific scene of the enterprise client usage data in the specific Sub-Graph, and each data resource description information node is provided with a weight value so as to describe the popularity of the node in the current Sub-Graph to the system;

4) "composition/inclusion" relationship: connecting an original data description information node with a data resource description information node, and identifying the composition relation between the original data and the data resource. Such relationships are provided with additional descriptive information to summarize and describe the data processing and data stream construction process.

Step two: preparing overview map content. The system aims to help the system of the invention to be online, and ensure the basic use experience of the user when the actual use data of the user cannot be fully accumulated.

Step three: map-based data asset intent selection: the user may choose by himself whether to use the knowledge graph to suggest an asset and to provide operational assistance while allowing the system to collect macro descriptive information that does not relate to data specific load information, such as data item cases, data sheet descriptions, built asset descriptions, etc.

Step four: generating data asset recommendation based on the map: after the user selects to use the related service and agrees to collect and share, the system arranges semantic description information of the current global data, including the content in the accessed original data, which is acquired in the process of data flow arrangement and quality control, so as to judge the data type and basic condition owned by the current user, and compare the data type and basic condition with the existing knowledge Graph content of the system according to the scanning result, the user can select the industry, service type and link corresponding to the user to locate (or create new) Sub-Graph, and propose the subsequent data operation and capitalization according to the comparison result.

Step five: expanding and optimizing map content based on user information: when the user builds the data resource, the related information is anonymized and transmitted to the server, and the system updates the existing nodes and relations or creates new nodes and relations according to the situation. The whole process of the task realization is supported by the fine-tuned vertical domain large model intelligent agent, so that the accuracy of recommendation and fusion is improved.

In this document, the terms "upper", "lower", "front", "rear", "left", "right", "top", "bottom", "inner", "outer", "vertical", "horizontal", etc. refer to the directions or positional relationships based on those shown in the drawings, and are merely for clarity and convenience of description of the expression technical solution, and thus should not be construed as limiting the present invention.

In this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a list of elements is included, and may include other elements not expressly listed.

The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. The automatic data asset generation and management system based on the data operation technology is characterized by comprising the following functional modules:

1) Data stream arrangement and quality control: according to the data requirements and the application scene, building the data flow based on metadata, forming a logic data resource set facing the application scene, carrying out quality inspection on the logic data resource by configuring a quality inspection rule, and inspecting the data in the logic data resource set according to the quality inspection rule by the system to ensure that the quality of the data always meets the requirements and expectations;

2) Data classification and classification identification: determining the type and classification and grading strategy of the logic data resources by combining the format information and the sampling mode of the metadata; the user can select the range of the data list to be followed according to the industries and regions to be treated, and final data classification and grading identification can be carried out;

4) Value evaluation: performing value evaluation on the data asset through three modes of cost evaluation, income evaluation and transaction evaluation, and performing asset configuration by taking the data resource as a unit;

2. The automatic data asset generation and management system based on data manipulation technology of claim 1, wherein said data stream orchestration and quality control method is as follows:

1.4 Data resource construction and quality control: after the data stream is constructed, the system can absorb metadata according to the configuration of the metadata, generate a logic data resource set and display the logic data resource set in a user interface, and simultaneously support the configuration quality inspection rule of the data resource so as to ensure that the quality of the data always meets the requirements and expectations.

3. The automatic data asset generation and management system based on data manipulation technology of claim 1, wherein the method of data classification hierarchical identification is as follows:

4. The automatic generation and management system of data assets based on data operation technology according to claim 1, wherein the method of the circulation transaction is as follows:

5. The automatic data asset generation and management system based on data carrier technology of claim 1, wherein the method of value assessment is as follows:

4.1 Right-of-weight confirmation: the user first needs to confirm and declare the ownership of the data asset to be generated;

4.2 Cost collection): the system allows the user to configure the cost of the data resources in detail from generation to maintenance;

4.3 Revenue measurement: comprehensively evaluating the value contribution of the data resource to enterprise operation and overall business through the real-time statistics and analysis function;

6. The automatic data asset generation and management system based on data manipulation technology of claim 1, wherein the asset generation and tabulation method is as follows:

5.1 Asset configuration: asset configuration is carried out on data resources in the system according to the self requirements of users;

5.4 Unified management of data assets): the formed data assets are uniformly managed by the data operation technology as valuable non-asset capacity, and a user can check the current data asset list, the total value and the amortization condition.

7. The automatic data asset generation and management system based on data operation technology according to claim 2, wherein the task fine tuning method of the large model in step 1.2) is as follows:

(1)

wherein C is the number of categories of the classification, Is the actual tag, if the sample belongs to the i-th class, then=1, Otherwise=0; The probability that the model prediction sample belongs to the ith class is calculated through a softmax function;

(2)

where N is the number of samples in the training set; And The actual label and the predicted probability distribution of the j-th sample;

8. The automatic data asset generation and management system based on data operation technology according to claim 2, wherein the intelligently assisted drag data stream construction method in step 1.3) is as follows:

Based on the definition and description of the application scene, the AP-GNN algorithm is used for assisting the user in carrying out quick recommendation of metadata related to the user target; the AP-GNN algorithm firstly performs association rule mining, and searches all frequent item sets by gradually increasing the length of the item sets; if the frequent item set meets the prescribed support A threshold value indicating that the metadata association patterns occur more frequently in the historical data stream configuration;

wherein the degree of support The calculation is shown in formula (3):

(3)

Wherein, Representing a particular set of metadata items representing one or more field/data table combinations for computing their corresponding support，Is the frequency of occurrence of a set of metadata items in a historical data stream configuration, represents the frequency of occurrence of a particular field/data table combination,Is the total number of historical data stream configurations, representing the total number of times the system observes in intelligent auxiliary drag data stream construction;

generating association rules from frequent item sets and by confidence Evaluating reliability, confidence of ruleThe calculation formula (4) is shown as follows:

(4)

Wherein, AndIs a set of two items that are to be combined,Representing a certain set of metadata itemsRepresentative andAnother metadata item or combination of associations, association ruleRepresentation rules: if item setIf so, item setIt is also likely that it will be present,Is a collection of itemsIs used for the support of the (a),Is a collection of itemsSum item setThe degree of support that occurs simultaneously in the same data flow configuration;

the convolution operation is implemented as shown in equation (5):

(5)

Wherein, The characteristic representation of the representation node v at the k-th layer,A set of neighbor nodes representing node v,Is the normalized constant between node v and its neighbor node u,Is the firstThe weight matrix of the layer, σ is the ReLU activation function;

9. The automatic data asset generation and management system based on data operation technology according to claim 1, wherein the method for quality checking of logical data resources by configuring quality checking rules is as follows:

Assume that the data block is The hash function isThe hash value h of the data block D can be calculated with equation (6):

(6)

(7)

Wherein, The posterior probability of event A occurring in the event B occurring; Is the conditional probability of event B occurring in the event a occurring; Is the prior probability of event a; is the edge probability of event B;

10. A data asset automatic generation and management system based on data manipulation technology according to claim 3, said step 2.3) a method of data classification hierarchical identification is as follows:

The mass weight calculation is shown in formulas (8) - (10):

(8)

(9)

(10)

Wherein, partial-weight represents the weight of the F1-score part in the CE-F1 function, and the final value of the Partial-weight is subjected to harmonic average with the cross entropy loss function calculation result; TP represents the number of samples for which the positive sample is predicted to be positive; FP represents the number of samples that the counterexample samples mispredict as positive; FN represents the number of positive samples mispredicted as negative samples.

11. The automatic generation and management system of data assets based on data operation technology of claim 5, the cost aggregation method is as follows:

for each data resource, it is assumed that it contains N active steps, each step having a cost of respectively Then the total cost can be calculated using equation (11) ：

(11)

Wherein the cost of each activity stepCan be calculated by equation (12):

(12)

(13)

Wherein, Is the time ofRepresenting the number of transactions, accesses, or frequency of use occurring within a particular time period,Is a constant term, used to represent the long-term average level of data in the time series,Is the order of the autoregressive term, which indicates how many observations or hysteresis values at past times were introduced into the model to predict the current value, can help capture the continuity and autocorrelation of the data,Is an autoregressive coefficient to represent the effect of the amount of access at each point in time on the current amount of access prediction,Is the order of the moving average term, which indicates how many past time error terms were introduced in the model to predict the current value,Is a moving average coefficient representing the pastThe effect of the error term at each point in time on the current prediction,Is a white noise error term, and represents unpredictable random changes such as increase or decrease of burst access quantity;

(14)

the transaction statistics method comprises the following steps:

(15)

Wherein, Is the time ofIs used for the observation of the (a),Is a trend function, used to capture long-term growth or decay,Is a seasonal function, reflecting periodic fluctuations in transaction data,Is a holiday effect function for processing the influence of special time periods such as holidays and the like,Is an error term representing random fluctuations or noise.

12. The automatic data asset generation and management system based on data manipulation technology of claim 6, wherein the method of automatically packaging assets to be audited is as follows:

segmenting and aggregating data by adopting a distributed data processing algorithm based on MapReduce; according to the characteristics of the data, each data block is distributed to different computing nodes for processing, and the implementation method is as shown in a formula (16):

(16)

Wherein, Is an input key for the data block ID,Is an input value for the data content,Is an intermediate key containing segmented data identification information,Is the median value of the segmented data block;

the Reduce function is shown in equation (17):

(17)

Wherein, Is an intermediate key containing segmented data identification information,Is a list of intermediate values output by the Map stage,Is the data to be packed and is the data to be packed,Is the final output value;

the method for auditing the data asset comprises the following steps:

(18)

Wherein, Is a nodeEntropy of (c) representing a data setFor determining the purity or uncertainty ofWhether the types of data in (a) are similar,Is a feature, represents an attribute or field of particular metadata,Is characterized byTake the value ofSubset of times according to characteristicsFrom different valuesA subset of the extraction steps is performed,Is a nodeRepresents the total data volume in the data set.

13. The automatic data asset generation and management system based on data operation technology according to claim 1, further comprising a dynamic large model access and agent construction and management module, the implementation steps of which are as follows:

Step two: generating an Agent: the method comprises the steps of building a data stream, discovering assets, fusing a map, generating Agent corresponding to a specific type of task, or creating a copy based on existing agents to perform experimental development or further fine adjustment;

Step four: determining the fine adjustment requirement of an agent: the dynamic large model access and agent construction and management module can determine the fine tuning frequency and trend strategy thereof in a global configuration mode, and the module can globally coordinate the fine tuning requirements of each agent according to the configured time and frequency and dynamically plan the fine tuning execution time of each agent;

14. The automatic generation and management system of data assets based on data operation technology according to claim 1, further comprising a knowledge graph capable of autonomous intelligent fusion growth, wherein the implementation steps are as follows: