CN120354943A - Large model Agent intelligent decision method and system for fusing multi-mode data

Info

Publication number: CN120354943A
Application number: CN202510459338.1A
Authority: CN
Inventors: 生义; 王燕波; 王洋; 于越; 马凌雁; 王强昌
Original assignee: Inspur Smart City Technology Co ltd
Current assignee: Inspur Smart City Technology Co ltd
Priority date: 2025-04-14
Filing date: 2025-04-14
Publication date: 2025-07-22

Abstract

The present invention discloses a large-model Agent intelligent decision-making method and system for fusing multimodal data, which belongs to the technical fields of artificial intelligence, multimodal data processing, deep learning, reinforcement learning and intelligent decision-making. The technical problem to be solved by the present invention is how to improve the performance and adaptability of intelligent decision-making in processing complex tasks and dynamic environments. The technical scheme adopted is: multimodal data fusion: integrating text, image, and audio data from different modalities, and generating a unified feature representation through feature extraction and feature fusion technology; intelligent decision-making: performing decision reasoning based on the fused feature representation, and using a deep learning model and a reinforcement learning algorithm to generate the final decision result; adaptive learning: real-time monitoring of data changes and decision effects, and dynamic adjustment of deep learning model parameters and strategies; feedback optimization: further optimizing the performance of the deep learning model by collecting feedback information on decision results.

Description

Large model Agent intelligent decision method and system for fusing multi-mode data

Technical Field

The invention relates to the technical fields of artificial intelligence, multi-modal data processing, deep learning, reinforcement learning and intelligent decision making, in particular to a large-model Agent intelligent decision making method and system integrating multi-modal data.

Background

With the rapid development of artificial intelligence technology, intelligent decision systems play an increasingly important role in handling complex tasks and dynamic environments. However, conventional intelligent decision systems are generally capable of processing only a single modality of data, such as text or images, and it is difficult to fully utilize the rich information of multi-modality data. In addition, existing systems often lack sufficient adaptability and flexibility when facing dynamically changing environments, making it difficult to adjust decision strategies in real time.

Therefore, how to improve the performance and adaptability of intelligent decision making in complex task and dynamic environment is a technical problem to be solved.

Disclosure of Invention

The technical task of the invention is to provide a large model Agent intelligent decision method and a large model Agent intelligent decision system for fusing multi-mode data, so as to solve the problem of how to improve the performance and adaptability of intelligent decisions in complex task processing and dynamic environments.

The technical task of the invention is achieved as follows. The large-model Agent intelligent decision method fusing multi-modal data comprises:

Multi-modal data fusion: integrating text, image and audio data from different modalities, and generating a unified feature representation through feature extraction and feature fusion techniques;

Intelligent decision making: performing decision reasoning based on the fused feature representation, and generating a final decision result using a deep learning model (such as Transformer or BERT) and a reinforcement learning algorithm (such as DQN or PPO);

Adaptive learning: monitoring data changes and decision effects in real time, and dynamically adjusting the parameters and strategies of the deep learning model;

Feedback optimization: further optimizing the performance of the deep learning model by collecting feedback on the decision results.

Preferably, the multimodal data fusion is specifically as follows:

Data collection and preprocessing: collecting multi-modal data in real time or in batches from channels such as social media, sensor networks and public databases, using an API (application program interface) or a crawler tool appropriate to each data type, and preprocessing the collected data of each type to obtain preprocessed data;

Feature extraction: for data of each modality, extracting features with a feature extraction technique suited to that modality to obtain the corresponding features;

Feature fusion: fusing the features of the different modalities to generate a unified feature representation, using fusion strategies such as early fusion, intermediate fusion and late fusion to ensure that the features are comprehensive and effective.

More preferably, the data preprocessing of the collected data of different types is specifically as follows:

for text data, performing word segmentation, part-of-speech tagging and named entity recognition using natural language processing (NLP) tools (such as NLTK and spaCy), and performing deep semantic understanding through a BERT model;

for image data, performing basic operations such as resizing, cropping and rotation using OpenCV, and performing object detection and classification with deep learning methods (such as YOLO and SSD), to provide high-quality input for subsequent feature extraction;

for audio data, performing preprocessing operations such as sample-rate conversion, denoising and volume normalization using the Librosa library, and converting the audio signal into a Mel spectrogram, a form suitable for processing by machine learning models;

For data of each modality, features are extracted with a corresponding feature extraction technique, specifically as follows:

text feature extraction: in addition to the BERT model, supplementing the feature representation with traditional NLP methods such as TF-IDF and Word2Vec, so as to capture more context information;

image feature extraction: in addition to ResNet, introducing convolutional neural network models such as Inception-V3 and VGG16, and selecting the most suitable architecture for each scenario to capture features more accurately;

audio feature extraction: in addition to the MFCC algorithm, using advanced audio feature extraction techniques such as Perceptual Linear Prediction (PLP) to improve the understanding of speech signals.

Preferably, the deep learning model is specifically as follows:

The deep learning model is built on architectures that perform well in natural language processing tasks and have been extended to image and audio data; specifically, image features are processed using ViT (Vision Transformer), text content is understood through a pre-trained BERT model, and audio signals are processed using WaveNet;

To better capture the correlations between modalities, a cross-modal attention mechanism is introduced, which allows the deep learning model to dynamically adjust the importance weight of each modality according to context;

the reinforcement learning algorithm is specifically as follows:

Parameter adjustment strategy: in addition to gradient descent algorithms (such as Adam and RMSprop), adaptive learning rate methods (such as AdaGrad and Adadelta) and more recent optimizers such as LAMB (Layer-wise Adaptive Moments for Batch training) are considered, so as to improve the efficiency and effectiveness of large-scale distributed training;

Transfer learning and fine-tuning: for a specific task, a pre-trained model is used as the starting point and adapted quickly to the new task through a fine-tuning strategy, and adversarial training is used to reduce the risk of overfitting and improve generalization;

Dynamic algorithm selection: based on task requirements and data characteristics, the agent dynamically selects the most suitable learning algorithm from a rich algorithm library; a support vector machine (SVM) or random forest is preferred when processing structured data, and generative adversarial networks (GANs) or diffusion models are used when facing unstructured data, generating high-quality image or video data for data augmentation or anomaly detection.

Preferably, the adaptive learning is specifically as follows:

Real-time monitoring and analysis: collecting data changes and decision effect indicators in real time through system monitoring tools (Prometheus, Grafana); predicting future trends with a time series analysis method (the ARIMA model) to provide a basis for subsequent model adjustment; creating real-time dashboards with Grafana or other visualization tools to display changes in each key performance indicator (KPI) intuitively; and periodically generating detailed performance analysis reports to help developers understand the running state of the system and make corresponding adjustments; wherein the decision effect indicators comprise hardware performance indicators (CPU utilization, memory usage, network latency and disk I/O) and model performance indicators (accuracy, recall and F1 score);

Dynamic adjustment: based on the monitoring results, performing parameter optimization, in which hyperparameter optimization techniques (such as Bayesian optimization and random search) are applied to automatically adjust key parameters of the deep learning model;

Anomaly detection: identifying abnormal patterns in the data using an integrated machine learning algorithm (autoencoders) and adjusting the decision strategy in time; once an anomaly is detected, a corresponding response mechanism is started immediately, an alarm is sent through Prometheus to notify the relevant personnel, and the current operation is suspended.

Preferably, the feedback optimization is specifically as follows:

Feedback collection: providing multi-modal feedback channels that collect ratings (1 to 5 stars), check boxes (correct/incorrect marks) and text comments (with satisfaction extracted by NLP sentiment analysis) through a user interface (UI); processing the real-time feedback stream by receiving feedback events with a Kafka message queue and performing de-duplication (based on UUID), normalization (mapping to the 0-1 interval) and multi-modal association with Flink; and evaluating feedback confidence at the same time, using a GAN to detect fake feedback (such as click-farming) and filter out abnormal data;

Performance evaluation: evaluating model performance with quantitative metrics (accuracy, recall, F1 score and mean squared error), and using Jenkins to trigger the LMMs-Eval framework on a schedule to generate a multidimensional evaluation report;

Optimization and adjustment: adjusting key model parameters in a targeted manner according to the feedback information and performance evaluation results, searching for a global optimum with advanced optimization methods such as Bayesian optimization, genetic algorithms and neural architecture search (NAS), and, when the existing model is found to perform poorly in certain scenarios, continuously improving decision quality through interaction with the environment using the GRPO algorithm.

A large-model Agent intelligent decision system fusing multi-modal data uses a centralized configuration management system to manage the configuration parameters of every module uniformly and implements a unified logging and monitoring system, wherein the system comprises:

the multi-mode data fusion module is used for integrating text, images and audio data from different modes and generating unified feature representation through feature extraction and feature fusion technologies;

the intelligent decision engine is used for carrying out decision reasoning based on the fused characteristic representation, and generating an optimal decision by adopting a deep learning model (such as Transformer, BERT and the like) and a reinforcement learning algorithm (such as DQN, PPO and the like);

the self-adaptive learning module is used for monitoring data change and decision effect in real time and dynamically adjusting model parameters and strategies;

The feedback optimization module is used for further optimizing the performance of the model by collecting feedback information of the decision result;

The multi-modal data fusion module adopts early, intermediate and late fusion strategies; the intelligent decision engine combines supervised learning and reinforcement learning, using supervised learning to update the model quickly when labeled data is available and using reinforcement learning to optimize the strategy for unlabeled data or exploratory tasks; the adaptive learning module uses Prometheus and Grafana for data collection and visualization; and the feedback optimization module uses the LMMs-Eval automated test framework for periodic evaluation.

Preferably, the multi-mode data fusion module includes:

the data acquisition and preprocessing sub-module is used for acquiring multi-mode data from different data sources and preprocessing the multi-mode data;

the feature extraction submodule is used for using a special feature extraction technology aiming at data of different modes;

The feature fusion sub-module is used for fusing the features of different modes to generate a unified feature representation;

the intelligent decision engine comprises:

The deep learning model submodule is used for processing the fused feature representation with a Transformer or BERT model;

the reinforcement learning algorithm submodule is used for optimizing a decision strategy by combining with the DQN and PPO algorithms;

the decision generation sub-module is used for synthesizing the output of the deep learning model and the reinforcement learning algorithm and generating a final decision result;

The adaptive learning module includes:

the real-time monitoring and analyzing sub-module is used for collecting data change and decision effect indexes in real time through the system monitoring tool;

the dynamic adjustment sub-module is used for dynamically adjusting model parameters and strategies according to the monitoring result;

an anomaly detection sub-module for identifying an anomaly pattern in the data using an integrated machine learning algorithm;

the feedback optimization module comprises:

the feedback collection sub-module is used for collecting feedback information of the decision result of the user or the environment;

The performance evaluation sub-module is used for evaluating model performance with the quantitative metrics of accuracy, recall and F1 score;

And the optimization adjustment sub-module is used for triggering the retraining or parameter adjustment of the model according to the feedback information and the performance evaluation result.
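The metrics named for the performance evaluation sub-module can be illustrated with a minimal sketch. It assumes a binary decision task with one positive class; the function name `classification_metrics` and the label encoding are hypothetical and are not taken from the specification.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 score for one positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: 1 = decision judged correct, 0 = judged incorrect.
report = classification_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```

A report of this shape is the kind of input the optimization adjustment sub-module could compare against a threshold before triggering retraining; the threshold itself is not specified in the source.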

An electronic device includes a memory and at least one processor;

Wherein the memory has a computer program stored thereon;

the at least one processor executes the computer program stored by the memory, such that the at least one processor performs the large model Agent intelligent decision method of fusing multi-modal data as described above.

A computer readable storage medium having stored therein a computer program executable by a processor to implement a large model Agent intelligent decision method incorporating multimodal data as described above.

The large model Agent intelligent decision method and system for fusing multi-mode data has the following advantages:

The invention can effectively integrate and process the data from different modes, fully utilize the rich information of the multi-mode data and improve the accuracy of decision;

the invention can keep the best performance in the dynamic environment and adapt to the new data and task demands through real-time monitoring and dynamic adjustment;

the invention combines the deep learning model and the reinforcement learning algorithm, can quickly generate accurate decision results, and continuously improves the performance through a feedback optimization mechanism;

The centralized configuration management and unified log recording and monitoring system simplifies the maintenance work of the system, and the automatic test framework and the real-time performance monitoring reduce the workload of manual monitoring and fault elimination;

through multi-mode data processing, a deep learning model and a reinforcement learning algorithm, the performance and the adaptability of the intelligent decision in processing complex tasks and dynamic environments are improved, and the accuracy and the efficiency of the intelligent decision are improved;

Compared with the prior art, the method can effectively process multi-mode data, improve the accuracy and adaptability of decision making, and is suitable for complex and changeable environments and task demands.

Drawings

The invention is further described below with reference to the accompanying drawings.

FIG. 1 is a flow chart of the large model Agent intelligent decision method fusing multi-modal data;

FIG. 2 is a schematic flow diagram of intelligent decision making;

FIG. 3 is a schematic structural diagram of the large model Agent intelligent decision system fusing multi-modal data.

Detailed Description

The large model Agent intelligent decision method and system fusing multi-mode data of the invention are described in detail below with reference to the accompanying drawings and specific embodiments.

Example 1:

As shown in FIG. 1, this embodiment provides a large model Agent intelligent decision method fusing multi-modal data, which specifically includes the following steps:

S1, multi-mode data fusion, namely integrating text, image and audio data from different modes, and generating unified feature representation through feature extraction and feature fusion technology;

S2, intelligent decision making, namely decision making reasoning is carried out based on the fused characteristic representation, and a final decision result is generated by adopting a deep learning model (such as Transformer, BERT and the like) and a reinforcement learning algorithm (such as DQN, PPO and the like);

S3, self-adaptive learning, namely monitoring data change and decision effect in real time, and dynamically adjusting parameters and strategies of a deep learning model;

And S4, feedback optimization, namely further optimizing the performance of the deep learning model by collecting feedback information of the decision result.
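Step S2 names DQN as one candidate reinforcement learning algorithm. DQN approximates the action-value function with a neural network; the sketch below swaps in a plain lookup table so that the temporal-difference update at its core is visible without a deep learning framework. The state names, the action list and the hyperparameter values are hypothetical.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """Apply one temporal-difference update to the table q and return Q(s, a)."""
    # Bootstrapped target: immediate reward plus discounted best next value.
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Hypothetical decision actions and an empty value table (all values start at 0.0).
ACTIONS = ["approve", "escalate"]
q_table = defaultdict(float)
q_learning_update(q_table, "s0", "approve", reward=1.0, next_state="s1", actions=ACTIONS)
```

In a DQN, the table lookup would be replaced by a forward pass of the network over the fused feature representation, and the same target would serve as the regression label.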

The multi-mode data fusion in step S1 of this embodiment is specifically as follows:

S101, data collection and preprocessing: collecting multi-modal data in real time or in batches from channels such as social media, sensor networks and public databases, using an API (application program interface) or a crawler tool appropriate to each data type, and preprocessing the collected data of each type to obtain preprocessed data;

S102, feature extraction: for data of each modality, extracting features with a feature extraction technique suited to that modality to obtain the corresponding features;

S103, feature fusion: fusing the features of the different modalities to generate a unified feature representation, using fusion strategies such as early fusion, intermediate fusion and late fusion to ensure that the features are comprehensive and effective.
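The difference between two of the fusion strategies named in S103 can be sketched as follows. This is a simplified illustration, assuming early fusion means concatenating per-modality feature vectors and late fusion means a weighted average of per-modality decision scores; the function names, feature values and weights are hypothetical.

```python
def early_fusion(features_by_modality):
    """Concatenate per-modality feature vectors into one joint vector."""
    fused = []
    # Iterate in sorted key order so the layout of the joint vector is stable.
    for name in sorted(features_by_modality):
        fused.extend(features_by_modality[name])
    return fused


def late_fusion(scores_by_modality, weights):
    """Combine per-modality decision scores with a weighted average."""
    total = sum(weights[m] for m in scores_by_modality)
    return sum(scores_by_modality[m] * weights[m] for m in scores_by_modality) / total


# Hypothetical feature vectors and scores for the three modalities.
joint = early_fusion({"text": [0.1, 0.2], "image": [0.3], "audio": [0.4, 0.5]})
score = late_fusion({"text": 0.8, "image": 0.6, "audio": 0.4},
                    {"text": 2.0, "image": 1.0, "audio": 1.0})
```

Intermediate fusion, the third strategy, would merge hidden representations inside the model and is therefore not expressible without a concrete network.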

In step S101 of this embodiment, the data preprocessing of the collected data of different types is specifically as follows:

S10101, for text data, performing word segmentation, part-of-speech tagging and named entity recognition using natural language processing (NLP) tools (such as NLTK and spaCy), and performing deep semantic understanding through a BERT model;

S10102, for image data, performing basic operations such as resizing, cropping and rotation using OpenCV, and performing object detection and classification with deep learning methods (such as YOLO and SSD), to provide high-quality input for subsequent feature extraction;

S10103, for audio data, performing preprocessing operations such as sample-rate conversion, denoising and volume normalization using the Librosa library, and converting the audio signal into a Mel spectrogram, a form suitable for processing by machine learning models.
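Two of the audio operations in S10103 reduce to short formulas. The sketch below uses only the Python standard library rather than Librosa. It assumes volume normalization means peak normalization and uses the HTK-style Mel formula, which is one common convention and not necessarily the one a given library applies by default.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to the Mel scale (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def peak_normalize(samples, peak=1.0):
    """Scale samples so that the largest absolute value equals `peak`."""
    max_abs = max((abs(s) for s in samples), default=0.0)
    if max_abs == 0.0:
        return list(samples)  # silence or empty input: nothing to scale
    scale = peak / max_abs
    return [s * scale for s in samples]


# Hypothetical waveform fragment.
normalized = peak_normalize([0.5, -0.25, 0.1])
```

A Mel spectrogram applies this frequency warping to the bins of a short-time Fourier transform, so equal steps on the output axis correspond roughly to equal perceived pitch steps.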

In step S102 of this embodiment, features are extracted from data of each modality with a corresponding feature extraction technique, specifically as follows:

S10201, text feature extraction: in addition to the BERT model, supplementing the feature representation with traditional NLP methods such as TF-IDF and Word2Vec, so as to capture more context information;

S10202, image feature extraction: in addition to ResNet, introducing convolutional neural network models such as Inception-V3 and VGG16, and selecting the most suitable architecture for each scenario to capture features more accurately;

S10203, audio feature extraction: in addition to the MFCC algorithm, using advanced audio feature extraction techniques such as Perceptual Linear Prediction (PLP) to improve the understanding of speech signals.
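The TF-IDF weighting mentioned in S10201 can be sketched in a few lines. This is the unsmoothed textbook variant (term frequency times the natural logarithm of the inverse document frequency); library implementations often add smoothing and normalization, so their values will differ. The tokenized documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        length = len(tokens)
        vectors.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical tokenized documents.
docs = [["traffic", "jam", "traffic"], ["traffic", "clear"]]
weights = tf_idf(docs)
```

A term that occurs in every document, such as "traffic" here, receives weight zero, which is why the specification pairs TF-IDF with contextual models rather than relying on it alone.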

As shown in fig. 2, the deep learning model in step S2 of this embodiment is specifically as follows:

S2-101, model selection and optimization: processing the fused multi-modal feature representation with the Transformer architecture and its variants (BERT, RoBERTa, T); such models perform well in natural language processing tasks and have been extended to image and audio data; specifically, image features are processed using ViT (Vision Transformer), text content is understood through a pre-trained BERT model, and audio signals are processed using WaveNet;

S2-102, feature interaction and enhancement: to better capture the correlations between modalities, introducing a cross-modal attention mechanism that allows the deep learning model to dynamically adjust the importance weight of each modality according to context, and constructing an association graph between features with graph neural networks (GNNs) to further enhance feature expressiveness.
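The cross-modal attention mechanism of S2-102 can be sketched as scaled dot-product attention in which the query comes from one modality and the keys and values come from another. This is a single-head illustration without the learned projection matrices a real model would use; the vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(query, keys, values):
    """Attend from one modality's query vector over another modality's tokens."""
    scale = math.sqrt(len(query))
    # Similarity of the query to every key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Context vector: attention-weighted sum of the value vectors.
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


# Hypothetical text query attending over two image-region embeddings.
context, weights = cross_modal_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

The attention weights are the dynamically adjusted importance weights that the step describes: they change with the query, so the same image regions can matter more or less depending on the text.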

The reinforcement learning algorithm in step S2 of this embodiment is specifically as follows:

S2-201, parameter adjustment strategy: in addition to gradient descent algorithms (such as Adam and RMSprop), adaptive learning rate methods (such as AdaGrad and Adadelta) and more recent optimizers such as LAMB (Layer-wise Adaptive Moments for Batch training) are considered, so as to improve the efficiency and effectiveness of large-scale distributed training;

S2-202, transfer learning and fine-tuning: for a specific task, a pre-trained model is used as the starting point and adapted quickly to the new task through a fine-tuning strategy, and adversarial training is used to reduce the risk of overfitting and improve generalization;

S2-203, dynamic algorithm selection: based on task requirements and data characteristics, the agent dynamically selects the most suitable learning algorithm from a rich algorithm library; a support vector machine (SVM) or random forest is preferred when processing structured data, and generative adversarial networks (GANs) or diffusion models are used when facing unstructured data, generating high-quality image or video data for data augmentation or anomaly detection.
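Among the optimizers listed in S2-201, AdaGrad has the simplest update rule and shows what an adaptive learning rate means in practice. The sketch below is a plain-Python illustration, assuming the common formulation in which the small constant `eps` is added outside the square root; the parameter and gradient values are hypothetical.

```python
import math


def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """Apply one AdaGrad update in place and return the parameters."""
    for i, g in enumerate(grads):
        # Accumulate the squared gradient for this coordinate.
        accum[i] += g * g
        # The effective step shrinks as the accumulated magnitude grows.
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return params


# Hypothetical single parameter receiving the same gradient twice.
params, accum = [1.0], [0.0]
adagrad_step(params, [2.0], accum)  # first step is close to lr
adagrad_step(params, [2.0], accum)  # second step is smaller
```

Because each coordinate keeps its own accumulator, frequently updated parameters slow down while rarely updated ones keep larger steps, which is the property that later methods such as Adadelta and Adam refine.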

The adaptive learning in step S3 of this embodiment is specifically as follows:

S301, real-time monitoring and analysis: collecting data changes and decision effect indicators in real time through system monitoring tools (Prometheus, Grafana); predicting future trends with a time series analysis method (the ARIMA model) to provide a basis for subsequent model adjustment; creating real-time dashboards with Grafana or other visualization tools to display changes in each key performance indicator (KPI) intuitively; and periodically generating detailed performance analysis reports to help developers understand the running state of the system and make corresponding adjustments; wherein the decision effect indicators comprise hardware performance indicators (CPU utilization, memory usage, network latency and disk I/O) and model performance indicators (accuracy, recall and F1 score);

S302, dynamic adjustment: based on the monitoring results, performing parameter optimization, in which hyperparameter optimization techniques (such as Bayesian optimization and random search) are applied to automatically adjust key parameters of the deep learning model;

S303, anomaly detection: identifying abnormal patterns in the data using an integrated machine learning algorithm (autoencoders) and adjusting the decision strategy in time; once an anomaly is detected, a corresponding response mechanism is started immediately, an alarm is sent through Prometheus to notify the relevant personnel, and the current operation is suspended.
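Of the hyperparameter optimization techniques named in S302, random search is the simplest to illustrate. The sketch below samples each hyperparameter uniformly from a range and keeps the best-scoring configuration. The objective function is a hypothetical stand-in for a validation metric, and the search range is likewise an assumption.

```python
import random


def random_search(objective, space, n_trials=20, seed=0):
    """Sample configurations uniformly and return the best (params, score)."""
    rng = random.Random(seed)  # seeded for reproducible trials
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    """Hypothetical validation score that peaks at learning_rate = 0.01."""
    return -(params["learning_rate"] - 0.01) ** 2


best, score = random_search(toy_objective, {"learning_rate": (0.0, 0.1)}, n_trials=50)
```

Bayesian optimization, the other technique the step names, would replace the uniform sampling with a surrogate model that proposes the next configuration from the scores observed so far.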

The feedback optimization in step S4 of this embodiment is specifically as follows:

S401, feedback collection: providing multi-modal feedback channels that collect ratings (1 to 5 stars), check boxes (correct/incorrect marks) and text comments (with satisfaction extracted by NLP sentiment analysis) through a user interface (UI); processing the real-time feedback stream by receiving feedback events with a Kafka message queue and performing de-duplication (based on UUID), normalization (mapping to the 0-1 interval) and multi-modal association with Flink; and evaluating feedback confidence at the same time, using a GAN to detect fake feedback (such as click-farming) and filter out abnormal data;

S402, performance evaluation: evaluating model performance with quantitative metrics (accuracy, recall, F1 score and mean squared error), and using Jenkins to trigger the LMMs-Eval framework on a schedule to generate a multidimensional evaluation report;

S403, optimization and adjustment: adjusting key model parameters in a targeted manner according to the feedback information and performance evaluation results, searching for a global optimum with advanced optimization methods such as Bayesian optimization, genetic algorithms and neural architecture search (NAS), and, when the existing model is found to perform poorly in certain scenarios, continuously improving decision quality through interaction with the environment using the GRPO algorithm.
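The de-duplication and normalization operations of S401 can be sketched on an in-memory list. This stands in for the Kafka and Flink stream processing described above and does not use either system; the event field names `uuid` and `stars` are hypothetical.

```python
def normalize_rating(stars, low=1, high=5):
    """Map a star rating in [low, high] onto the 0-1 interval."""
    if not low <= stars <= high:
        raise ValueError(f"rating {stars} outside [{low}, {high}]")
    return (stars - low) / (high - low)


def process_feedback(events):
    """Drop events whose UUID was already seen and attach a normalized score."""
    seen = set()
    cleaned = []
    for event in events:
        if event["uuid"] in seen:
            continue  # duplicate delivery of the same feedback event
        seen.add(event["uuid"])
        cleaned.append({**event, "score": normalize_rating(event["stars"])})
    return cleaned


# Hypothetical feedback events; the third repeats the first UUID.
events = [
    {"uuid": "a", "stars": 5},
    {"uuid": "b", "stars": 1},
    {"uuid": "a", "stars": 5},
    {"uuid": "c", "stars": 3},
]
cleaned = process_feedback(events)
```

In a streaming deployment the `seen` set would need to be bounded, for example by a time window, since an unbounded set grows with every event.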

Example 2:

As shown in FIG. 3, this embodiment provides a large-model Agent intelligent decision system fusing multi-modal data, which uses a centralized configuration management system to manage the configuration parameters of every module uniformly and implements a unified logging and monitoring system, wherein the system comprises:

The multi-mode data fusion module in this embodiment includes:

and the feature fusion sub-module is used for fusing the features of different modes to generate a unified feature representation.

The intelligent decision engine in this embodiment includes:

And the decision generation sub-module is used for synthesizing the output of the deep learning model and the reinforcement learning algorithm and generating a final decision result.

The adaptive learning module in this embodiment includes:

an anomaly detection sub-module for identifying an anomaly pattern in the data using an integrated machine learning algorithm.

The feedback optimization module in this embodiment includes:

Example 3:

the embodiment also provides electronic equipment, which comprises a memory and a processor;

wherein the memory stores computer-executable instructions;

and the processor executes the computer-executable instructions stored in the memory, so that the processor performs the large model Agent intelligent decision method fusing multi-modal data in any embodiment of the invention.

The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.

The memory may be used to store computer programs and/or modules, and the processor implements the various functions of the electronic device by running or executing the computer programs and/or modules stored in the memory and by invoking data stored in the memory. The memory may mainly include a program storage area, which may store an operating system, an application program required for at least one function, and the like, and a data storage area, which may store data created through use of the terminal, and the like. The memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card, at least one disk storage device, a flash memory device, or other volatile solid-state memory device.

Example 4:

This embodiment also provides a computer-readable storage medium in which a plurality of instructions are stored, the instructions being loaded by a processor so that the processor performs the large model Agent intelligent decision method fusing multi-modal data in any embodiment of the invention. Specifically, a system or apparatus may be provided with a storage medium on which software program code realizing the functions of any of the above embodiments is stored, and a computer (or CPU or MPU) of the system or apparatus may be caused to read out and execute the program code stored in the storage medium.

In this case, the program code itself read from the storage medium may realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code form part of the present invention.

Examples of storage media for providing program code include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), magnetic tapes, non-volatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer via a communication network.

Further, it should be apparent that the functions of any of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform part or all of the actual operations based on the instructions of the program code.

Further, it is understood that the program code read out from the storage medium may be written into a memory provided on an expansion board inserted into a computer, or into a memory provided in an expansion unit connected to the computer, and a CPU or the like mounted on the expansion board or the expansion unit may then be caused to perform part or all of the actual operations based on the instructions of the program code, thereby realizing the functions of any of the above embodiments.

It should be noted that the above embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that the technical solution described in the above embodiments may be modified or some or all of the technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the scope of the technical solution of the embodiments of the present invention.

Claims

1. A large model Agent intelligent decision method integrating multi-mode data is characterized by comprising the following steps:

Performing intelligent decision making, namely performing decision making reasoning based on the fused characteristic representation, and generating a final decision result by adopting a deep learning model and a reinforcement learning algorithm;

2. The large model Agent intelligent decision method for fusing multi-modal data according to claim 1, wherein the multi-modal data fusion is specifically as follows:

3. The large model Agent intelligent decision method fusing multimodal data as described in claim 2 wherein the preprocessing of the collected different types of data is specifically as follows:

performing word segmentation, part-of-speech tagging and named entity recognition operations by applying a natural language processing technology to text data, and performing deep semantic understanding through a BERT model;

performing basic resizing, cropping and rotation operations on image data by using OpenCV, and performing object detection and classification based on a deep learning method, so as to provide high-quality input for subsequent feature extraction;

Audio feature extraction, namely exploring and using, in addition to the MFCC algorithm, the advanced audio feature extraction technique Perceptual Linear Prediction (PLP), so as to improve the understanding capability of voice signals.

4. The large model Agent intelligent decision method fusing multi-modal data according to claim 1, wherein the deep learning model is specifically as follows:

The deep learning model, which performs well on natural language processing tasks, is extended to the processing of image and audio data; specifically, ViT is used for processing image features, text content is understood through a pre-trained BERT model, and audio signals are processed by WaveNet;

Feature interaction and enhancement, namely introducing a cross-modal attention mechanism, allowing a deep learning model to dynamically adjust importance weights of different modalities according to context, and constructing a correlation map among features by using a graph neural network to further enhance feature expression capability;

the reinforcement learning algorithm is specifically as follows:

the parameter adjustment strategy considers, in addition to the gradient descent algorithm, adaptive learning rate methods and recent optimizers such as LAMB;

Transfer learning and fine-tuning, namely, for a specific task, using a pre-trained model as a starting point, quickly adapting to new task demands through a fine-tuning strategy, and using adversarial training techniques;

Dynamic algorithm selection, namely, the intelligent agent can dynamically select the most suitable learning algorithm from a rich algorithm library based on task requirements and data characteristics: a support vector machine or a random forest is preferentially selected when structured data is processed, and a generative adversarial network or a diffusion model is used to generate high-quality image or video data when unstructured data is faced, so that data enhancement or anomaly detection is realized.
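The cross-modal attention step of claim 4 (adjusting the importance weights of different modalities according to context) can be illustrated with a minimal sketch. It assumes each modality has already been encoded into a vector of the same dimension as a context vector, and it scores each modality by scaled dot-product relevance followed by a softmax. The function names `softmax`, `dot` and `fuse_modalities` are hypothetical and do not come from the patent; the actual model (ViT, BERT, WaveNet and a graph neural network) is not reproduced here.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_modalities(context: List[float],
                    features: Dict[str, List[float]]) -> Dict[str, object]:
    """Weight each modality by its relevance to the context vector.

    Assumption: all feature vectors share the context's dimension.
    """
    names = sorted(features)
    dim = len(context)
    # Scaled dot-product relevance of each modality to the context.
    scores = [dot(context, features[n]) / math.sqrt(dim) for n in names]
    weights = softmax(scores)
    # Fused representation: attention-weighted sum of modality vectors.
    fused = [sum(w * features[n][i] for w, n in zip(weights, names))
             for i in range(dim)]
    return {"weights": dict(zip(names, weights)), "fused": fused}
```

A modality whose features align with the context receives a larger weight, while modalities with equal relevance scores receive equal weights.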

5. The large model Agent intelligent decision method fusing multi-modal data according to claim 1, wherein the adaptive learning is specifically as follows:

Real-time monitoring and analysis, namely collecting data change and decision effect indexes in real time through a system monitoring tool, predicting future trends by a time series analysis method, creating a real-time dashboard by means of Grafana or other visualization tools to intuitively display the changes of each key performance index, and regularly generating a detailed performance analysis report, wherein the decision effect indexes comprise the hardware performance indexes of CPU (central processing unit) utilization rate, memory occupation, network delay and disk I/O, and the model performance indexes of accuracy, recall rate and F1 score;

Dynamic adjustment, namely parameter optimization: automatically adjusting key parameters in the deep learning model by applying a hyperparameter optimization technology based on the monitoring result, and further optimizing the model training process by using PPO and GRPO;

Anomaly detection, namely identifying abnormal patterns in the data by using an ensemble machine learning algorithm and adjusting the decision strategy in time; once an anomaly is detected, a corresponding response mechanism is started immediately, an alarm is sent through Prometheus to inform related personnel, and the current operation is suspended.
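The anomaly detection step of claim 5 can be sketched as a small voting ensemble over a monitored metric such as network delay. The claim does not name the detectors, so the two used here (a median-absolute-deviation score and an interquartile-range fence) are assumptions chosen for illustration, as are the thresholds and the `on_alert` callback, which is a hypothetical stand-in for the Prometheus alarm and the suspension of the current operation.

```python
import statistics
from typing import Callable, List, Optional, Sequence


def mad_flags(values: Sequence[float], threshold: float = 3.5) -> List[bool]:
    """Flag points whose modified z-score (median/MAD based) exceeds threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # Over half the samples equal the median: this detector abstains.
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


def iqr_flags(values: Sequence[float], k: float = 1.5) -> List[bool]:
    """Flag points outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v < low or v > high for v in values]


def detect_anomalies(values: Sequence[float],
                     min_votes: int = 2,
                     on_alert: Optional[Callable[[List[int]], None]] = None
                     ) -> List[int]:
    """Return indices flagged by at least min_votes detectors."""
    votes = [mad_flags(values), iqr_flags(values)]
    flagged = [i for i in range(len(values))
               if sum(d[i] for d in votes) >= min_votes]
    if flagged and on_alert is not None:
        on_alert(flagged)  # hypothetical hook: raise alarm, pause operation
    return flagged
```

Requiring agreement between detectors reduces false alarms at the cost of missing anomalies that only one detector catches; `min_votes=1` gives the opposite trade-off.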

6. The large model Agent intelligent decision method for fusing multi-modal data according to claim 1, wherein the feedback optimization is specifically as follows:

Feedback collection, namely providing a multi-mode feedback channel and collecting scores, check boxes and text evaluations through a user interface; performing real-time feedback stream processing, in which feedback events are received by a Kafka message queue and deduplication, normalization and multi-mode association are performed through Flink; and simultaneously performing feedback confidence evaluation, in which false feedback is detected by a GAN and abnormal data is filtered out;

Optimization and adjustment, namely adjusting key parameters of the model in a targeted manner according to the feedback information and performance evaluation results, and searching for a global optimal solution by adopting advanced optimization methods including Bayesian optimization, genetic algorithms and NAS algorithms.
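The deduplication and normalization of feedback events in claim 6 can be sketched in plain Python. This is only an in-memory illustration: it does not use Kafka or Flink, and the GAN-based detection of false feedback is replaced by a simple range check. The `FeedbackEvent` fields, the 1 to 5 score range and the `clean_feedback` function are assumptions that do not appear in the patent.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class FeedbackEvent:
    """Hypothetical feedback record; field names are illustrative only."""
    user_id: str
    decision_id: str
    score: int          # assumed rating scale: 1 (worst) to 5 (best)
    comment: str = ""


def clean_feedback(events: Iterable[FeedbackEvent],
                   min_score: int = 1,
                   max_score: int = 5) -> List[Dict[str, object]]:
    """Drop out-of-range and duplicate events, then scale scores to [0, 1]."""
    seen = set()
    cleaned: List[Dict[str, object]] = []
    for ev in events:
        # Stand-in for confidence evaluation: reject impossible scores.
        if not (min_score <= ev.score <= max_score):
            continue
        # Deduplicate: keep the first event per (user, decision) pair.
        key = (ev.user_id, ev.decision_id)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({
            "user_id": ev.user_id,
            "decision_id": ev.decision_id,
            "score": (ev.score - min_score) / (max_score - min_score),
            "comment": ev.comment.strip(),
        })
    return cleaned
```

In a streaming deployment the `seen` set would have to be bounded or keyed state managed by the stream processor rather than an unbounded in-process set.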

7. A large model Agent intelligent decision system integrating multi-mode data, characterized in that the system adopts a centralized configuration management system to uniformly manage the configuration parameters of each module and implements a unified logging and monitoring system, and the system comprises:

the intelligent decision engine is used for carrying out decision reasoning based on the fused characteristic representation and generating an optimal decision by adopting a deep learning model and a reinforcement learning algorithm;

8. The large model Agent intelligent decision system fusing multimodal data as described in claim 7 wherein said multimodal data fusion module comprises:

the intelligent decision engine comprises:

The adaptive learning module includes:

the feedback optimization module comprises:

9. An electronic device comprising a memory and at least one processor;

Wherein the memory has a computer program stored thereon;

the at least one processor executing the computer program stored by the memory causes the at least one processor to perform the large model Agent intelligent decision method of fusing multimodal data as claimed in any one of claims 1 to 6.

10. A computer readable storage medium having stored therein a computer program executable by a processor to implement the large model Agent intelligent decision method of fusing multimodal data as claimed in any one of claims 1 to 6.