
CN120431892A - Music data processing method, device, electronic device and storage medium

Info

Publication number: CN120431892A
Application number: CN202510943180.5A
Authority: CN
Inventors: 计紫豪; 张晨; 张迪; 盖坤
Original assignee: Beijing Dajia Internet Information Technology Co Ltd
Current assignee: Beijing Dajia Internet Information Technology Co Ltd
Priority date: 2025-07-09
Filing date: 2025-07-09
Publication date: 2025-08-05
Anticipated expiration: 2045-07-09
Also published as: CN120431892B

Abstract

The present disclosure relates to a music data processing method, device, electronic device, and storage medium. The method includes: obtaining the data set reference feature corresponding to each of a plurality of reference data sets; determining, from the plurality of reference data sets and based on the music style of each source music data item in a source data set, the target reference data set corresponding to that music style; and determining target source music data from the source data set based on feature distance information between the data set reference feature of the target reference data set and the music feature of each source music data item. The sound quality of the target source music data is the same as or similar to that of any reference music data in the target reference data set, and the target source music data is used to construct a music data sample. The present disclosure can improve the sound quality, and hence the data quality, of the music data sample, which in turn improves the generalization ability and generation quality of the music model.

Description

Music data processing method, device, electronic equipment and storage medium

Technical Field

The disclosure relates to the technical field of deep learning, and in particular relates to a music processing method, a device, electronic equipment and a storage medium.

Background

In recent years, rapid advances in artificial intelligence have drawn wide attention to the field of music generation. Music generation models built on large AI models, such as large language models and diffusion models, show great potential for creating musical elements such as melody and harmony. The success of these models depends not only on their architecture but also on high-quality music data as the training basis. Data preprocessing before music generation is a key step in ensuring that a model learns effectively: unprocessed data may contain noise, missing values and similar problems that degrade training, whereas effective preprocessing improves data quality and thus the performance of the music generation model. Although music data processing methods in the related art address the data quality problem to some extent, they still have shortcomings. Exploring a new music data processing method that improves the quality of sample music data, and thereby the generalization capability and generation quality of the model, is therefore an important direction of current research.

Disclosure of Invention

The disclosure provides a music processing method, a device, an electronic device and a storage medium, so as to at least solve the problem of low quality of sample music data in the related art. The technical scheme of the present disclosure is as follows:

According to a first aspect of an embodiment of the present disclosure, there is provided a music data processing method including:

Acquiring data set reference characteristics corresponding to each of a plurality of reference data sets; each reference data set comprises a plurality of pieces of reference music data with the same music style, different reference data sets correspond to different music styles, and the data set reference characteristics corresponding to each reference data set are obtained by fusion processing based on the music characteristics of the plurality of pieces of reference music data in each reference data set;

Determining a target reference data set corresponding to a music style of each item of source music data from the plurality of reference data sets based on the music style of each item of source music data in the source data sets;

And determining target source music data from the source data set based on the data set reference characteristics corresponding to the target reference data set and the characteristic distance information of the music characteristics of each item of source music data, wherein the tone quality of the target source music data is the same as or similar to that of any reference music data in the target reference data set, and the target source music data is used for constructing a music data sample.

In an exemplary embodiment, each item of reference music data includes a reference audio and a reference text corresponding to the reference audio;

the method further comprises the steps of:

extracting audio characteristics of the reference audio in each item of reference music data to obtain reference audio characteristics;

extracting text characteristics from the reference text in each item of reference music data to obtain reference text characteristics;

And mapping the reference audio features and the reference text features to a shared semantic space respectively to obtain the music features of each item of reference music data.
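The mapping step above can be illustrated with a minimal sketch. It assumes that each modality is mapped by its own linear projection followed by L2 normalization, which is one common way to build a shared space; the disclosure does not fix the form of the mapping, and the function names, projection matrices and feature values below are hypothetical.

```python
import math

def project(features, weights):
    """Apply a linear map; `weights` holds one row per output dimension."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def to_shared_space(features, weights):
    """Map modality-specific features into the shared space (assumed linear)."""
    return l2_normalize(project(features, weights))

def cosine_similarity(a, b):
    """Dot product; equals cosine similarity because inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy values: 3-dim audio features, 2-dim text features.
audio_features = [0.5, 1.0, -0.5]
text_features = [1.0, 2.0]
audio_proj = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # maps 3 dims -> 2 dims
text_proj = [[0.5, 0.0], [0.0, 0.5]]             # maps 2 dims -> 2 dims

audio_embedding = to_shared_space(audio_features, audio_proj)
text_embedding = to_shared_space(text_features, text_proj)
similarity = cosine_similarity(audio_embedding, text_embedding)
```

Once both modalities live in the same space, an audio vector and a text vector can be compared directly, which is what makes a later fusion or distance computation meaningful.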

In an exemplary embodiment, the dataset reference features for each reference dataset include an average music feature and an inverse matrix for the covariance feature matrix;

the acquiring the data set reference characteristics corresponding to the plurality of reference data sets respectively comprises the following steps:

For a plurality of pieces of reference music data in each reference data set, carrying out average processing on the music characteristics of the plurality of pieces of reference music data to obtain average music characteristics of the plurality of pieces of reference music data;

obtaining the covariance feature matrix based on the music features of the multiple pieces of reference music data and the average music features of the multiple pieces of reference music data;

And inverting the covariance feature matrix to obtain the inverse matrix corresponding to the covariance feature matrix, wherein the inverse matrix can eliminate the correlation interference among different feature dimensions of the music features of the multiple pieces of reference music data and eliminate the differences in scale among those feature dimensions.
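The three steps of this embodiment (averaging, covariance, inversion) might be sketched as follows in plain Python. The normalization by n - 1 and the use of Gauss-Jordan elimination are assumptions made for illustration, and all function names and feature values are hypothetical.

```python
def mean_vector(features):
    """Average the music feature vectors dimension by dimension."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[k] for f in features) / n for k in range(dim)]

def covariance_matrix(features, mean):
    """Sum the outer products of the centred features, then normalise."""
    n = len(features)
    dim = len(mean)
    cov = [[0.0] * dim for _ in range(dim)]
    for f in features:
        diff = [f[k] - mean[k] for k in range(dim)]
        for r in range(dim):
            for c in range(dim):
                cov[r][c] += diff[r] * diff[c]
    # Assumption: unbiased normalisation by (n - 1); the source does not say.
    return [[v / (n - 1) for v in row] for row in cov]

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    dim = len(matrix)
    aug = [list(row) + [1.0 if r == c else 0.0 for c in range(dim)]
           for r, row in enumerate(matrix)]
    for col in range(dim):
        pivot = max(range(col, dim), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(dim):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[dim:] for row in aug]

# Hypothetical 2-dim music features of four reference items of one style.
reference_features = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0], [3.0, 6.0]]
mean = mean_vector(reference_features)
cov = covariance_matrix(reference_features, mean)
inv_cov = invert(cov)
```

With real embeddings the covariance matrix can be singular or ill-conditioned when the number of reference items is small relative to the feature dimension, which is why the sketch raises an error instead of returning an unreliable inverse.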

In an exemplary embodiment, before the determining the target source music data from the source data set based on the data set reference feature corresponding to the target reference data set and the feature distance information of the music feature of each item of source music data, the method further includes:

Performing feature difference processing based on the music features of each item of source music data and the average music features corresponding to the target reference data set to obtain feature difference information;

And carrying out product processing based on the transpose of the characteristic difference information, the inverse matrix corresponding to the covariance characteristic matrix and the characteristic difference information to obtain the characteristic distance information.
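The difference-then-product computation described above has the same form as a squared Mahalanobis distance. A minimal sketch follows; the function name and the numeric values are hypothetical, and whether a square root is applied afterwards is not stated in the disclosure, so none is taken here.

```python
def feature_distance(source_feature, mean, inv_cov):
    """Compute diff^T * inv_cov * diff for one source music feature."""
    # Feature difference between the source item and the average music feature.
    diff = [s - m for s, m in zip(source_feature, mean)]
    # Inverse covariance matrix applied to the difference vector.
    weighted = [sum(w * d for w, d in zip(row, diff)) for row in inv_cov]
    # Transposed difference times the weighted difference gives a scalar.
    return sum(d * w for d, w in zip(diff, weighted))

# Hypothetical reference statistics of one target reference data set.
mean = [3.0, 3.0]
inv_cov = [[0.5, 0.0], [0.0, 2.0]]

near = feature_distance([3.0, 3.5], mean, inv_cov)  # close to the reference mean
far = feature_distance([7.0, 3.0], mean, inv_cov)   # far from the reference mean
```

A smaller value means the source item sits closer to the reference set in feature space, so `near` would pass a threshold that `far` fails.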

In an exemplary embodiment, the method further comprises:

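Extracting music aesthetic indexes from each item of source music data in the source data set to obtain the music aesthetic index of each item of source music data, wherein the music aesthetic index comprises the whole music property of each item of source music data, and the whole music property characterizes, as assessed by listening, how natural and harmonious the fusion among a plurality of music components in each item of source music data is;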

Extracting audio aesthetic indexes from each item of source music data in the source data set to obtain the audio aesthetic indexes of each item of source music data, wherein the audio aesthetic indexes comprise content pleasure degrees of each item of source music data;

Obtaining a target aesthetic index of each item of source music data based on an index data fusion result of the whole music property and the content pleasure degree;

The determining target source music data from the source data set based on the data set reference feature corresponding to the target reference data set and the feature distance information of the music feature of each item of source music data includes:

And determining the target source music data from the source data set based on the characteristic distance information and the target aesthetic index of each item of source music data, wherein the target aesthetic index of the target source music data is larger than or equal to a preset aesthetic index.
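A minimal sketch of this combined screening is given below. It assumes that an item must satisfy both a preset distance threshold and the preset aesthetic index at the same time; the field names, thresholds and values are hypothetical.

```python
def select_target_items(items, max_distance, min_aesthetic):
    """Keep items that are close to the reference set and aesthetic enough."""
    return [
        item for item in items
        if item["feature_distance"] <= max_distance
        and item["target_aesthetic"] >= min_aesthetic
    ]

# Hypothetical source items with precomputed distance and aesthetic index.
items = [
    {"name": "clip_a", "feature_distance": 2.0, "target_aesthetic": 7.5},
    {"name": "clip_b", "feature_distance": 9.0, "target_aesthetic": 8.0},
    {"name": "clip_c", "feature_distance": 1.5, "target_aesthetic": 4.0},
    {"name": "clip_d", "feature_distance": 3.0, "target_aesthetic": 6.0},
]

selected = select_target_items(items, max_distance=3.0, min_aesthetic=6.0)
selected_names = [item["name"] for item in selected]
```

Here `clip_b` is rejected for being too far from the reference set and `clip_c` for its low aesthetic index, while `clip_d` is kept because both comparisons are inclusive.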

In an exemplary embodiment, the music aesthetic index further comprises musical coherence, memorability, vocal naturalness and structural clarity, and the audio aesthetic index further comprises production quality, production complexity and content usefulness;

The obtaining the target aesthetic index of each item of source music data based on the index data fusion result of the whole music property and the content pleasure degree comprises the following steps:

performing index data fusion on the whole music property and the content pleasure degree to obtain a fused aesthetic index;

And carrying out index data fusion on the fused aesthetic index, the musical coherence, the memorability, the vocal naturalness, the structural clarity, the production quality, the production complexity and the content usefulness to obtain the target aesthetic index of each item of source music data.
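The two-stage fusion can be sketched as below. The disclosure does not specify the fusion operator, so an unweighted arithmetic mean is assumed purely for illustration; the dictionary keys are hypothetical English names for the indexes listed above, and the scores are made up.

```python
def fuse(values):
    """Assumed fusion operator: unweighted arithmetic mean."""
    return sum(values) / len(values)

def target_aesthetic_index(music_index, audio_index):
    """Fuse the two headline indexes first, then fuse with the remaining ones."""
    # Stage 1: whole music property with content pleasure degree.
    fused = fuse([music_index["musicality"], audio_index["content_enjoyment"]])
    # Stage 2: fused index with the remaining music and audio aesthetic indexes.
    remaining = [music_index[key] for key in
                 ("coherence", "memorability", "vocal_naturalness", "structural_clarity")]
    remaining += [audio_index[key] for key in
                  ("production_quality", "production_complexity", "content_usefulness")]
    return fuse([fused] + remaining)

# Hypothetical scores for one item of source music data.
music_index = {"musicality": 8.0, "coherence": 7.0, "memorability": 6.0,
               "vocal_naturalness": 8.0, "structural_clarity": 7.0}
audio_index = {"content_enjoyment": 6.0, "production_quality": 8.0,
               "production_complexity": 5.0, "content_usefulness": 7.0}

target_index = target_aesthetic_index(music_index, audio_index)
```

A weighted mean or a learned combination would fit the same two-stage shape; only the body of `fuse` would change.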

In an exemplary embodiment, the method further comprises: when any item of source music data in the source data set has a single channel, carrying out channel copying processing on that item of source music data to obtain dual-channel source music data.
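Channel copying can be sketched as duplicating the single channel so that downstream steps always see two channels. The representation of audio as a list of per-channel sample lists, and the function name, are assumptions for illustration.

```python
def to_dual_channel(channels):
    """Return two-channel audio; a single channel is copied into both."""
    if len(channels) == 1:
        # Copy the samples so the two channels do not share one list object.
        return [list(channels[0]), list(channels[0])]
    return channels

mono = [[0.1, -0.2, 0.3]]                        # hypothetical single-channel clip
stereo = to_dual_channel(mono)                   # both channels carry the same samples
already_stereo = to_dual_channel([[0.1], [0.2]]) # left untouched
```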

According to a second aspect of the embodiments of the present disclosure, a training method for a music generation model is provided, which is implemented based on a music data sample, where the music data sample is obtained based on the above-mentioned music data processing method and includes a sample music tag, sample music lyrics and sample music audio, and the training method includes:

Inputting the sample music tag and the sample music lyrics into a preset music generation model to obtain predicted music audio;

Determining loss information based on the sample music audio and the predicted music audio;

and updating the preset music generation model based on the loss information to obtain a target music generation model.
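The predict, compare and update cycle can be illustrated with a deliberately tiny stand-in. The sketch assumes a linear model, a mean squared error loss and plain gradient descent, none of which is specified by the disclosure; the conditioning vector stands in for an encoding of the sample music tag and sample music lyrics, and every value is hypothetical.

```python
def predict(weights, condition):
    """Toy 'generation': one linear output per audio value."""
    return [sum(w * c for w, c in zip(row, condition)) for row in weights]

def mse_loss(predicted, target):
    """Loss information: mean squared error between prediction and sample."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def update(weights, condition, predicted, target, lr):
    """One gradient-descent step on the mean squared error."""
    n = len(target)
    return [[w - lr * (2.0 / n) * (p - t) * c for w, c in zip(row, condition)]
            for row, p, t in zip(weights, predicted, target)]

condition = [1.0, 0.5]           # stand-in encoding of tag and lyrics
sample_audio = [0.2, -0.4, 0.6]  # stand-in for the sample music audio
weights = [[0.0, 0.0] for _ in sample_audio]

losses = []
for _ in range(50):
    predicted_audio = predict(weights, condition)
    loss = mse_loss(predicted_audio, sample_audio)
    losses.append(loss)
    weights = update(weights, condition, predicted_audio, sample_audio, lr=0.1)
```

A real music generation model would replace `predict` with a large network and the loss with whatever objective suits its architecture, but the loop keeps the same three steps.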

In an exemplary embodiment, the method further comprises:

The method comprises the steps of obtaining music data to be generated, wherein the music data to be generated comprises a music tag of the music to be generated and lyrics of the music to be generated, or the music data to be generated comprises the music tag of the music to be generated;

Inputting the music tag of the music to be generated and the lyrics of the music to be generated into the target music generation model, or inputting only the music tag of the music to be generated into the target music generation model, for music generation processing to obtain target audio of the music to be generated, wherein the target audio comprises the lyrics of the music to be generated when the music data to be generated comprises the lyrics of the music to be generated.

According to a third aspect of the embodiments of the present disclosure, there is provided a music data processing apparatus including:

A reference feature acquisition unit configured to perform acquisition of a data set reference feature corresponding to each of a plurality of reference data sets, each reference data set including a plurality of pieces of reference music data of the same music style, different reference data sets corresponding to different music styles, the data set reference feature corresponding to each reference data set being obtained by fusion processing based on the music feature of the plurality of pieces of reference music data in each reference data set, the plurality of pieces of reference music data in each reference data set satisfying a target sound quality condition;

a reference data set determination unit configured to perform determination of a target reference data set corresponding to a music style of each item of source music data from among the plurality of reference data sets based on the music style of each item of source music data in the source data sets;

And a sample construction unit configured to perform determination of target source music data from the source data set based on the data set reference feature corresponding to the target reference data set and the feature distance information of the music feature of each item of source music data, the target source music data being used for constructing a music data sample, the sound quality of the target source music data being the same as or similar to the sound quality of any one of the reference music data in the target reference data set.

the apparatus further includes a music feature extraction unit configured to perform:

the reference feature acquisition unit is configured to perform:

In an exemplary embodiment, the apparatus further comprises a distance determination unit configured to perform:

In an exemplary embodiment, the apparatus further comprises a target aesthetic index determination unit configured to perform:

The sample construction unit is configured to perform:

The target aesthetic index determination unit is configured to perform:

In an exemplary embodiment, the apparatus further comprises a channel processing unit configured to perform:

According to a fourth aspect of the embodiments of the present disclosure, there is provided a training apparatus for a music generation model, which is implemented based on a music data sample, where the music data sample is obtained based on the above-mentioned music data processing method and includes a sample music tag, sample music lyrics, and sample music audio, and the training apparatus includes:

an audio prediction unit configured to perform inputting the sample music tag and the sample music lyrics into a preset music generation model to obtain predicted music audio;

a loss information determination unit configured to perform determination of loss information based on the sample music audio and the predicted music audio;

And the model updating unit is configured to update the preset music generation model based on the loss information to obtain a target music generation model.

In an exemplary embodiment, the apparatus further comprises a music generation unit configured to perform:

According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device comprising a processor, a memory for storing instructions executable by the processor, wherein the processor is configured to execute the instructions to implement a music data processing method or a music generation model training method as described above.

According to a sixth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the music data processing method or the music generation model training method as described above.

According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program stored in a readable storage medium, from which at least one processor of a computer device reads and executes the computer program, causing the device to perform the above-described music data processing method or music generation model training method.

The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:

Data set reference features corresponding to reference data sets of different styles are obtained, a target reference data set whose style matches each item of source music data in the source data set is determined, and target source music data is then determined from the source data set based on feature distance information between the data set reference feature of the target reference data set and the music feature of each item of source music data. Because the multiple pieces of reference music data in each reference data set satisfy the target sound quality condition, when the feature distance information between the music feature of the target source music data and the data set reference feature of the target reference data set is less than or equal to a preset distance threshold, the sound quality of the target source music data can be taken to be the same as or similar to that of any reference music data in the target reference data set. In other words, the feature distance information makes it possible to screen out, from the source data set, target source music data whose sound quality meets or is close to the target sound quality condition. Constructing the music data sample from the target source music data so determined improves the sound quality of the music data sample, and hence its data quality, which in turn improves the generalization capability and generation quality of the music model.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.

Fig. 1 is a schematic diagram of an implementation environment, shown according to an exemplary embodiment.

Fig. 2 is a flowchart illustrating a music processing method according to an exemplary embodiment.

Fig. 3 is a flowchart illustrating a method for determining feature distance information of a dataset reference feature and a music feature according to an exemplary embodiment.

Fig. 4 is a flowchart illustrating a method of music data screening based on aesthetic indicators of source music data, according to an exemplary embodiment.

Fig. 5 is a diagram illustrating training of a music generation model according to an exemplary embodiment.

Fig. 6 is a schematic diagram of a music data processing system according to an exemplary embodiment.

Fig. 7 is a block diagram of a music data processing apparatus according to an exemplary embodiment.

Fig. 8 is a block diagram illustrating a music generation model training apparatus, according to an exemplary embodiment.

Fig. 9 is a schematic diagram of an electronic device according to an exemplary embodiment.

Fig. 10 is a schematic diagram of another electronic device configuration shown according to an exemplary embodiment.

Detailed Description

In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.

It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for presentation, analyzed data, etc.) related to the present disclosure are information and data authorized by the user or sufficiently authorized by each party.

Referring to fig. 1, a schematic diagram of an implementation environment provided by an embodiment of the disclosure may include at least one user terminal 110 and a music data server 120, where the user terminal 110 and the music data server 120 may communicate data through a network.

Specifically, the user terminal 110 may submit a music data processing request to the music data server 120. The request may include a source data set and asks the music data server 120 to screen out, from the source data set, target source music data whose sound quality meets or is close to a target sound quality condition, and to construct a music data sample based on that target source music data. The music data server 120 may acquire the data set reference features corresponding to reference data sets of different styles, determine for each item of source music data in the source data set the target reference data set matching its style, and then determine the target source music data from the source data set based on feature distance information between the data set reference feature of the target reference data set and the music feature of each item of source music data. Since the plurality of pieces of reference music data in each reference data set satisfy the target sound quality condition, when the feature distance information between the music feature of the target source music data and the data set reference feature of the target reference data set is less than or equal to a preset distance threshold, the sound quality of the target source music data can be taken to be the same as or similar to that of any one of the reference music data in the target reference data set.

The user terminal 110 may communicate with the music data server 120 in Browser/Server (B/S) or Client/Server (C/S) mode. The user terminal 110 may include a smart phone, a tablet computer, a notebook computer, a digital assistant, a smart wearable device, an in-vehicle terminal, or other types of physical devices, or may include software running in such devices, such as an application program. Operating systems running on the user terminal 110 in embodiments of the present disclosure may include, but are not limited to, Android, iOS, Linux, Windows, and the like.

The music data server 120 and the user terminal 110 may establish a communication connection by wire or wirelessly. The music data server 120 may include an independently operating server, a distributed server, or a server cluster formed by a plurality of servers, where the servers may be cloud servers.

In order to solve the problem of low quality of sample music data in the related art, an embodiment of the present disclosure provides a music data processing method. The execution subject of the method may be the user terminal or the music data server. Specifically, referring to Fig. 2, the method may include:

S210, acquiring data set reference characteristics corresponding to a plurality of reference data sets respectively, wherein each reference data set comprises a plurality of reference music data with the same music style, different reference data sets correspond to different music styles, the data set reference characteristics corresponding to each reference data set are obtained by fusion processing based on the music characteristics of the plurality of reference music data in each reference data set, and the plurality of reference music data in each reference data set meet target tone quality conditions.

The music styles in this embodiment may include styles such as ballad, pop, rock, rap, classical, jazz, etc., and the respective different music styles may have different metrics for sound quality, or the different music styles may have different sound quality characteristics, so that respective reference data sets may be created for the different music styles, respectively, each reference data set including a plurality of reference music data of the same music style, the different reference data sets corresponding to different music styles, and the respective reference data sets may include reference data sets corresponding to the first music style, reference data sets corresponding to the second music style, reference data sets corresponding to the third music style, etc.

By performing fusion processing on the music features of the plurality of pieces of reference music data having the same music style in each reference data set, the data set reference feature corresponding to that reference data set can be obtained, so that the data set reference feature characterizes the overall music features of the reference music data in the reference data set, each piece of which satisfies the target sound quality condition. Sound quality refers to the perceived characteristics of sound, which relate to its purity, clarity, balance, fidelity, and so on; sound quality may be, for example, high definition, live, or low definition. In this embodiment, the music data satisfying the target sound quality condition may be music data with high-definition sound quality, or music data with near-high-definition sound quality, etc.

S220, determining a target reference data set corresponding to the music style of each item of source music data from the plurality of reference data sets based on the music style of each item of source music data in the source data sets.

Each item of source music data may carry a music style tag, so that the target reference data set corresponding to its music style can be determined directly from the tag. For example, if the music style tag is the first music style, the reference data set corresponding to the first music style is determined as the target reference data set. In addition, in this embodiment, source music data of the same music style in the source data set may be input in batches, so that the source music data of each music style is processed as a batch. Specifically, the feature distance information between the music features of the source music data of the first music style in the source data set and the data set reference feature of the reference data set of the first music style can be calculated, and target source music data determined from the source music data of the first music style; likewise, the feature distance information between the music features of the source music data of the second music style and the data set reference feature of the reference data set of the second music style can be calculated. Source music data of different music styles can thus be processed in parallel, which improves the processing efficiency of the music data.
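The tag-based routing and per-style batching described above might look like the following sketch. The dictionary layout, the style names and the decision to skip items whose style has no reference data set are assumptions made for illustration.

```python
from collections import defaultdict

def group_by_style(source_items):
    """Collect source items into one batch per music style tag."""
    groups = defaultdict(list)
    for item in source_items:
        groups[item["style"]].append(item)
    return dict(groups)

def match_reference_sets(source_items, reference_sets):
    """Pair each style batch with the reference data set of the same style."""
    batches = {}
    for style, items in group_by_style(source_items).items():
        # Assumption: items whose style has no reference data set are skipped.
        if style in reference_sets:
            batches[style] = (reference_sets[style], items)
    return batches

# Hypothetical source items and reference data sets keyed by style.
source_items = [
    {"id": "s1", "style": "rock"},
    {"id": "s2", "style": "jazz"},
    {"id": "s3", "style": "rock"},
    {"id": "s4", "style": "folk"},
]
reference_sets = {"rock": "rock_reference_set", "jazz": "jazz_reference_set"}

batches = match_reference_sets(source_items, reference_sets)
```

Because each batch depends only on its own reference data set, the batches can be handed to separate workers, which is the parallelism the paragraph refers to.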

And S230, determining target source music data from the source data set based on the data set reference characteristics corresponding to the target reference data set and the characteristic distance information of the music characteristics of each item of source music data, wherein the tone quality of the target source music data is the same as or similar to that of any one of the reference music data in the target reference data set, and the target source music data is used for constructing a music data sample.

The feature distance information between the data set reference feature corresponding to the target reference data set and the music feature of each item of source music data is calculated, and an item of source music data is determined as target source music data when its feature distance information is less than or equal to a preset distance threshold. It can be understood that target source music data whose feature distance information to the data set reference feature is less than or equal to the preset distance threshold is similar or identical to the plurality of pieces of reference music data in the corresponding target reference data set, so its sound quality is the same as or similar to theirs; that is, the sound quality of the target source music data meets or is close to the target sound quality condition.

In an optional embodiment, each item of reference music data includes a reference audio and a reference text corresponding to the reference audio;

the method further comprises the steps of:

The reference text corresponding to the reference audio may comprise text description information of the reference audio, such as the audio style and the audio lyrics. The reference audio and the corresponding reference text can be mapped to a shared semantic space so that the vectors of semantically similar audio-text pairs lie close to each other. Music feature extraction on the reference music data involves audio processing and text processing: the reference audio is segmented into at least one audio segment, with every n seconds of audio converted into a Mel spectrogram, and an audio encoder encodes the Mel spectrogram to obtain the reference audio features; a text encoder encodes the reference text to obtain the reference text features. Feature mapping is then performed on the reference audio features and the reference text features so that each is mapped to the shared semantic space. Specifically, a joint mapping layer may perform the feature mapping on the reference audio features and the reference text features, and the joint mapping layer may be fine-tuned using a contrastive loss. Feature fusion of the reference audio features and the reference text features can be realized by a weighted summation of the two.
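Two of the simpler operations in this paragraph, cutting audio into n-second segments and fusing the mapped features by weighted summation, can be sketched as follows. Keeping a shorter final segment without padding and the default weight of 0.5 are assumptions, and the encoders and the Mel spectrogram conversion are left out because the disclosure does not specify them.

```python
def segment_audio(samples, sample_rate, seconds):
    """Split a sample sequence into consecutive segments of `seconds` each."""
    size = int(sample_rate * seconds)
    if size <= 0:
        raise ValueError("segment length must cover at least one sample")
    # Assumption: a shorter trailing segment is kept as is, without padding.
    return [samples[i:i + size] for i in range(0, len(samples), size)]

def fuse_features(audio_embedding, text_embedding, audio_weight=0.5):
    """Weighted sum of two embeddings that live in the same shared space."""
    text_weight = 1.0 - audio_weight
    return [audio_weight * a + text_weight * t
            for a, t in zip(audio_embedding, text_embedding)]

# Hypothetical toy audio: 10 samples at 4 samples per second, 1-second segments.
segments = segment_audio(list(range(10)), sample_rate=4, seconds=1)

# Hypothetical embeddings already mapped into the shared semantic space.
music_feature = fuse_features([1.0, 0.0], [0.0, 1.0], audio_weight=0.75)
```

The weighted sum is only well defined after both embeddings have the same dimensionality, which is one practical reason for mapping them into a shared space first.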

In this embodiment, the reference audio and the reference text corresponding to the reference audio are respectively extracted to realize multi-modal feature expression of the reference music data, so that the feature expression capability of the reference music data is improved, and further, the accuracy of the processing result of the music data can be improved under the condition that the subsequent processing is performed by using the music features of the reference music data.
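The weighted-summation fusion mentioned above can be illustrated with a minimal sketch. It assumes the audio and text features have already been mapped to a shared space of equal dimension; the function names, the L2 normalization step and the default weight of 0.5 are illustrative assumptions, not details given in this disclosure.

```python
from math import sqrt


def l2_normalize(vec):
    """Scale a vector to unit length (assumed preprocessing, not from the source)."""
    norm = sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return list(vec)
    return [v / norm for v in vec]


def fuse_features(audio_feat, text_feat, audio_weight=0.5):
    """Weighted sum of an audio feature and a text feature of equal dimension."""
    if len(audio_feat) != len(text_feat):
        raise ValueError("features must share the same dimension")
    a = l2_normalize(audio_feat)
    t = l2_normalize(text_feat)
    w = audio_weight  # hypothetical weight; the text weight is 1 - w
    return [w * ai + (1.0 - w) * ti for ai, ti in zip(a, t)]


# Toy 2-dimensional vectors standing in for encoder outputs.
music_feature = fuse_features([3.0, 4.0], [0.0, 2.0], audio_weight=0.5)
```

In practice the weight would be tuned or learned; the sketch only shows the shape of the computation.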

In another alternative embodiment, the data set reference features corresponding to each reference data set include an average music feature and an inverse matrix corresponding to a covariance feature matrix, and accordingly, referring to fig. 3, the present embodiment provides a method for determining feature distance information between the data set reference features and the music feature, which may include:

S310, for the plurality of pieces of reference music data in each reference data set, averaging the music features of the plurality of pieces of reference music data to obtain the average music feature of the plurality of pieces of reference music data.

A reference data set may include n pieces of reference music data, and the music feature of each piece of reference music data may specifically be a music feature vector, so the average music feature can be calculated by the following formula:

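$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{1}$$

where $x_i$ denotes the music feature of the $i$-th of the $n$ pieces of reference music data, and $\mu$ denotes the average music feature.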

S320, obtaining the covariance feature matrix based on the music features of the multiple pieces of reference music data and the average music feature of the multiple pieces of reference music data.

Specifically, the covariance feature matrix can be calculated by the following formula:

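$$\Sigma = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^{T} \tag{2}$$

where $\Sigma$ is the covariance feature matrix, $\mu$ is the average music feature, and $(x_i - \mu)^{T}$ is the transpose of $(x_i - \mu)$.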

S330, performing inverse transformation on the covariance feature matrix to obtain an inverse matrix corresponding to the covariance feature matrix, wherein the inverse matrix corresponding to the covariance feature matrix can eliminate correlation interference of the music features of the multiple pieces of reference music data in different feature dimensions and eliminate dimension differences of the music features of the multiple pieces of reference music data in different feature dimensions.

The inverse of the covariance feature matrix can be obtained by the following formula:

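$$\Sigma^{-1} = \operatorname{inv}(\Sigma), \qquad \Sigma^{-1}\,\Sigma = I \tag{3}$$

where $\Sigma$ is the covariance feature matrix, $\Sigma^{-1}$ is its inverse matrix, and $I$ is the identity matrix.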

In this embodiment, the inverse matrix of the covariance feature matrix is calculated so that the correlation and scale differences between different feature dimensions of the reference music data in each reference data set are taken into account. Specifically, if the data variables in the reference data set are highly correlated, the covariance is large, and dividing by the larger covariance effectively shortens the distance; if the data variables are uncorrelated, the covariance is not large and the distance is not reduced much. The inverse matrix of the covariance feature matrix also adjusts the weight of each dimension: if the variability of a dimension is large, the corresponding variance is large and a difference in that dimension is given less weight; conversely, if the variability of a dimension is small, a difference in that dimension is given more weight. Even if different feature dimensions have different units and scales, the inverse matrix allows them to be compared fairly. Calculating the inverse matrix of the covariance feature matrix therefore eliminates the correlation among variables and improves the robustness of subsequent music data processing.

S340, carrying out feature difference processing based on the music feature of each item of source music data and the average music feature corresponding to the target reference data set to obtain feature difference information.

For the method of extracting the music feature of each item of source music data, reference may be made to the method of extracting the music feature of the reference music data in this embodiment, which is not repeated here.

S350, carrying out product processing based on the transpose of the feature difference information, the inverse matrix corresponding to the covariance feature matrix and the feature difference information to obtain the feature distance information.

The feature distance information for any source music data and the dataset reference feature can be calculated based on the following formula:

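$$d(y) = (y - \mu)^{T}\,\Sigma^{-1}\,(y - \mu) \tag{4}$$

where $(y - \mu)$ is the feature difference information between the music feature $y$ of the source music data and the average music feature $\mu$, $(y - \mu)^{T}$ is the transpose of $(y - \mu)$, and $\Sigma^{-1}$ is the inverse matrix corresponding to the covariance feature matrix.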

In this embodiment, the inverse matrix corresponding to the covariance feature matrix is introduced when calculating the feature distance information between each item of source music data and the data set reference feature corresponding to the target reference data set. Because this inverse matrix eliminates the correlation between variables, performing the feature distance calculation with it improves the accuracy of the feature distance calculation result.
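Steps S310 to S350 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of this disclosure: it uses NumPy, normalizes the covariance by n - 1 (the usual sample covariance), computes the product form of step S350 without a square root, and uses hypothetical function names and a hypothetical threshold value.

```python
import numpy as np


def dataset_reference_features(ref_features):
    """S310-S330: average feature and inverse covariance of one reference data set."""
    x = np.asarray(ref_features, dtype=float)        # shape (n, d)
    mu = x.mean(axis=0)                              # S310: average music feature
    centered = x - mu
    cov = centered.T @ centered / (x.shape[0] - 1)   # S320: covariance (n - 1 assumed)
    cov_inv = np.linalg.inv(cov)                     # S330: raises if cov is singular
    return mu, cov_inv


def feature_distance(y, mu, cov_inv):
    """S340-S350: (y - mu)^T * cov_inv * (y - mu)."""
    diff = np.asarray(y, dtype=float) - mu           # S340: feature difference
    return float(diff @ cov_inv @ diff)              # S350: product processing


def select_target_sources(source_features, mu, cov_inv, threshold):
    """Indices of source items whose distance is within the preset threshold."""
    return [i for i, y in enumerate(source_features)
            if feature_distance(y, mu, cov_inv) <= threshold]


# Toy 2-dimensional features; real embeddings would be much larger.
mu, cov_inv = dataset_reference_features([[0, 0], [2, 0], [0, 2], [2, 2]])
kept = select_target_sources([[1, 1], [3, 1], [1.5, 1]], mu, cov_inv, threshold=1.0)
```

With high-dimensional embeddings and few reference songs the covariance matrix can be singular; a pseudo-inverse or a small diagonal regularization term is a common workaround, though the disclosure does not specify one.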

Specifically, aesthetic index extraction can be performed on each item of source music data in the source data set so that the source music data can be screened according to the extracted aesthetic index; the aesthetic index of music data characterizes the listenability of the music data as evaluated from the perspective of human perception. It should be noted that this embodiment does not limit the order of the aesthetic index extraction operation and the sound quality analysis operation. The aesthetic index extraction operation may be performed first to screen out source music data whose aesthetic index is greater than or equal to a preset aesthetic index, and the sound quality analysis operation may then be performed on that source music data to obtain source music data whose sound quality meets or is close to the target sound quality condition. Alternatively, the sound quality analysis operation may be performed first to obtain source music data whose sound quality meets or is close to the target sound quality condition, and the aesthetic index extraction operation may then be performed on that source music data to screen out source music data whose aesthetic index is greater than or equal to the preset aesthetic index. That is, the result of either operation can serve as the basis for the other, so that source music data meeting both the preset sound quality condition and the preset aesthetic index is screened out.
Accordingly, referring to fig. 4, a method of music data screening based on aesthetic indicators of source music data is shown, which may include:

S410, extracting music aesthetic indexes of each item of source music data in the source data set to obtain the music aesthetic indexes of each item of source music data, wherein the music aesthetic indexes comprise the whole music property of each item of source music data, and the whole music property characterizes the degree of naturalness and harmony of the fusion among a plurality of music components in each item of source music data, as evaluated from the perspective of auditory perception.

The music aesthetic index is an aesthetic index for music. The plurality of music components in each item of source music data may comprise at least two of melody, harmony, soundtrack, vocals, accompaniment and the like, so the whole music property of the source music data characterizes the degree of naturalness and harmony of the fusion between at least two of these components as evaluated from the perspective of auditory perception; that is, it evaluates the overall pleasantness of the music.

S420, extracting audio aesthetic indexes of each item of source music data in the source data set to obtain the audio aesthetic indexes of each item of source music data, wherein the audio aesthetic indexes comprise content pleasure degree of each item of source music data.

The audio aesthetic index is an aesthetic index for audio, where audio may include music, recitations, etc.; it is thus an index for the aesthetic evaluation of audio-type data such as music and recitations. The content pleasure degree of each item of source music data can be determined based on emotional impact, artistic skill presentation, depth of artistic expression and comprehensive subjective experience.

S430, obtaining the target aesthetic index of each item of source music data based on the index data fusion result of the whole music property and the content pleasure degree.

The music aesthetic index evaluation and the audio aesthetic index evaluation are performed on the music data by two different aesthetic evaluation systems, which yield the whole music property and the content pleasure degree respectively. Because the whole music property and the content pleasure degree are correlated, these two aesthetic indexes obtained from the two evaluation systems can be fused into a fused aesthetic index, from which the target aesthetic index of each item of source music data is obtained.

S440, determining the target source music data from the source data set based on the characteristic distance information and the target aesthetic index of each item of source music data, wherein the target aesthetic index of the target source music data is larger than or equal to a preset aesthetic index.

Furthermore, the order of the aesthetic index extraction operation and the sound quality analysis operation on the source music data is not limited: the result of one operation can serve as the basis for the other, so the two operations need not each be performed on every item of source music data in the source data set. This reduces the amount of music data to be processed and improves the processing efficiency of the music data.
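The order independence described above can be illustrated with a small sketch: whichever check runs first, the second metric is only evaluated for the items that survive the first. The function and parameter names, the record layout and the threshold values are hypothetical and chosen only for illustration.

```python
def screen_sources(items, distance_of, aesthetic_of,
                   max_distance, min_aesthetic, aesthetic_first=True):
    """Keep items that pass both checks; the second check only sees survivors."""
    if aesthetic_first:
        survivors = [it for it in items if aesthetic_of(it) >= min_aesthetic]
        return [it for it in survivors if distance_of(it) <= max_distance]
    survivors = [it for it in items if distance_of(it) <= max_distance]
    return [it for it in survivors if aesthetic_of(it) >= min_aesthetic]


# Precomputed toy values stand in for the two analysis operations.
catalog = [
    {"id": "a", "distance": 0.2, "aesthetic": 4.1},
    {"id": "b", "distance": 0.9, "aesthetic": 4.5},
    {"id": "c", "distance": 0.1, "aesthetic": 2.0},
]
by_aesthetic_first = screen_sources(
    catalog, lambda it: it["distance"], lambda it: it["aesthetic"],
    max_distance=0.5, min_aesthetic=3.0, aesthetic_first=True)
by_distance_first = screen_sources(
    catalog, lambda it: it["distance"], lambda it: it["aesthetic"],
    max_distance=0.5, min_aesthetic=3.0, aesthetic_first=False)
```

Both orders keep the same items; which order is cheaper depends on which operation is more expensive and which filter removes more data.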

In a specific embodiment, the music aesthetic index in this embodiment further includes whole music consistency, memorability, voice naturalness and structural definition, and the audio aesthetic index further includes production quality, production complexity and content utility. Correspondingly, step S430 of obtaining the target aesthetic index of each item of source music data based on the index data fusion result of the whole music property and the content pleasure degree includes:

S432, carrying out index data fusion on the whole music property and the content pleasure degree to obtain a fused aesthetic index.

S434, carrying out index data fusion on the fused aesthetic index, the whole music consistency, the memorability, the voice naturalness, the structural definition, the production quality, the production complexity and the content utility to obtain the target aesthetic index of each item of source music data.

The whole music consistency characterizes the fluency of the transitions between music sections (intro, verse, chorus, bridge, outro, etc.) and the unity of dynamics and emotion. The memorability characterizes the recognizability and appeal of the melody or rhythm in the music data, that is, whether the music data contains an element, such as a melody or rhythm, that can be remembered after a single listen. The voice naturalness characterizes the reasonableness of breathing and phrasing in the singing, including whether the phrasing conforms to semantic logic and rhythm and whether the breathing affects the fluency of the singing. The structural definition characterizes the clarity of the division of song sections (verse, chorus, bridge, etc.) and the logic of the structural design (such as the reasonableness of a traditional or innovative structure).

Among the audio aesthetic indexes, the production quality mainly depends on the clarity and fidelity of the audio, dynamic range control, frequency response balance and spatial presentation; the production complexity quantifies the complexity of the audio scene mainly by the number of audio components; and the content utility comprises the reuse potential of the content and its suitability for secondary creation.

In the process of index data fusion of the fused aesthetic index, the whole music consistency, the memorability, the voice naturalness, the structural definition, the production quality, the production complexity and the content utility, the index values of the aesthetic indexes can be weighted and summed to obtain the target aesthetic index of each item of source music data.

In this embodiment, the aesthetic indexes of the source music data are evaluated by the multiple aesthetic evaluation systems respectively, so that multiple aesthetic indexes of the source music data under different aesthetic evaluation systems can be obtained, and the target aesthetic indexes corresponding to the source music data can be obtained by fusing the multiple aesthetic indexes, and the target aesthetic indexes can represent the overall aesthetic degree of the source music data, so that the comprehensiveness and accuracy of the aesthetic index evaluation of the source music data are improved.
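Steps S432 and S434 can be sketched as a two-stage fusion. The sketch assumes a simple average for the first stage and a weighted sum with equal default weights for the second; the function names, dictionary keys and weights are hypothetical, and the disclosure does not fix particular weight values.

```python
def fused_aesthetic_index(whole_music_property, content_pleasure):
    """S432: fuse the two correlated indexes (a plain average is assumed here)."""
    return (whole_music_property + content_pleasure) / 2.0


def target_aesthetic_index(index_values, weights=None):
    """S434: weighted sum of all index values; equal weights if none are given."""
    names = sorted(index_values)
    if weights is None:
        weights = {name: 1.0 / len(names) for name in names}
    return sum(weights[name] * index_values[name] for name in names)


fused = fused_aesthetic_index(4.0, 3.0)
score = target_aesthetic_index(
    {"fused": fused, "consistency": 4.0},
    weights={"fused": 0.25, "consistency": 0.75},
)
```

A song would then be kept when its target aesthetic index is greater than or equal to the preset aesthetic index.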

In an optional embodiment, before determining the target source music data from the source data set according to the feature distance information between the data set reference feature corresponding to the target reference data set and the music feature of each item of source music data, the method further includes:

The collected source data set may include monaural source music data. Channel copying may be performed on the monaural source music data by a channel copying tool or a channel copying program to obtain binaural source music data.

By processing the monaural source music data into the binaural source music data, format consistency of the source music data in the source data set can be ensured, and the problem of incompatibility of monaural devices can be avoided, thereby adapting to subsequent music data processing devices.
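Channel copying amounts to duplicating each mono sample into a left and a right channel. The following minimal sketch works on plain Python lists of samples; the function names and the frame representation (floats for mono, `(left, right)` tuples for stereo) are assumptions for illustration, not a specific audio tool's format.

```python
def mono_to_stereo(samples):
    """Duplicate each mono sample into a (left, right) frame."""
    return [(s, s) for s in samples]


def ensure_stereo(frames):
    """Return stereo frames, copying channels only when the input is mono."""
    if frames and not isinstance(frames[0], tuple):
        return mono_to_stereo(frames)
    return list(frames)


stereo_from_mono = ensure_stereo([0.1, -0.2, 0.3])
already_stereo = ensure_stereo([(0.1, 0.0), (0.2, 0.1)])
```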

The target source music data screened out by the music data processing method of this embodiment can be used to construct music data samples, and the music data samples can be used to train a music generation model. Specifically, this embodiment also provides a music generation model training method implemented based on the music data samples, where a music data sample comprises a sample music label, sample music lyrics and sample music audio. The corresponding training method comprises the following steps:

The sample music labels in this embodiment may include the style, sub-style, speed, vocal gender, duration, sampling rate, year, channel (whether surround stereo is present), audio type (rap, singing, monologue), scene, etc. of the sample music data, and the sample music label of each piece of sample music data may include one or more of these labels. When there are multiple sample music labels, the labels are separated by separators, are used to mark the characteristics of each piece of sample data, and are later input into the model in the form of hidden vectors. The sample music lyrics are the lyrics of the sample music data and may or may not be present. The sample music audio is the music audio corresponding to the sample music data. That is, in the music sample data, each piece of sample data exists as a matching set of sample music label, sample music lyrics and sample music audio.

Referring to fig. 5, a training schematic diagram of a music generation model is shown. Specifically, during training, the input of the preset music generation model may include multiple pieces of sample data, each comprising a matching set of sample music labels, sample music lyrics and sample music audio. When a piece of sample data includes multiple sample music labels, the hidden vectors of those labels may be input into the preset music generation model. When the sample music lyrics are present, they may be input; when they are absent, default information or null information may be input instead. The preset music generation model then outputs predicted music audio.

In this embodiment, training a preset music generation model through constructed music sample data to obtain a target music generation model, where the target music generation model can generate corresponding music based on a given music tag and lyrics; because the music sample data is the high-data-quality music data screened based on the music data processing method, the preset music generation model is trained based on the music sample data, and the generalization capability and the generation quality of the music generation model can be improved.
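The predict, compare and update cycle of training can be illustrated with a deliberately tiny stand-in. The sketch below is not the music generation model: a single scalar weight replaces the network, a number replaces the encoded label and lyrics condition, another number replaces the sample audio, and squared error is assumed as the loss. All names and hyperparameters are hypothetical.

```python
def train_toy_model(samples, lr=0.05, epochs=200):
    """samples: list of (condition, target) pairs; returns the learned weight."""
    weight = 0.0  # the "preset" model before training
    for _ in range(epochs):
        for condition, target in samples:
            predicted = weight * condition                  # 1) predict from the condition
            grad = 2.0 * (predicted - target) * condition   # 2) gradient of squared-error loss
            weight -= lr * grad                             # 3) update the model
    return weight


# The toy targets follow target = 2 * condition, so the weight should approach 2.
learned_weight = train_toy_model([(1.0, 2.0), (2.0, 4.0)])
```

A real model would replace the scalar with a neural network and use an automatic differentiation framework, but the three-step structure of each iteration is the same.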

Further, once the target music generation model has been trained, desired music may be generated based on the target music generation model, and the specific music generation method may include:

When a user needs to generate music, a music generation request can be initiated, and the music generation request can comprise data of the music to be generated. In one example, the data may include a music label and lyrics of the music to be generated, indicating that music with those lyrics needs to be generated. In another example, the data may include a music label but no lyrics, indicating that music without lyrics, such as pure music, needs to be generated. In yet another example, the data may include a music label but no lyrics, and the target music generation model may automatically generate music with lyrics based on the music label. Which behavior applies may be determined according to the specific implementation. For example, when the data does not include lyrics and pure music needs to be generated, the corresponding input item may be first lyrics setting information; when the data does not include lyrics and lyrics need to be generated, the corresponding input item may be second lyrics setting information. The first lyrics setting information differs from the second lyrics setting information: the first characterizes generating music without lyrics, and the second characterizes generating the lyrics.

In this embodiment, the target music generation model may support providing lyrics of music to be generated when generating music, or may support not providing lyrics of music to be generated when generating music, so as to generate music as required, thereby improving flexibility and convenience of music generation.
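The three request cases can be sketched as a small input-building helper. The placeholder strings standing in for the first and second lyrics setting information, the function name and the dictionary layout are all hypothetical; the disclosure does not specify how these inputs are encoded.

```python
NO_LYRICS = "<no_lyrics>"      # hypothetical "first lyrics setting information"
AUTO_LYRICS = "<auto_lyrics>"  # hypothetical "second lyrics setting information"


def build_generation_input(tags, lyrics=None, want_vocals=True):
    """Assemble the model input from music labels and optional lyrics."""
    tag_text = ",".join(tags)          # labels joined by a separator
    if lyrics:
        lyrics_field = lyrics          # case 1: the user supplied lyrics
    elif want_vocals:
        lyrics_field = AUTO_LYRICS     # case 3: the model should write lyrics
    else:
        lyrics_field = NO_LYRICS       # case 2: pure music, no lyrics
    return {"tags": tag_text, "lyrics": lyrics_field}


with_lyrics = build_generation_input(["pop", "fast"], lyrics="la la la")
instrumental = build_generation_input(["ambient"], want_vocals=False)
auto_written = build_generation_input(["rock", "male"])
```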

Referring now to FIG. 6, a schematic diagram of a music data processing system is shown, which may include modules for data collection, data pre-cleaning, audio scoring, audio screening, audio description, data evaluation, etc., in which:

1. Data collection

Music datasets are constructed and to ensure diversity of data, the data may include a variety of sources, styles, languages, ages, etc.

2. Data pre-cleaning

i. Duration cleaning: audio that is too short (for example, less than 10 seconds) is removed, and the remaining audio is labeled with a duration label at 1-minute intervals.

ii. Channel cleaning: mono audio is converted to two channels by channel copying, so that the default input is two-channel. For surround stereo songs, if the corresponding tag is found during data processing, it is used as an additional channel tag.

iii. Content cleaning: the content of the audio is identified using a music source separation model, a voice detection model and a voice category detection model. Specifically, all audio is processed by the music source separation model to obtain a vocal track and an accompaniment track. The vocal track is then input into the voice detection model to detect whether a human voice is present, and into the voice category detection model to detect the category, distinguishing whether the vocal track belongs to speech, sound effects, vocal music, background music, and so on; within the vocal music category, it is further determined whether the vocals are rap, singing or monologue.

iv. Lyrics alignment: text recognition is performed on the two types of audio, vocal music and speech, using an automatic speech recognition model, and matching pairs of lyrics and audio are selected and constructed.

v. Sound quality labels: sound quality labels are constructed for the data set. The main flow is as follows:

(1) Construct studio-recorded high-definition data sets for each style, where the data set of each style contains M songs;

(2) Calculate an audio-text embedding vector for each song;

(3) For each data set, calculate the mean and variance from the embedding vectors of all songs in the data set;

(4) For a piece of music to be evaluated, calculate the vector distance between its embedding vector and the embedding distribution of the data set of the same style. The vector distance measures how far the features of the song deviate from the reference distribution; the larger the vector distance, the more abnormal the sound quality;

(5) Finally, judge the sound quality of the song according to the distance; for example, if the distance value is greater than 0.5, the sound quality is judged to be low, and otherwise it is judged to be high.
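The duration cleaning of item i and the final decision of step (5) can be sketched as two small helpers. The 10-second minimum and the 0.5 distance threshold are the example values from the text; the function names and the label format such as `1-2min` are hypothetical.

```python
MIN_DURATION_S = 10.0     # example minimum duration from the text
DISTANCE_THRESHOLD = 0.5  # example distance threshold from the text


def duration_label(duration_s):
    """Return None for audio to be removed, else a 1-minute bucket label."""
    if duration_s < MIN_DURATION_S:
        return None
    minute = int(duration_s // 60)
    return f"{minute}-{minute + 1}min"   # hypothetical label format


def sound_quality_label(distance):
    """Step (5): a larger distance from the reference distribution means lower quality."""
    return "low" if distance > DISTANCE_THRESHOLD else "high"


labels = [duration_label(d) for d in (5.0, 10.0, 75.0, 185.0)]
qualities = [sound_quality_label(d) for d in (0.2, 0.5, 0.8)]
```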

3. Audio scoring

A total of 8 indexes are chosen to evaluate the audio. The first 4 indexes come from SongEval, a work that constructed a data set scored by music college teachers and professional musicians and trained a regression model to obtain the corresponding indexes; these 4 indexes are whole music consistency, memorability, voice naturalness and structural definition. The middle 3 indexes come from Audiobox Aesthetics and are production quality, production complexity and content utility; they are obtained from a model that Meta trained on internal audio data. In Audiobox Aesthetics, the production quality mainly depends on the clarity and fidelity of the audio, dynamic range control, frequency response balance and spatial presentation; the production complexity quantifies the complexity of the audio scene mainly by the number of audio components; the content pleasure degree mainly depends on emotional impact, artistic skill presentation, depth of artistic expression and comprehensive subjective experience; and the content utility comprises the reuse potential of the content and its suitability for secondary creation. The last index combines the whole music property, the 5th index of SongEval, with the content pleasure degree, the 4th index of Audiobox Aesthetics. Experiments found that the whole music property and the content pleasure degree are linearly correlated, so their average can be calculated as the music pleasure index of the song.
In conclusion, 8 indexes are selected: whole music consistency, memorability, voice naturalness, structural definition, production quality, production complexity, content utility and music pleasure. The value range of every index is set to 0 to 5 points.

4. Audio screening

In the audio screening step, the average of the 8 index scores from the audio scoring step is calculated for each song. Songs whose average score is below a preset score (e.g., less than 3 points) are filtered out, and the remaining songs proceed to the subsequent steps.
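The scoring and screening steps can be sketched together. The eight index names follow the list in the audio scoring section and the 3-point cut-off is the example value from the text; the dictionary keys and function names are hypothetical.

```python
from statistics import mean

INDEX_NAMES = (
    "whole_music_consistency", "memorability", "voice_naturalness",
    "structural_definition", "production_quality", "production_complexity",
    "content_utility", "music_pleasure",
)


def music_pleasure(whole_music_property, content_pleasure):
    """Eighth index: average of the two linearly correlated scores."""
    return (whole_music_property + content_pleasure) / 2.0


def passes_screening(scores, min_average=3.0):
    """Keep a song when the mean of its 8 index scores reaches the preset score."""
    return mean(scores[name] for name in INDEX_NAMES) >= min_average


good_song = {name: 4.0 for name in INDEX_NAMES}
good_song["music_pleasure"] = music_pleasure(4.5, 3.5)
weak_song = {name: 2.5 for name in INDEX_NAMES}
kept_titles = [title for title, scores in (("good", good_song), ("weak", weak_song))
               if passes_screening(scores)]
```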

5. Audio description

The audio description is designed as a multi-tag structure. The tag categories include style, sub-style, speed, vocal gender, duration, sampling rate, year, channel (whether surround stereo is present), audio type (rap, singing, monologue) and scene (the scenario for which the song is suited). The tags are separated by commas and mark the characteristics of a song. They are later input into the model in the form of hidden vectors as conditions for the model.

6. Data evaluation

The data evaluation step relies mainly on sampling inspection. Each piece of data constructed through the above steps is a matching set containing an audio description, lyrics (if any) and audio. N pieces of data (e.g., 1000) are randomly sampled to check whether the three parts match correctly. If the accuracy for a certain tag is lower than a preset accuracy (e.g., 90%), the labeling of that tag is discarded and the model is retrained or the relevant tag is re-crawled; only data that reaches the preset accuracy can enter the database.
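The sampling inspection can be sketched as follows. The sketch assumes each record carries a dictionary of tags and that a reviewer's verdict is available through a callback; the record layout, the `is_correct` callback and the function names are hypothetical, while the sample size of 1000 and the 90% accuracy are the example values from the text.

```python
import random


def tag_accuracy_by_sampling(records, is_correct, sample_size=1000, seed=0):
    """Randomly sample records and compute the per-tag accuracy."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    totals, correct = {}, {}
    for record in sample:
        for tag_name in record["tags"]:
            totals[tag_name] = totals.get(tag_name, 0) + 1
            if is_correct(record, tag_name):   # reviewer verdict (assumed callback)
                correct[tag_name] = correct.get(tag_name, 0) + 1
    return {name: correct.get(name, 0) / totals[name] for name in totals}


def tags_to_rework(accuracy, min_accuracy=0.9):
    """Tags below the preset accuracy need relabeling before data enters the database."""
    return sorted(name for name, acc in accuracy.items() if acc < min_accuracy)


# Toy data: the "speed" tag is wrong on half of the records.
toy_records = [
    {"tags": {"style": "pop", "speed": "fast"},
     "verified": {"style": True, "speed": i < 5}}
    for i in range(10)
]
accuracy = tag_accuracy_by_sampling(
    toy_records, lambda rec, tag: rec["verified"][tag], sample_size=1000)
rework = tags_to_rework(accuracy)
```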

Through the music data processing method, this embodiment can achieve the following effects:

Improved data quality: cleaning the data removes noise and erroneous information and ensures that the music clips in the data set are more consistent and accurate.

Improved model performance: cleaned data generally represents the target music style or type better, which improves the performance of the generation model on a specific task, and the model can capture the features and regularities in the music more effectively.

Reduced overfitting risk: removing redundant and duplicate data during cleaning reduces overfitting during training, so that the model generalizes better to new data.

Faster training: cleaning can reduce the size of the training set while preserving its representativeness, which reduces the amount of computation during training and speeds up training.

Optimized data structure: preprocessing can standardize the audio format, resolution and duration, making the data set easier to use and manage and improving the usability of the data.

It should be noted that any of the methods described in this embodiment may be combined based on actual implementation conditions, and have corresponding beneficial effects, which are not described herein.

Fig. 7 is a block diagram of a music data processing apparatus according to an exemplary embodiment. Referring to fig. 7, the apparatus includes:

A reference feature obtaining unit 710 configured to perform obtaining a dataset reference feature corresponding to each of a plurality of reference datasets, each reference dataset including a plurality of pieces of reference music data of a same music style, different reference datasets corresponding to different music styles, the dataset reference feature corresponding to each reference dataset being obtained by fusion processing based on the music feature of the plurality of pieces of reference music data in each reference dataset, the plurality of pieces of reference music data in each reference dataset satisfying a target sound quality condition;

A reference data set determination unit 720 configured to perform determination of a target reference data set corresponding to a music style of each item of source music data from among the plurality of reference data sets based on the music style of each item of source music data in the source data sets;

and a sample construction unit 730 configured to perform determining target source music data from the source data set based on the feature distance information between the data set reference feature corresponding to the target reference data set and the music feature of each item of source music data, the target source music data having the same or similar sound quality as any one of the reference music data in the target reference data set, the target source music data being used for constructing a music data sample.

the reference feature acquisition unit is configured to perform:

The sample construction unit is configured to perform:

The target aesthetic index determination unit is configured to perform:

Fig. 8 is a block diagram of a music generation model training apparatus according to an exemplary embodiment, which is implemented based on music sample data obtained based on the above-described music data processing method, the music data sample including a sample music tag, sample music lyrics, and sample music audio. Referring to fig. 8, the apparatus includes:

An audio prediction unit 810 configured to perform inputting the sample music tag and the sample music lyrics into a preset music generation model, resulting in predicted music audio;

A loss information determining unit 820 configured to perform determination of loss information based on the sample music audio and the predicted music audio;

and a model updating unit 830 configured to perform updating of the preset music generation model based on the loss information, so as to obtain a target music generation model.

The specific manner in which the modules of the apparatus in the above embodiments perform their operations has been described in detail in the method embodiments and will not be described again here.

In an exemplary embodiment, a computer readable storage medium comprising instructions is also provided. When the instructions are executed by a processor of an electronic device, the electronic device is enabled to perform any one of the methods described above. Optionally, the computer readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.

In an exemplary embodiment, a computer program product is also provided. The computer program product comprises a computer program stored in a readable storage medium; at least one processor of a computer device reads the computer program from the readable storage medium and executes it, causing the computer device to perform any one of the methods described above.

Fig. 9 is a block diagram of an electronic device for music data processing or music generation model training according to an exemplary embodiment. The electronic device may be a terminal, and its internal structure may be as shown in fig. 9. The electronic device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the electronic device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a music data processing method or a music generation model training method. The display screen of the electronic device may be a liquid crystal display screen or an electronic ink display screen. The input device of the electronic device may be a touch layer covering the display screen; a key, a trackball, or a touchpad arranged on the housing of the electronic device; or an external keyboard, touchpad, or mouse.

Fig. 10 is a block diagram of an electronic device for music data processing or music generation model training according to an exemplary embodiment. The electronic device may be a server, and its internal structure may be as shown in fig. 10. The electronic device includes a processor, a memory, and a network interface connected by a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the electronic device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a music data processing method or a music generation model training method.

It will be appreciated by those skilled in the art that the structures shown in fig. 9 and 10 are merely block diagrams of portions of structures related to the disclosed aspects and do not constitute limitations of the electronic device to which the disclosed aspects are applied, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A music data processing method, characterized by comprising:

2. The method of claim 1, wherein each item of reference music data includes a reference audio and a reference text corresponding to the reference audio;

the method further comprises the steps of:

3. The method of claim 1 or 2, wherein the dataset reference features for each reference dataset include an average music feature and an inverse matrix for the covariance feature matrix;

4. A method according to claim 3, wherein before determining the target source music data from the source data set based on the data set reference feature corresponding to the target reference data set and the feature distance information of the music feature of each item of source music data, the method further comprises:

5. The method according to claim 1, wherein the method further comprises:

6. The method of claim 5, wherein the musical aesthetic indicators further include music uniformity, memorability, voice naturalness, structural clarity, and the audio aesthetic indicators further include quality of manufacture, complexity of manufacture, and content utility;

7. The method according to claim 1, wherein before determining the target source music data from the source data set based on the data set reference feature corresponding to the target reference data set and the feature distance information of the music feature of each item of source music data, the method further comprises:

8. A music generation model training method, characterized in that the training method is implemented based on music sample data, the music sample data is obtained based on the music data processing method according to any one of claims 1 to 7, the music data sample comprises a sample music tag, sample music lyrics and sample music audio, the training method comprises:

9. The method of claim 8, wherein the method further comprises:

10. A music data processing apparatus, characterized by comprising:

11. A music generation model training device, characterized in that the training device is realized based on music sample data, the music sample data is obtained based on the music data processing method according to any one of claims 1 to 7, the music data sample comprises a sample music tag, sample music lyrics and sample music audio, the training device comprises:

12. An electronic device, comprising:

A processor;

a memory for storing the processor-executable instructions;

Wherein the processor is configured to execute the instructions to implement the music data processing method of any one of claims 1 to 7, or the music generation model training method of any one of claims 8 to 9.

13. A computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the music data processing method of any one of claims 1 to 7, or the music generation model training method of any one of claims 8 to 9.

14. A computer program product, characterized in that the computer program product comprises a computer program, which is stored in a readable storage medium, from which at least one processor of a computer device reads and executes the computer program, such that the device performs the music data processing method according to any one of claims 1 to 7, or the music generation model training method according to any one of claims 8 to 9.